I have been looking at some codebases that use SIMD intrinsics and other techniques to get better performance. It got me thinking recently... There's a codebase I interact with often that has lots of `for` loops like the one in the subject, just to traverse a vector/array and do things. These iterations are independent; it doesn't matter whether i=n runs before i=n+1. Is there a quick refactor I can apply to these cases that will give me better performance? Or are GCC/Clang/MSVC already doing optimizations in this case?
Range-based for loops could be faster for some code. It depends on the code!
These compilers already do a lot -- they unroll small loops, keep the loop variable in a register, hoist the n+1 computation (it happens once, not every iteration), and the branch predictor will assume the loop keeps going, so you only pay the misprediction cost once (when the loop ends).
For MSVS, under Properties/C++/Code Generation there's an Enable Enhanced Instruction Set option that lets you target SIMD extensions (SSE2/AVX/AVX2) if the processor supports them. Also Enable Parallel Code Generation.
Also for MSVS, make sure that under Properties/C++/Optimization you have Optimization set to Maximize Speed (/O2), Favor Size Or Speed set to Favor Fast Code (/Ot), and Enable Intrinsic Functions turned on.
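For anyone building outside the IDE, these are (to my knowledge) the command-line equivalents of those project settings, plus the rough GCC/Clang counterparts:

```shell
# MSVC: optimize for speed, intrinsics, parallel codegen, AVX2 SIMD
cl /O2 /Ot /Oi /Qpar /arch:AVX2 main.cpp

# GCC/Clang: aggressive optimization + all SIMD extensions of the build machine
g++ -O3 -march=native main.cpp
```

Note that /arch:AVX2 and -march=native produce binaries that won't run on older CPUs, so only use them if you control where the code is deployed.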
Parallel algorithms in the standard library or libraries like TBB incur significant overhead, so blindly replacing all your loops with parallel versions of std::for_each is not guaranteed to improve performance. Make sure to measure.
It's been shown in articles on C++ Stories that simply replacing std::algorithm calls with their parallel versions (from C++17) can actually degrade performance, due to the overhead of spawning the worker threads, synchronization, etc. They are not a silver bullet for performance issues.