Perhaps another statement from that chapter of C++ Coding Standards:
"It is far, far easier to make a correct program fast than it is to make a fast program correct."
In some systems/applications speed is paramount, but those are relatively few in number compared to those
in which speed is not important. And even in those applications where speed is required, not every algorithm need
be as fast as possible. It makes very little difference what sorting algorithm you use to sort an array of 10 elements
if you only do it once.
However, correctness, simplicity, and clarity are always a priority, because they are relevant to all programs.
Spending extra brain cells trying to optimize an algorithm that would run perfectly fine unoptimized is just a
waste of time--time that could have been spent solving other things that are actually problems.
Regarding the actual algorithms provided, +1 moorecm. Because std::fill_n is a template, it can be specialized
for particular data types to provide optimized implementations.
@rollie,
Are jumps really that slow? Anyway, this isn't really the same function, since it won't work when n % 4 != 0. Also, line 2 is probably not what you were thinking about.
I think the fastest possible way to do this would be
@Bazzy,
I disagree. I just think you should profile your code before micro-optimizing it, because you're more likely to slow it down if you don't check where your bottlenecks really are first.
@hamsterman,
I think inline asm is to be avoided when possible. If you need to use asm in a program (e.g. you're writing a bootloader or operating system) you can put it in a separate function, put a forward declaration in a header file, and then call it separately.
Example:
main.c
#include <stdio.h>

extern int multiply(int a, int b);

int main()
{
    printf("2 * 2 = %d", multiply(2, 2));
    return 0;
}
@hamsterman
The loop instruction is very slow, so you'll find this method is not very fast in comparison.
One should use the most obvious method - this is either fill_n or #1. The simpler the code, the higher the chance the compiler can transform it to something more adequate. For SSE targets, something "more adequate" is an SSE-optimized loop, which can be assumed to be equivalent to the memset implementation on such a target.
Like Wolf said, the second one doesn't offset the array every time so it is technically faster...but the difference here is soooooooooooo minimal anyway it'd be stupid to say it's 'omg micro-optimized code'.
There is no difference - at least on x86 targets, since they support addressing such as [eax+ebx*4+...].
It is quite surprising that ROmai's code turned out to be the slowest one.
I think that GCC performs this optimization by recognizing some common patterns in the code - the first two loops are very typical ones. The last time I checked, compilers also had some problems optimizing the loop with pointers - at least gcc 4.2 could not unroll it and use SSE. Things have improved a lot since then, but the reverse while loop still needs more care.
What do you mean? I don't think SSE has something for loops (I don't know much about it though). Or do you mean unrolling?
One should use the most obvious method
I absolutely agree. My point was that if someone is feeling paranoid about optimisations (and I'm not saying one should be), it would be logical to use assembly (for such trivial tasks at least).
What do you mean? I don't think SSE has something for loops (I don't know much about it though). Or do you mean unrolling?
movaps %xmm0, (%edi,%eax)
Loop unrolling + SSE makes it possible to process more than one item per CPU cycle. This is called loop vectorization. To give the compiler enough room, you should always write the most obvious code possible.
It is a bad idea to write poor code in the hope that the compiler's optimizer will fix it.
My answer is that the second code block is not slower than the first in most cases.
When I use optimization it may be size optimization, and then the compiler may not
try to optimize these code blocks for speed.
Look at the *array++ = 0; construct:
don't forget about different processor architectures - e.g. AVR has a "Store Indirect and Post-Increment" instruction, and the compiler will use it in this case. array[i] = 0; will add the value of i on every iteration.
Among the loop forms, the do-while loop is faster because there is no need to test the condition before the loop body.
As for hand-written optimization, it depends on the skills of the programmer and the particular case.
This code is simpler:
void zeroArray(int* array, int n) {
while (n--)
*array++ = 0;
}
@xorebxebx,
Thanks. It would probably be good for me to learn some SSE stuff..
@boolivar,
Something to learn from this thread is that compilers are smarter than you think.
Using SIMD instruction sets can provide a massive optimization; of course, there's no remedy for code that's bad anyway. As an example, I know a 2D game that uses SSE3 and a bunch of different optimization flags on the GCC command-line, and it runs at about 50 FPS on my computer. But as soon as you start drawing particles, the game slows down massively (it goes from ~50 FPS to 27 FPS) because the code for the particle simulator is horrible. Nested loops everywhere, it's illegible.
I was going to fix it, but I can't be bothered to fix all the code. The original author put it all in one file, too, so as soon as you start moving functions into more appropriate places you break all kinds of dependencies. I wanted to make it somewhat modular and structured (the way I usually write programs is that there is one file - main.c - containing the main() function, and then a bunch of separate, isolated modules that don't even know about each other, and the main() function just uses those as it needs to); the MINIX kernel is an excellent example of a very well structured program. But the (game) code is full of horrible, circular dependencies. I'd rewrite it from scratch, but I don't know that I'd be able to do the particle physics.
See for yourself. When using loop instead of sub-jne, each loop iteration takes 4 instead of 2 CPU cycles.
Although just the fact that modern compilers don't use the loop instruction at all (not even when optimizing for size) should tip you off that loop is bad news.
hamsterman wrote:
My point was that if someone is feeling paranoid about optimisations (and I'm not saying one should be), it would be logical to use assembly (for such trivial tasks at least).
That wouldn't be a good idea, because hand-written assembler code is usually slower than what the compiler generates.
The compiler has detailed knowledge about the CPU that it is targeting - it knows how many cycles it takes for a particular instruction to complete and it can reorder instructions so that there are as few pipeline stalls as possible. An average programmer doesn't have the necessary knowledge to produce assembler code of similar quality.
This only shows that gcc is a poor compiler for AVR, or that you compiled without the proper optimization options. Besides, they don't spend as much time writing optimizations for AVR as they do for Intel (and despite that, I've heard many times that they still lose to the Intel compiler team).