While trying to do some micro-optimization of my code, I came across the memcpy function. Since part of my code involves copying "large" arrays of ints and none of the copies could be combined into a single loop, I thought memcpy would provide some speed gain. Oddly, when comparing memcpy against a for loop over 10 million (10kk) ints, the for loop came out faster.
This is the testcode: (a, b are int[icount], a is filled with random integers)
for (int j = 0; j < 50; ++j) {
    memcpy(b, a, icount * sizeof(int));
    // for (int i = 0; i < icount; ++i) b[i] = a[i];
}
Results: memcpy 0.67sec vs for 0.59sec.
I was expecting them to be very close to each other [in release mode], at the most being equal due to the compiler realizing that it could replace the for copy by the same statements as memcpy. I definitely did not expect the for to be faster.
So why is this? Am I using memcpy wrong, or is there something else I'm missing?
Even then, why would it be faster than memcpy?
Unrolling removes or reduces the overhead of the for loop, reducing the computation to "take 4 bytes from point A and put them at point B" done n times, which should still be slower than "take 4*n bytes from point A and put them at point B" done once.
For regular int arrays, the std::copy method seems a bit too generic/fancy. I'll test it as well just to see.
The outer j loop is just to repeat the test 50 times for less randomness and more precision.
[edit]
std::copy has the same runtime as memcpy (0.67sec).
Another oddity: I heard a "common" optimization trick for loops was to reverse the iteration order (testing > 0 being faster than < const). Reorganizing the for loop to this: for (i = icount; --i > 0;) b[i] = a[i];
leads to a runtime of... (drum roll): 0.67sec. The increment-for still clocks at 0.56~0.59sec.
Durations in milliseconds for 230000000 bytes of data:
memcpy() 250
for loop 109
std::copy() 125
Press any key to continue . . .
The values change every time the program is run but they're consistent: for() is fastest.
So it seems std::copy() in MSVC is better optimized?
As for memcpy(), I think I understand why it's slower: it is type-agnostic.
It receives two void pointers (meaning "I don't care what type they are") and the number of bytes it has to copy.
Therefore the only straightforward implementation would be a loop that copies the data byte by byte.
As for std::copy() I think there's a small overhead due to it being a template... although not in your case, it seems?
My original testing code only used one method per test, so no "in-runtime" things can influence the other test.
I don't see why being type-agnostic would make it slower. As I see it, the instructions boil down to "get memory block [start, end] and copy it." A for loop can make no such assumptions; it can't even be sure the blocks are contiguous (e.g. I could tell it to copy only every second item). In terms of overhead, 1*n should be much better than n*1, no?
Catfish, yes, it's a cache issue. Going through the arrays once before the tests should give more reliable results. With that done, the for loop and std::copy look about equally fast. memcpy is faster, but remember that memcpy gives no guarantees if the ranges overlap; that freedom could be the reason memcpy can be much faster.
Would it not depend on the algorithm used in memcpy? A naive version may just copy a byte at a time, irrespective of the size of the memory bus. Maybe it is optimized in a way that is not as optimal as your for loop for that specific task. Maybe there is more code in there to make it more secure? Knowing Microsoft, the actual function is hidden behind several defines and the one you get will depend on your settings.
If you are copying lots of data in one go, use memcpy. If you are copying small amounts of data, use for loops. It takes time for the program to set up the memcpy call, whereas a for loop compiles directly into inline native code.
In my experience, memcpy() is occasionally the slowest solution because it's a precompiled library function, a black box: the C++ compiler has no access to its source code and cannot use context-dependent optimizations. On the other hand, std::copy() and its more cumbersome equivalent, the for loop, are presented to the compiler as source code, and the compiler is free to rearrange them any way it feels like.
In addition, at least on my Linux system, memcpy() is compiled to be backwards compatible with older CPU models, while the C++ compiler compiles std::copy() and for loops using SSE instructions.
When in doubt, look at the assembly language output.
memcpy() is often the slowest solution because it's a precompiled library function, a black box: the C++ compiler has no access to its source code and cannot use context-dependent optimizations.
This argument doesn't really hold up to scrutiny. For example, it doesn't explain why the default copy constructor for PODs often calls memcpy(). If the compiler can really generate faster code than memcpy(), that would be an ideal place to do it.
When in doubt, look at the assembly language output.
My compiler calls memmove() for std::copy() with arrays of chars, and generates a naive loop with a couple MOVS instructions for std::copy() with arrays of this:
struct POD {
    char a[10];
};
The latter is over ten times slower than memcpy(). Turning on SIMD instruction-set optimizations doesn't change anything.
No, of course I took something as obvious as sizes into consideration.
However, I made a mistake in the benchmark that made the compiler remove a whole chunk of code, which is why the array of PODs was so much faster. I'll have to trick it into behaving with I/O.
EDIT: Wait, no. I forgot to multiply by sizeof(T). memcpy() is still nearly twice as fast, though.