I have this function that I need to speed up, and I think that I can use SSE to do this. Unfortunately (for me), I have no experience with SSE whatsoever, other than what I have been able to read in the last day or so.
The machine that I am working on is an IA32 machine with SSE3. Can somebody see where I might use SSE to speed this function up? Thanks.
Using SSE for the general case, with arbitrary values of size, might be tricky. What you can do, however, is write a separate function for a fixed size such as size=4, which vectorizes nicely with SSE2 (provided size=4 is a common case; multiples of 4 also work).
In fact, gcc 4.6 does this automatically at -O3. It achieves a 4x speedup over the non-SSE variant (at -O2) and a 6.3x speedup over the general case at -O3, where it isn't known at compile time that size=4.
SSE2 is fine. I have no understanding yet of the differences between the two. I did compile this using an -O3 flag, and saw modest performance improvements. I'd like to incorporate SSE explicitly in my code, if I can without too steep of a learning curve.
Can you suggest how I might change my code above? Thanks for the SSE2 suggestion.
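#include <emmintrin.h>  // SSE2 intrinsics
[...]
__m128i zero = _mm_set1_epi8(0);
__m128i sum = zero;  // four 32-bit partial sums
[...]
// (inside the loop):
// Unaligned 16-byte load; only the lowest 4 bytes are used below.
__m128i vals8 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&input->color[plane][row + i - 1][col + j - 1]));
// Zero-extend the low bytes: 8 -> 16 bits, then 16 -> 32 bits.
__m128i vals16 = _mm_unpacklo_epi8(vals8, zero);
__m128i vals32 = _mm_unpacklo_epi16(vals16, zero);
sum = _mm_add_epi32(sum, vals32);
[...]
// Add the four 32-bit lanes together.
int sums[4];
_mm_storeu_si128(reinterpret_cast<__m128i*>(sums), sum);
int value = (sums[0] + sums[1] + sums[2] + sums[3]) / divisor;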
This doesn't improve performance much. It also makes the out-of-bounds accesses worse: each load reads 16 bytes when only 4 are needed, and the upper 12 bytes are discarded by the two unpack instructions that follow.
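If the over-read is a concern, one possible variation is to load only the 4 bytes that are used. This is a rough sketch, not a tuned implementation: it assumes the pixels are 8-bit unsigned values (which the unpack steps imply), and the helper names add_four_pixels and horizontal_sum are made up for illustration.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstring>      // std::memcpy

// Hypothetical helper: widens four adjacent 8-bit pixels to 32 bits and adds
// them lane-wise to `sum`. Reads exactly 4 bytes starting at `p`.
static inline __m128i add_four_pixels(__m128i sum, const unsigned char* p)
{
    int packed;
    std::memcpy(&packed, p, sizeof packed);  // 4-byte load without alignment or aliasing issues

    const __m128i zero   = _mm_setzero_si128();
    const __m128i vals8  = _mm_cvtsi32_si128(packed);         // 4 bytes in the low 32 bits, rest zero
    const __m128i vals16 = _mm_unpacklo_epi8(vals8, zero);    // 8 -> 16 bits
    const __m128i vals32 = _mm_unpacklo_epi16(vals16, zero);  // 16 -> 32 bits
    return _mm_add_epi32(sum, vals32);
}

// Hypothetical helper: adds the four 32-bit lanes of `v` into one int.
static inline int horizontal_sum(__m128i v)
{
    int lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Inside the loop this would replace the load and the two unpacks, e.g. sum = add_four_pixels(sum, &input->color[plane][row + i - 1][col + j - 1]), with horizontal_sum(sum) / divisor at the end. It only removes the over-read; I wouldn't expect it to change the performance picture by itself.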
As I mentioned, serious improvements can be expected from separate subfunctions, each with a constant, specific filter size (read: a template function with the size as the template parameter). Judging from the inner loop, size appears to be a multiple of 4, so this might be a viable solution.
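To make the template idea concrete, here is a minimal sketch. Since the original function isn't reproduced here, the Plane struct, window_sum and window_sum_dispatch are all hypothetical stand-ins, and the loop only sums the window (no filter weights, and the window starts at (row, col) rather than being offset by one). The point is just that Size is a compile-time constant, so the compiler sees fixed trip counts and is free to unroll and vectorize.

```cpp
#include <cstddef>

// Hypothetical stand-in for one colour plane: 8-bit samples, row-major.
struct Plane {
    const unsigned char* data;
    std::size_t stride;  // bytes per row
};

// Sums the Size x Size window whose top-left corner is (row, col).
// Size is a template parameter, so both loops have constant trip counts.
template <int Size>
int window_sum(const Plane& p, std::size_t row, std::size_t col)
{
    int sum = 0;
    for (int i = 0; i < Size; ++i) {
        const unsigned char* line = p.data + (row + i) * p.stride + col;
        for (int j = 0; j < Size; ++j)
            sum += line[j];
    }
    return sum;
}

// Run-time dispatch: common sizes get their own instantiation,
// anything else falls back to the generic loop.
int window_sum_dispatch(const Plane& p, std::size_t row, std::size_t col, int size)
{
    switch (size) {
    case 4: return window_sum<4>(p, row, col);
    case 8: return window_sum<8>(p, row, col);
    default: {
        int sum = 0;
        for (int i = 0; i < size; ++i)
            for (int j = 0; j < size; ++j)
                sum += p.data[(row + i) * p.stride + col + j];
        return sum;
    }
    }
}
```

Which sizes deserve their own case depends on what actually occurs in your data; 4 and 8 here are only examples. Compile with -O3 and check the generated assembly to see whether the fixed-size instantiations really get vectorized on your compiler.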