Using SSE to improve performance

I have this function that I need to speed up, and I think that I can use SSE to do this. Unfortunately (for me), I have no experience with SSE whatsoever, other than what I have been able to read in the last day or so.

The machine that I am working on is an IA32 machine with SSE3. Can somebody see where I might use SSE to speed this function up? Thanks.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
double
applyFilter(struct Filter *filter, cs1300bmp *input, cs1300bmp *output)
{

  long long cycStart, cycStop;

  cycStart = rdtscll();

  output -> width = input -> width;
  output -> height = input -> height;

  long width = input -> width - 1;
  long height = input -> height - 1;
  long size = filter -> getSize();
  int divisor = filter -> getDivisor() + 1;

  for(int plane = 0; plane < 3; plane++) {
    for(int row = 1; row < height; row++) {
      for(int col = 1; col < width; col++) {
	int val0 = 0;
	int val1 = 0;
        int val2 = 0;
        int val3 = 0;
	int iFilter = 0;

	for (int i = 0; i < size; i++) {
	  for (int j = 0; j < size; j+=4) {
	    iFilter = myFilter [i][j];
	    val0 = val0 + input -> color[plane][row + i - 1][col + j - 1];
	    val1 = val1 + input -> color[plane][row + i - 1][col + j];
	    val2 = val2 + input -> color[plane][row + i - 1][col + j + 1];
	    val3 = val3 + input -> color[plane][row + i - 1][col + j + 2];
	  }
	}

        int value = (val0 + val1 + val2 + val3) / divisor;
       
        if ( value < 0 ) { value = 0; }
        if ( value  > 255 ) { value = 255; }
        output -> color[plane][row][col] = value;
      }
    }
  }

  cycStop = rdtscll();
  double diff = cycStop - cycStart;
  double diffPerPixel = diff / (output -> width * output -> height);
  fprintf(stderr, "Took %f cycles to process, or %f cycles per pixel\n",
	  diff, diff / (output -> width * output -> height));
  return diffPerPixel;
}
Hi, don't know about SSE. But you are doing the same operations (with the same values) several times
Maybe you could precompute that.
1
2
3
4
5
6
7
8
9
	for (int i = 0; i < size; i++) {
	  for (int j = 0; j < size; j+=4) {
	    //iFilter = myFilter [i][j]; //¿?
	    val0 = val0 + input -> color[plane][row + i - 1][col + j - 1];
	    val1 = val1 + input -> color[plane][row + i - 1][col + j];
	    val2 = val2 + input -> color[plane][row + i - 1][col + j + 1];
	    val3 = val3 + input -> color[plane][row + i - 1][col + j + 2];
	  }
	}
The rectangles overlap.
And note that when you move to the next column v0' = v1; v1' = v2; v2' = v3 (a queue)

Also, I think that you are accessing out of bounds.
Last edited on
Using SSE for a general case for all values of size might be tricky. However, what you can do is to have a separate function for e.g. size=4, which can be vectorized nicely using SSE2 (provided size=4 is a common case, multiples of 4 also work).
In fact, gcc 4.6 does this automatically at -O3 and achieves a speedup of factor 4 compared to the non-SSE variant (at -O2) and a speedup of 6.3x to the general case at -O3, when it isn't known at compile-time that size=4.
SSE2 is fine. I have no understanding yet of the differences between the two. I did compile this using an -O3 flag, and saw modest performance improvements. I'd like to incorporate SSE explicitly in my code, if I can without too steep of a learning curve.

Can you suggest how I might change my code above? Thanks for the SSE2 suggestion.
I'd like to incorporate SSE explicitly in my code, if I can without too steep of a learning curve.

The only hard part is to select the right instructions to get what you want.
See here for an overview of available instructions: http://msdn.microsoft.com/en-us/library/y0dh78ez.aspx

A naive approach could look like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
#include <emmintrin.h>
[...]
__m128i zero=_mm_set1_epi8(0);
__m128i sum=zero;
[...]
  //(inside the loop):
  __m128i vals8=_mm_loadu_si128(reinterpret_cast<__m128i*>(&input -> color[plane][row + i - 1][col + j - 1]));
  __m128i vals16=_mm_unpacklo_epi8(vals8,zero);
  __m128i vals32=_mm_unpacklo_epi16(vals16,zero);
  sum=_mm_add_epi32(sum,vals32);
[...]
int sums[4];
_mm_storeu_si128(reinterpret_cast<__m128i*>(sums),sum);
int value = (sums[0] + sums[1] + sums[2] + sums[3]) / divisor;


This doesn't improve performance much and makes the out-of-bounds accesses worse by reading 16 bytes at a time (when only 4 are needed) and discarding the upper 12 bytes in the following two unpack instructions.

Like I mentioned, serious improvements can be expected when having different subfunctions with a constant, specific filter size (read: a template function with the size as the template parameter). Judging from the inner loop, size appears to be a multiple of 4, so this might be a viable solution.
Thanks for all your help. I really appreciate it!
Topic archived. No new replies allowed.