I'm trying to optimize a 2-D cross-corellation code I am writing. It inputs a 64 x 64 vector of ints and a 96 x 96 vector of ints and performs a direct cross-correlation. I already have switched to using references to prevent the program from copying over this extra data into the function. However, due to the number of loops in the program, this can be a tedious calculation. I'd love any suggestions anyone might have on speeding things up - I'm really new to C++, and this is one of my first programs I'm developing. I'm also using vectors instead of multi-dimensional arrays after reading some things online saying they are superior. Is this a valid statement? (I'm wondering if a STL container is as good for heavy computation like this).
int CoreData::corr_dcc_p(
const vector< vector<int> >& IA,
const vector< vector<int> >& IB,
vector< vector<int> >& map,
int wind_ysize,
int wind_xsize,
int templ_ysize,
int templ_xsize)
{
int i, j;
int x, y;
int xshift = templ_xsize-wind_xsize;
int yshift = templ_ysize-wind_ysize;
for (y = 0; y < yshift; y++) {
for (x = 0; x < xshift; x++) {
for (i = 0; i < wind_ysize; i++) {
for (j = 0; j < wind_xsize; j++) {
map[y][x] += IA[i][j] * IB[i+y][j+x];
}
}
}
}
return 0;
}
a while ago I was implementing a digital filter to manipulate some images. My processor was actually an fpga which only supported C so no STL there. However I initially had my images as 2 D arrays. The calculations were very complex for every pixel. Not merely some mutliplication and this slowed my app down.
An improvement of about 10% was switching from 2D arrays to 1D array. Yes you can express all kinds of arrays to 1D array. Any array is simply a chunk of memory, even a 2D or nD. You just have to manipulate your indexing better. I think (not 100% sure) that the more dimensions an array has the more dereferencing the compiler makes to access the actual element, therefore a 1D array offers the best performance in this aspect.
Regarding vectors since you only use the access operators here I think you should be fine and performance wise comparable to traditional C arrays.
In addition if you can use lookup tables do so. For example if the values of your vectors are known beforehand you can create look up tables and simply choose the right value of that table with correct indexing.
What compilation options are you using? It seems your algorithm should be amenable for loop-unrolling and SSE vectorisation. Make sure you enable those options in your compiler.
Another possibility is to parallelize this code - things like OpenMP may help.
If this is not enough, you should use a better algorithm: you use an O(n^4) algorithm while a better one exists: O(n^2 log n) using FFT.
By using vectors of vectors you are doing a lot of dereferencing. You can 'cache' that dereferencing by extracting the internal int arrays at strategic points in the loops:
I don't think it will save you anything. If it does, change the compiler. There is no reason for vectors to be slower than arrays. Also, the compiler should move some dereferencing out of the inner loops.
A vector stores its data on the free-store, not the stack. This is a vector of a vector so each time in the loop you are dereferencing a pointer to locate your array of vectors and then dereferencing one of those vector's pointers to locate its array of ints. So this technique should produce some benefit.
I would think that this technique may also be faster for arrays of arrays (int**) but might work slightly slower for 2 dimensional arrays(int[][]).
Thanks for all the input guys, I implemented the cached dereferencing, and saw a massive speedup in my code (order of magnitude speedup, 5 minutes to about 20 seconds). Adding auto vectorization to the build settings reduced it further to about 5 seconds. I'll definitely be keeping this trick in mind as I keep developing the code.
I will be implementing an FFT-based approach soon, but I wanted to make sure I nailed this one first. Thanks again!!
You should also turn on inlining in the options of the compiler, if you haven't done it already.
In GCC it is done by default by -O3 option. The -O2 option itself is not sufficient.