I'm looking to squeeze the last bit of performance out of my current project and am resorting to minor optimizations to do so.
1) To speed up 2D table lookups, I wanted to pad my NxN tables so that the row stride becomes N', N' being N rounded up to the nearest power of 2. I thought this would allow faster index calculation, but the result is a slowdown of 10%. Was this a dumb idea, or did I use it wrong? Would it help if I manually replaced the multiplication by N' in the index calculation with a bitshift by log2(N')? (Tested; the answer is: barely.) The test elements were small (doubles), but my other tables have bigger elements, so the wasted memory will only grow. The test size is N = 3000.
[edit] Now that I've typed this out, it seems obvious that the cost of array lookups isn't the index calculation. Any other ways of speeding things up?
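For reference, the difference I was testing is just multiply-versus-shift in the index calculation; a minimal sketch (illustrative names and constants, not the actual code):

```cpp
#include <vector>

// Sketch only: illustrative names/constants, not the actual code.
constexpr int N       = 3000;   // logical table size
constexpr int NP      = 4096;   // N rounded up to the next power of two (the padded row stride)
constexpr int LOG2_NP = 12;     // log2(NP)

// Original layout: stride N, the index needs an integer multiply.
double lookupMul(const std::vector<double>& table, int i, int j) {
    return table[i * N + j];
}

// Padded layout: stride NP, so the multiply can become a shift (i << 12 == i * 4096).
double lookupShift(const std::vector<double>& padded, int i, int j) {
    return padded[(i << LOG2_NP) + j];
}
```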
2) A lot of time is spent organizing items in a Binary Heap. I designed it so that each Heap element is only a joined index 'IJ'. This saves me time in the Heap (4 bytes to copy rather than 8) and in lookups (IJ is also the index into the linearized 2D tables), but it means I need to split the index apart on other occasions, which is expensive (i = IJ/n, j = IJ%n). Changing the system would take some effort. Was it naive to join the index? Is there some way for me to roughly evaluate the two implementations without having to code both?
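To make the tradeoff concrete, here is a minimal sketch of the join/split (made-up helper names; the power-of-two variant ties back to question 1):

```cpp
// Sketch only: made-up helper names.
// n is the table dimension (3000 in my case); IJ doubles as the index into the linearized tables.
inline int join  (int i, int j, int n) { return i * n + j; }
inline int splitI(int ij, int n)       { return ij / n; }   // integer division
inline int splitJ(int ij, int n)       { return ij % n; }   // modulo

// With a power-of-two stride NP (as in question 1), the split would become a shift and a mask:
//   i = ij >> LOG2_NP;   j = ij & (NP - 1);
```

(The compiler can usually get IJ/n and IJ%n out of a single division when both are computed together, so the split is roughly one div plus some bookkeeping.)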
If there's any information that could lead to better answers, please feel free to ask!
Where is the bottleneck? Did you profile your code? Are you sure that the operations you are trying to optimize have a considerable impact on performance?
I use VerySleepy to profile runtimes; it's free, but I'm not sure about its accuracy. The results show that a bit less than a third of the time is spent on updating the Heap. The majority of the time is spent in simple calculations that are repeated billions of times; they boil down to a handful of d(a, b) table lookups combined into a single value (sketched below).
As stated before, the d(a, b) function looks up a value (a double) in a 1D vector at index a*SIZE+b. For these tests, SIZE is 3000, which is roughly the size I'm optimizing for. In any case, the numbers don't change much for smaller sizes, and I don't imagine they will for larger ones.
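Simplified (the variable names and the exact set of terms are stand-ins; the real code differs, but the pattern of lookups is the same), the hot expression has this shape:

```cpp
#include <vector>

constexpr int SIZE = 3000;
std::vector<double> D(SIZE * SIZE);   // the linearized, symmetric 2D table

// d(a, b): one lookup in the linearized table; d(a, b) == d(b, a).
inline double d(int a, int b) { return D[a * SIZE + b]; }

double delta(int t1, int t2, int s1, int s2) {
    // "Intuitive" order of the terms:
    // return d(t1, s1) + d(s2, t2) - d(t1, t2) - d(s1, s2);

    // Reordered using the symmetry so that the lookups in each pair share a row (t1, then s2):
    return (d(t1, s1) - d(t1, t2)) + (d(s2, t2) - d(s2, s1));
}
```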
The commented line is the "intuitive" order of the calculation, while the uncommented one is another attempt at speeding things up by grouping the memory lookups together. (The table is symmetric, so d(a, b) == d(b, a).) This one did have an effect, I believe, but I can't be sure, because changing the order of operations, combined with the imprecision of floating-point calculations, leads to different results and therefore different behavior.
Never do this. Almost all modern compilers will rearrange your code for you to be more efficient - manually rearranging your code only reduces readability and understanding and does not speed up the program or compilation times. Compile both with and without optimizations and you will see how well your compiler is working ;)
I'm pretty sure this minor optimization did work, though. I keep track of a lot of low-level measurements and noticed a slight decrease in time/iteration on every set. It's unlikely (though impossible to prove) that this would be entirely attributable to "coincidentally taking easier iterations". If it was, that would mean the reordering of the calculations somehow leads to a better choice of iterations, which is fine by me.
Anyway, to provide some context, the program at hand is an optimization algorithm for a graph problem. At every step it calculates the effect of possible changes and stores them in a table (so each pair (i, j) is associated with a value) as well as in a Heap. The Heap is then used to search for a change to execute (not necessarily the one with the highest value). Based on that change, a load of table entries are changed and their positions in the Heap updated.
Code-wise, the structure is this:
1) A type of "change" is chosen; its evaluation functions (value, feasibility, etc.) are fetched as function pointers.
2) The table is initialized; each element is just a value and an index into the Heap.
3) The Heap is built; every element is an int IJ that refers to the pair (i, j) it represents, and thereby to its value in the table, Table[IJ] (i.e. Table[i*n+j]).
4) A change IJ is picked from the Heap, split back into the pair (i, j) and executed.
5) A load of table entries are updated (value) and, through the saved Heap index, each associated Heap element is repositioned according to its new value.
6) Repeat.
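A stripped-down sketch of those pieces (types, names and the helper calls in the comments are simplified stand-ins, not the real code):

```cpp
#include <vector>

constexpr int n = 3000;

// One table entry per pair (i, j): the value of that change plus where its element sits in the Heap.
struct TableEntry {
    double value;
    int    heapPos;   // position of the corresponding element inside Heap
};

std::vector<TableEntry> Table(n * n);   // linearized 2D table; the entry for (i, j) lives at i*n + j
std::vector<int>        Heap;           // binary heap of joined indexes IJ, ordered by Table[IJ].value

// Main loop, very roughly (helper names are placeholders):
//   int IJ = pickChangeFromHeap();      // not necessarily the top element
//   int i = IJ / n, j = IJ % n;         // split the joined index
//   execute(i, j);                      // apply the change
//   for (int ij : affectedEntries(i, j)) {
//       Table[ij].value = evaluate(ij);        // via the fetched function pointers
//       repositionInHeap(Table[ij].heapPos);   // sift up/down to restore heap order
//   }
```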
Extra question:
The Heap is a significant (25~30%) portion of the runtime. I tried to minimize it by limiting the size of each element (just IJ, 4 bytes). This means I have to split the index up again (i = IJ/N, j = IJ%N), but I'm pretty sure the splits are insignificant compared to the Heap movements. However, all the Heap work needs to look up the values in the value Table at every comparison.
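Simplified (reusing the Table/Heap names from the sketch above; not the literal code), a single comparison during sift-up/sift-down amounts to:

```cpp
// Sketch: every heap comparison goes back through the value Table.
bool heapLess(int a, int b) {   // a, b are positions inside Heap
    return Table[Heap[a]].value < Table[Heap[b]].value;   // two indirections per side
}
```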
Would it be cheaper to include a copy of the value in the Heap items, requiring more data per movement (4+8 bytes instead of 4), but sparing the double lookup at each comparison?
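The alternative I have in mind would look roughly like this (again just a sketch): each Heap element carries a copy of its value, so comparisons stay within the Heap's own memory, at the cost of moving bigger elements (typically 16 bytes with alignment padding, rather than 4) and of refreshing the copy whenever the Table value changes.

```cpp
// Sketch of the alternative: cache the value inside the heap element.
struct HeapItem {
    int    ij;      // joined index, still usable for table lookups
    double value;   // copy of Table[ij].value; the struct is typically 16 bytes due to alignment
};

bool heapLessCached(const HeapItem& a, const HeapItem& b) {
    return a.value < b.value;   // no Table lookup while sifting
}
```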
(Note that while you're welcome to point out possible improvements to the algorithm, the actual logic is much more complex than described here. Most of the filtering/restricting/skipping work in the evaluations is already done. The algorithm itself is pretty efficient, but the code may well hold some room for improvement.)