Let's say I have two separate functions which do the same thing: simply initialize a 2D array.
Function 1 does it with a straightforward nested for loop, i.e.

    for (i = x; i < y; i++)
        for (j = x; j < y; j++)
Function 2 does it with one for loop executed after the other, i.e.

    for (i = x; i < y; i++)
    for (i = x; i < y; i++)
Is Function 2 considered faster, since we are just adding two O(n) loops together, as opposed to Function 1, which is the product of two O(n) loops?
Either way it will always be O(N), N being rows * cols (or the n^2 above for a square matrix; however you want to think about it, to me there are N items to initialize).
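To see why the loop shape does not change the count, here is a minimal sketch (the names are made up, and I'm assuming a flat row-major block of ints): both versions perform exactly rows * cols writes.

    // Nested loops: rows * cols writes total.
    void init_nested(int* a, int rows, int cols) {
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                a[i * cols + j] = 0;
    }

    // One flat loop: the same rows * cols writes, just one loop variable.
    void init_flat(int* a, int rows, int cols) {
        for (int k = 0; k < rows * cols; k++)
            a[k] = 0;
    }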
There are some small things that can make a big difference for very large matrices.
-- Is it dynamic? A 2D pointer construct (that is, vector<vector<thing>> or thing**) is not a solid block of memory. This usually causes page faults, which cause slowness.
If it is really a 1D object, or a fixed-dimension 2D array (int x[3][5] style), it is a single block. This is faster for two reasons (see the sketch after this list):
1) You only need a single loop variable, so you don't have the double jump, double loop-variable update, and double comparison of nested loops. This is a small cost, but it adds up.
2) You avoid page faults as much as possible by just iterating over one block, or multiple sequential blocks. Page faults are a big cost.
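To make the two layouts concrete, a minimal sketch (assuming int elements; the function names are just for illustration):

    #include <vector>

    // Not a solid block: every row is its own heap allocation,
    // so the rows can be scattered anywhere in memory.
    std::vector<std::vector<int>> make_ragged(int rows, int cols) {
        return std::vector<std::vector<int>>(rows, std::vector<int>(cols, 0));
    }

    // A solid block: one allocation of rows * cols ints,
    // indexed as flat[i * cols + j].
    std::vector<int> make_flat(int rows, int cols) {
        return std::vector<int>(rows * cols, 0);
    }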
Even when accessing a solid block, you still want to access it across rows, that is, access this in 1 2 3 4 ... order:
123
456
789
and avoid accessing it in 147, 258, 369 (column) order. Column-wise access also causes page faults, even if it's a solid block, if the number of columns is very high.
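A quick sketch of the two traversal orders over a flat rows x cols block (identifiers invented):

    // Row order (1 2 3 4 5 6 7 8 9 above): walks memory sequentially.
    void touch_by_rows(int* a, int rows, int cols) {
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                a[i * cols + j]++;
    }

    // Column order (147, 258, 369 above): every access jumps cols
    // ints ahead, which hurts when cols is very large.
    void touch_by_cols(int* a, int rows, int cols) {
        for (int j = 0; j < cols; j++)
            for (int i = 0; i < rows; i++)
                a[i * cols + j]++;
    }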
You can see this in action: if you multiply two large matrices with the brute-force N^3 algorithm, then do it again in a version that transposes one matrix first and then goes across the rows of both matrices, the second version will run notably faster in spite of the time spent copying into the transposed matrix.
At some point (going to smaller and smaller matrices) the transpose will eventually take more time than it saves, and that crossover point varies from machine to machine.
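Here is roughly what the two versions look like (an untuned sketch, assuming flat row-major n x n matrices; the names are made up):

    #include <vector>

    // Brute force: the inner loop walks b column-wise (stride n).
    std::vector<double> mul_naive(const std::vector<double>& a,
                                  const std::vector<double>& b, int n) {
        std::vector<double> c(n * n, 0.0);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
        return c;
    }

    // Transpose b first, so both inner accesses go across rows.
    std::vector<double> mul_transposed(const std::vector<double>& a,
                                       const std::vector<double>& b, int n) {
        std::vector<double> bt(n * n);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                bt[j * n + i] = b[i * n + j];   // copy cost, repaid for large n
        std::vector<double> c(n * n, 0.0);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i * n + j] += a[i * n + k] * bt[j * n + k];
        return c;
    }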
If that isn't enough to take in, you can also multithread it, doing one block (several rows each, maybe?) on each CPU. How you divide the memory up there matters too!
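A minimal sketch of that row-band split (using std::thread; all names are illustrative, and I'm again assuming a flat row-major array):

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Hand each thread a contiguous band of rows, so each one
    // walks its own solid chunk of memory.
    void init_parallel(std::vector<int>& a, int rows, int cols, int nthreads) {
        std::vector<std::thread> pool;
        int band = (rows + nthreads - 1) / nthreads;  // rows per thread, rounded up
        for (int t = 0; t < nthreads; t++) {
            int lo = t * band;
            int hi = std::min(rows, lo + band);
            pool.emplace_back([&a, cols, lo, hi] {
                for (int i = lo; i < hi; i++)
                    for (int j = 0; j < cols; j++)
                        a[i * cols + j] = 0;
            });
        }
        for (auto& th : pool) th.join();
    }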