Multi-Dimensional Arrays: Column-Major vs. Row-Major Performance

This may be obvious to some, but I see it come up from time to time in people's code: values are assigned to a multi-dimensional array in column-major order. I've looked into execution times using both orderings, and row-major assignment is typically many times faster than column-major assignment — anywhere from about 7x to 35x in the tests below.

I'm using Code::Blocks w/ MinGW on Win64.

Consider the following code that investigates both methods:
#include <iostream>
#include <ctime>
#include <windows.h>

class CTimer {
	private:
		LARGE_INTEGER lFreq, lStart;

	public:
		CTimer() {
			QueryPerformanceFrequency(&lFreq);
		}

		inline void Start(void) {
			QueryPerformanceCounter(&lStart);
		}

		inline double Stop(void) {
			LARGE_INTEGER lEnd;
			QueryPerformanceCounter(&lEnd);
			return (double(lEnd.QuadPart - lStart.QuadPart)
				/ lFreq.QuadPart);
		}

};

class CAssignArray {
	private:
		long iIndex;
		long*** iTestData;
	public:
		CAssignArray() {
			iIndex = 0;
			iTestData = new long** [iIndex];
			for (long i = 0 ; i < iIndex ; ++i) {
				iTestData[i] = new long* [iIndex];
				for (long j = 0 ; j < iIndex ; ++j) {
					iTestData[i][j] = new long[iIndex];
				}
			}
		}

		CAssignArray(long index) {
			iIndex = index;
			iTestData = new long** [iIndex];
			for (long i = 0 ; i < iIndex ; ++i) {
				iTestData[i] = new long* [iIndex];
				for (long j = 0 ; j < iIndex ; ++j) {
					iTestData[i][j] = new long[iIndex];
				}
			}
		}

		~CAssignArray() {
			for (long i = 0 ; i < iIndex ; ++i) {
				for (long j = 0 ; j < iIndex ; ++j) {
					delete [] iTestData[i][j];
				}

				delete [] iTestData[i];
			}

			delete [] iTestData;
		}

		inline void ColumnOrder(void){
			for (long k = 0 ; k < iIndex ; ++k) {
				for (long j = 0 ; j < iIndex ; ++j) {
					for (long i = 0 ; i < iIndex ; ++i) {
						iTestData[i][j][k] = 0;
					}

				}

			}

		}

		inline void RowOrder(void){
			for (long i = 0 ; i < iIndex ; ++i) {
				for (long j = 0 ; j < iIndex ; ++j) {
					for (long k = 0 ; k < iIndex ; ++k) {
						iTestData[i][j][k] = 0;
					}

				}

			}

		}

};

using namespace std;

int main (int argc, char** argv){
	CTimer TotalTime;
	TotalTime.Start();
	for (long index = 0 ; index <= 640 ; index += 128) {
		CAssignArray Array(index);
		CTimer Timer;
		Timer.Start();
		Array.ColumnOrder();
		float fColTime = Timer.Stop();
		Timer.Start();
		Array.RowOrder();
		float fRowTime = Timer.Stop();
		cout << "Array Size in 4-Byte Longs:\t[" << index << "] ["
			<< index << "] [" << index << "]" << endl;
		cout << "Total Mem Usage for Array:\t" <<
			index * index * index * 4 / 1024 / 1024 << " MB" << endl;
		cout << fixed << "Column Order Assignment Time:\t"
			<< fColTime << " s" << endl;
		cout << fixed << "Row Order Assignment Time:\t" <<
			fRowTime << " s" << endl;
		float fDeltaTime = fColTime / fRowTime;
		cout << fixed << "Delta Value (Col/Row):\t\t"
                        << fDeltaTime << endl << endl << endl;
	}

	cout << fixed << "Total Running Time:\t" << TotalTime.Stop()
                << "s" << endl;
	return 0;
}


On my system this produces the following output:
Array Size in 4-Byte Longs:     [0] [0] [0]
Total Mem Usage for Array:      0 MB
Column Order Assignment Time:   0.000002 s
Row Order Assignment Time:      0.000002 s
Delta Value (Col/Row):          1.000000


Array Size in 4-Byte Longs:     [128] [128] [128]
Total Mem Usage for Array:      8 MB
Column Order Assignment Time:   0.027671 s
Row Order Assignment Time:      0.003828 s
Delta Value (Col/Row):          7.227890


Array Size in 4-Byte Longs:     [256] [256] [256]
Total Mem Usage for Array:      64 MB
Column Order Assignment Time:   0.674587 s
Row Order Assignment Time:      0.031342 s
Delta Value (Col/Row):          21.523643


Array Size in 4-Byte Longs:     [384] [384] [384]
Total Mem Usage for Array:      216 MB
Column Order Assignment Time:   2.880800 s
Row Order Assignment Time:      0.107831 s
Delta Value (Col/Row):          26.715811


Array Size in 4-Byte Longs:     [512] [512] [512]
Total Mem Usage for Array:      512 MB
Column Order Assignment Time:   8.375519 s
Row Order Assignment Time:      0.252264 s
Delta Value (Col/Row):          33.201416


Array Size in 4-Byte Longs:     [640] [640] [640]
Total Mem Usage for Array:      1000 MB
Column Order Assignment Time:   17.697803 s
Row Order Assignment Time:      0.494997 s
Delta Value (Col/Row):          35.753338


Total Running Time:     32.220593s

Process returned 0 (0x0)   execution time : 32.236 s
Press any key to continue.


So as you can see, substantial run-time penalties can be incurred when using column-major assignment. I suspect the reverse would hold in a language such as Fortran, where arrays are stored in column-major order.

If your code is time-sensitive, you should keep this in mind.
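
As a side note, the triple-pointer approach above allocates each row separately, so the rows themselves aren't guaranteed to be contiguous. A minimal sketch of an alternative (not part of the timed test above; the FlatArray name and indexing helper are just my own illustration) is to allocate one contiguous block and compute the index by hand, which makes the row-major traversal explicit:

#include <vector>

// Illustrative only: a contiguous 3-D array stored in row-major order.
// Element (i, j, k) lives at offset (i*N + j)*N + k, so iterating k in
// the innermost loop walks memory strictly sequentially.
struct FlatArray {
	long N;
	std::vector<long> data;

	explicit FlatArray(long n)
		: N(n), data(static_cast<std::size_t>(n) * n * n, 0) {}

	long& at(long i, long j, long k) {
		return data[(i * N + j) * N + k];
	}
};

void RowOrderFill(FlatArray& a) {
	for (long i = 0 ; i < a.N ; ++i)
		for (long j = 0 ; j < a.N ; ++j)
			for (long k = 0 ; k < a.N ; ++k)
				a.at(i, j, k) = 0;
}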

I hope this helps someone ^^.

Please feel free to make whatever comments you have about this.
Interestingly, on my box, my Row Order times are twice as long and the Column times are up to half as long.

Array Size in 4-Byte Longs:     [0] [0] [0]
Total Mem Usage for Array:      0 MB
Column Order Assignment Time:   0.000000 s
Row Order Assignment Time:      0.000000 s
Delta Value (Col/Row):          1.071240


Array Size in 4-Byte Longs:     [128] [128] [128]
Total Mem Usage for Array:      8 MB
Column Order Assignment Time:   0.033804 s
Row Order Assignment Time:      0.008387 s
Delta Value (Col/Row):          4.030241


Array Size in 4-Byte Longs:     [256] [256] [256]
Total Mem Usage for Array:      64 MB
Column Order Assignment Time:   0.311264 s
Row Order Assignment Time:      0.066459 s
Delta Value (Col/Row):          4.683558


Array Size in 4-Byte Longs:     [384] [384] [384]
Total Mem Usage for Array:      216 MB
Column Order Assignment Time:   1.671457 s
Row Order Assignment Time:      0.223147 s
Delta Value (Col/Row):          7.490384


Array Size in 4-Byte Longs:     [512] [512] [512]
Total Mem Usage for Array:      512 MB
Column Order Assignment Time:   4.476876 s
Row Order Assignment Time:      0.544009 s
Delta Value (Col/Row):          8.229415


Array Size in 4-Byte Longs:     [640] [640] [640]
Total Mem Usage for Array:      1000 MB
Column Order Assignment Time:   10.228748 s
Row Order Assignment Time:      1.030067 s
Delta Value (Col/Row):          9.930178
You're most likely compiling with target -> debug. The extra overhead tends to smooth out the differences.

Try compiling with target -> release. You should find the discrepancy more glaring.
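
For what it's worth, if you build from the command line rather than an IDE, the release target essentially just means turning the optimizer on; for example (the exact flags are only a suggestion, and arraytest.cpp is a placeholder for whatever the source file is named):

g++ -O2 arraytest.cpp -o arraytest
cl /O2 /EHsc arraytest.cpp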
I just compiled it on the command line using cl without any flags, not using a project.
I suspect a large part of the difference as the array size grows is due to continual swapping.
Assembler dump for ColumnOrder:

_ZN12CAssignArray11ColumnOrderEv:
.LFB1413:
	pushl	%ebp
.LCFI18:
	movl	%esp, %ebp
.LCFI19:
	subl	$16, %esp
.LCFI20:
	movl	$0, -12(%ebp)
	jmp	.L16
.L17:
	movl	$0, -8(%ebp)
	jmp	.L18
.L19:
	movl	$0, -4(%ebp)
	jmp	.L20
.L21:
	movl	8(%ebp), %eax
	movl	4(%eax), %edx
	movl	-4(%ebp), %eax
	sall	$2, %eax
	leal	(%edx,%eax), %eax
	movl	(%eax), %edx
	movl	-8(%ebp), %eax
	sall	$2, %eax
	leal	(%edx,%eax), %eax
	movl	(%eax), %edx
	movl	-12(%ebp), %eax
	sall	$2, %eax
	leal	(%edx,%eax), %eax
	movl	$0, (%eax)
	addl	$1, -4(%ebp)
.L20:
	movl	8(%ebp), %eax
	movl	(%eax), %eax
	cmpl	-4(%ebp), %eax
	jg	.L21
	addl	$1, -8(%ebp)
.L18:
	movl	8(%ebp), %eax
	movl	(%eax), %eax
	cmpl	-8(%ebp), %eax
	jg	.L19
	addl	$1, -12(%ebp)
.L16:
	movl	8(%ebp), %eax
	movl	(%eax), %eax
	cmpl	-12(%ebp), %eax
	jg	.L17
	leave
	ret


Assembler dump for RowOrder:

_ZN12CAssignArray8RowOrderEv:
.LFB1414:
	pushl	%ebp
.LCFI21:
	movl	%esp, %ebp
.LCFI22:
	subl	$16, %esp
.LCFI23:
	movl	$0, -12(%ebp)
	jmp	.L27
.L28:
	movl	$0, -8(%ebp)
	jmp	.L29
.L30:
	movl	$0, -4(%ebp)
	jmp	.L31
.L32:
	movl	8(%ebp), %eax
	movl	4(%eax), %edx
	movl	-12(%ebp), %eax
	sall	$2, %eax
	leal	(%edx,%eax), %eax
	movl	(%eax), %edx
	movl	-8(%ebp), %eax
	sall	$2, %eax
	leal	(%edx,%eax), %eax
	movl	(%eax), %edx
	movl	-4(%ebp), %eax
	sall	$2, %eax
	leal	(%edx,%eax), %eax
	movl	$0, (%eax)
	addl	$1, -4(%ebp)
.L31:
	movl	8(%ebp), %eax
	movl	(%eax), %eax
	cmpl	-4(%ebp), %eax
	jg	.L32
	addl	$1, -8(%ebp)
.L29:
	movl	8(%ebp), %eax
	movl	(%eax), %eax
	cmpl	-8(%ebp), %eax
	jg	.L30
	addl	$1, -12(%ebp)
.L27:
	movl	8(%ebp), %eax
	movl	(%eax), %eax
	cmpl	-12(%ebp), %eax
	jg	.L28
	leave
	ret


I.e., the two functions are identical except that two instructions use different constant offsets; the number of instructions, the instructions themselves, and the addressing modes are all the same.

(I tried this on Linux and disabled swap, but the results are the same. I also tried with and without compiler
optimizations; no difference).
The reason for this is that cache locality is very bad for the column-order method.
So you should always choose whichever loop order accesses the data more sequentially; see:
http://en.wikipedia.org/wiki/Locality_of_reference

This is especially important when doing image manipulation. One might intuitively do this:
for (int x=0;x<width;x++)
{
  for (int y=0;y<height;y++)
  {
    //do something with each pixel
  }
}


But since bitmaps in memory are usually stored row by row, this loop order completely kills performance.
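
Swapping the two loops so that y is outer and x is inner keeps the accesses within one row before moving on to the next (a minimal sketch, using the same width/height and pixel placeholder as above):

for (int y=0;y<height;y++)
{
  for (int x=0;x<width;x++)
  {
    //do something with each pixel; consecutive x values are adjacent
    //in memory, so this walks the bitmap row by row
  }
}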
cachegrind output:

Data for ColumnOrder only:

==28047== 
==28047== I   refs:      3,373,725,947
==28047== I1  misses:            2,018
==28047== L2i misses:              702
==28047== I1  miss rate:          0.00%
==28047== L2i miss rate:          0.00%
==28047== 
==28047== D   refs:      2,168,879,033  (1,600,528,348 rd + 568,350,685 wr)
==28047== D1  misses:      963,907,096  (  490,989,806 rd + 472,917,290 wr)
==28047== L2d misses:      322,693,036  (    1,175,544 rd + 321,517,492 wr)
==28047== D1  miss rate:          44.4% (         30.6%   +        83.2%  )
==28047== L2d miss rate:          14.8% (          0.0%   +        56.5%  )
==28047== 
==28047== L2 refs:         963,909,114  (  490,991,824 rd + 472,917,290 wr)
==28047== L2 misses:       322,693,738  (    1,176,246 rd + 321,517,492 wr)
==28047== L2 miss rate:            5.8% (          0.0%   +        56.5%  )


Data for RowOrder only:

==28063== 
==28063== I   refs:      2,429,099,153
==28063== I1  misses:            1,981
==28063== L2i misses:              706
==28063== I1  miss rate:          0.00%
==28063== L2i miss rate:          0.00%
==28063== 
==28063== D   refs:      1,225,150,028  (656,799,718 rd + 568,350,310 wr)
==28063== D1  misses:       31,712,405  (  1,049,070 rd +  30,663,335 wr)
==28063== L2d misses:       26,214,103  (    662,414 rd +  25,551,689 wr)
==28063== D1  miss rate:           2.5% (        0.1%   +         5.3%  )
==28063== L2d miss rate:           2.1% (        0.1%   +         4.4%  )
==28063== 
==28063== L2 refs:          31,714,386  (  1,051,051 rd +  30,663,335 wr)
==28063== L2 misses:        26,214,809  (    663,120 rd +  25,551,689 wr)
==28063== L2 miss rate:            0.7% (        0.0%   +         4.4%  )


Note about L2 cache results:

/valgrind --tool=cachegrind a.out
==28047== Cachegrind, an I1/D1/L2 cache profiler.
==28047== Copyright (C) 2002-2006, and GNU GPL'd, by Nicholas Nethercote et al.
==28047== Using LibVEX rev 1658, a library for dynamic binary translation.
==28047== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
==28047== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
==28047== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
==28047== For more details, rerun with: -v
==28047== 
--28047-- warning: Unknown Intel cache config value (0x4E), ignoring
--28047-- warning: L2 cache not installed, ignore L2 results.

@Athar

The article you linked talks about what I was seeing most clearly in this section:

Spatial and temporal locality example: matrix multiplication

A common example is matrix multiplication:

for i in 0..n
for j in 0..m
for k in 0..p
C[i][j] = C[i][j] + A[i][k] * B[k][j];

When dealing with large matrices, this algorithm tends to shuffle data around too much. Since memory is pulled up the hierarchy in consecutive address blocks, in the C programming language it would be advantageous to refer to several memory addresses that share the same row (spatial locality). By keeping the row number fixed, the second element changes more rapidly. In C and C++, this means the memory addresses are used more consecutively. One can see that since j affects the column reference of both matrices C and B, it should be iterated in the innermost loop (this will fix the row iterators, i and k, while j moves across each column in the row). This will not change the mathematical result, but it improves efficiency. By switching the looping order for j and k, the speedup in large matrix multiplications becomes dramatic. (In this case, 'large' means, approximately, more than 100,000 elements in each matrix, or enough addressable memory such that the matrices will not fit in L1 and L2 caches.)


That article and those informational printouts from jsmith really helped me visualize what was going on under the covers.
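
For completeness, here is a minimal C++ sketch of the loop interchange the quoted passage describes, with j moved to the innermost loop so that C[i][j] and B[k][j] are walked along contiguous rows. The flat row-major std::vector storage and the function name are my own assumptions for illustration, not code from the article or this thread:

#include <vector>

// C is n x m, A is n x p, B is p x m, all stored row-major in flat vectors.
// Loop order i-k-j: A[i][k] is constant in the inner loop, while C[i][j]
// and B[k][j] are accessed at consecutive addresses as j increases.
void MultiplyCacheFriendly(const std::vector<double>& A,
                           const std::vector<double>& B,
                           std::vector<double>& C,
                           long n, long m, long p)
{
	for (long i = 0 ; i < n ; ++i) {
		for (long k = 0 ; k < p ; ++k) {
			const double a = A[i * p + k];
			for (long j = 0 ; j < m ; ++j) {
				C[i * m + j] += a * B[k * m + j];
			}
		}
	}
}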