What to substitute for a non-type template parameter?

Hi all,

I would like to define a templated class with a non-type parameter like this:

template <int s>
class C40 {
public:
	int m[s];
	int other1;
	float other2[400];
	double other3[200];
};
template class C40<200>;
template class C40<300>;


But I need to declare variables with a non-const size, which is not allowed:

int i  = 200;
C40<i> toto[1000000];


Compiler says:

'i' cannot appear in a constant-expression


For my use case, toto will end up in shared memory.

An alternative would be to separate m from C40 and allocate it dynamically, but with the shared memory it would become a mess to manage.

Any ideas, please?
Why would it be a mess to manage the individually dynamically allocated m?

#include <iostream>
#include <vector>

class C40 {
 public:
    std::vector<int> m;
    int other1;
    float other2[400];
    double other3[200];
    C40(int s) : m(s) {}
};

int main()
{
    int i = 200;
    std::vector<C40> toto(1000000, C40(i));
}


Your C40<200> and C40<300> are two different types. C++ only supports arrays/vectors of objects of the same type, so there can never be a "C40<i> toto[...]" for some variable i: only "C40<200> toto[...]" or "C40<300> toto[...]".
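Spelled out (an illustrative snippet, not from your code):

// each instantiation is a distinct type, so each array commits to one size
C40<200> toto200[1000];  // fine
C40<300> toto300[1000];  // fine
// C40<i> toto[1000];    // ill-formed: i is not a constant expression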
Thanks Cubbi.

Just to check my understanding: C40(int s) : m(s) {} initializes m with s, doesn't it? Since s is an int and not a std::vector<int>, it constructs m with size s?

Is it then equivalent to C40(int s) {m.resize(s);} ?

Some more questions:

1) In my shared-memory context, where toto will be mapped in /dev/toto, should I manage capacity with reserve() to prevent wasted memory?

2) I am very concerned about memory resources because my data are huge. What is the overhead of vectors versus arrays?
C40(int s) : m(s) {} will use the vector constructor that takes the size as argument.

C40(int s) {m.resize(s);} will first construct m with the default constructor and after that call resize to resize the vector.

The end result will be the same, but the first way is possibly a little faster because it has less to do.
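A small side-by-side sketch of the two forms (illustrative only; the names A and B are made up):

#include <vector>

struct A {
    std::vector<int> m;
    A(int s) : m(s) {}         // m is constructed directly with s value-initialized elements
};

struct B {
    std::vector<int> m;
    B(int s) { m.resize(s); }  // m is default-constructed empty, then resized
};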
1) Define "memory waste". To place containers such as vectors in shared memory, we use allocators. The allocator will receive the same request from the constructor above as from a call to reserve().

2) There is no overhead compared to a dynamically allocated array. But indeed, your original structure, with a fixed-size array member, would take up less room and avoid indirection. Perhaps you could deal with a vector<C40<200> > toto2 and a vector<C40<300> > toto3?
Thanks for the clarifications.

I also found this interesting:
http://stackoverflow.com/questions/381621/using-arrays-or-stdvectors-in-c-whats-the-performance-gap
Most contributors say the overhead of vectors vs arrays is not significant, except duli.

I have reworked duli's example:

#include <iostream>
#include <vector>
#include <ctime>
using namespace std;

class X42 {
public:
	vector<int> vec;
	X42(const int s) : vec(s) {}
	X42(const vector<int>& v) {vec = v;}
	int first() { return vec[0];}
};
class Y42 {
public:
	int f[3];
	Y42(int a[]) {f[0]=a[0]; f[1]=a[1];f[2]=a[2];}
	int first() { return f[0];}
};

int main() {
			int sum = 0;
			X42 v1(3);
			v1.vec[0]=1; v1.vec[1]=2;v1.vec[2]=3;
			int starttime1=time(NULL);
			for (int i=0;i<50000;i++) for (int j=0;j<10000;j++) {
				X42 x(v1);
				sum+=x.first() + i + j;
			}
			int endtime1=time(NULL);
			cout << "duration with vectors : " << endtime1 - starttime1 << endl; // = 19 - user	0m19.288s
//
//			int v2[3];
//			v2[0]=1; v2[1]=2;v2[2]=3;
//			int starttime2=time(NULL);
//			for (int i=0;i<500000;i++) for (int j=0;j<100000;j++) {
//				Y42 x(v2);
//				sum+=x.first() + i + j;
//				//cout << "i = " << i <<", j = " << j << ", sum = " << sum << endl;
//			}
//			int endtime2=time(NULL);
//			cout << "100 x duration with arrays : " << endtime2 - starttime2 << endl; // = 36 - user	0m36.317s
			cout << sum << endl;
}


When I comment out the array part once and then the vector part, I find that in this use case arrays are more than 50 times faster than vectors.

$ time ./test 42
100 x duration with arrays : 36
-1482588160

real	0m36.389s
user	0m36.317s
sys	0m0.000s
$ time ./test 42
duration with vectors : 19
1974202368

real	0m19.326s
user	0m19.288s
sys	0m0.001s


duli says:
Sometimes arrays are indeed better than vectors. If you are always manipulating a fixed length set of objects, arrays are better.


That is my case. My application deals with very large data sets, so memory and CPU footprints are a concern. What do you think about it?

Then I am back to my original question, without the help of vectors...
When I comment out the array part once and then the vector part, I find that in this use case arrays are more than 50 times faster than vectors.
Am I reading this wrong, is 36 s quicker than 19 s? (Smaller is better, right?)

Also, I think this test case is a little off: the array version uses fixed indexes in its constructor, whereas the vector version first default-constructs the vector and then uses the assignment operator, so I don't think that is apples to apples.
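For illustration (a hypothetical tweak, not in the posted code), constructing the member in the initializer list would avoid that default-construct-then-assign pattern:

X42(const std::vector<int>& v) : vec(v) {}  // copy-constructs vec directly, instead of { vec = v; }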

Also, why are you using a template like this? If you have many different sizes you will get serious code bloat, which doesn't make sense because, other than the size of your array, the class behaves identically.
Am I reading this wrong, is 36 s quicker than 19 s? (Smaller is better, right?)

For arrays, the loop is performed 100 times more, so you have to compare 0.36 s to 19 s.

the vector version first default-constructs the vector and then uses the assignment operator, so I don't think that is apples to apples.

Yes, but this is done once before the loop, so as not to disadvantage vectors. So in my opinion that is apples to apples. But I am not an expert. If you propose some code, I will test it.

Also, why are you using a template like this?

I will have only 4 different sizes, with tens to hundreds of MB each, in a shared memory mapped to /dev. The other data in the class will differ between instances. I am open to suggestions, of course.
but this is done once before the loop,

there is a vector constructed inside the loop (the X42 x(v1); statement), which allocates storage from the heap (by default); that is of course slower than constructing the fixed-size array object (Y42 x(v2);) on the stack.

I will have only 4 different sizes, with tens to hundreds of MB each, in a shared memory mapped to /dev.

If I were you, I'd create 4 different vectors using an allocator that accesses your shared memory. If your API is one of the ones supported by boost, you could use boost.interprocess library for that http://www.boost.org/doc/libs/release/doc/html/interprocess/quick_guide.html#interprocess.quick_guide.qg_interprocess_container
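Roughly along the lines of that quick guide (a minimal sketch; the segment name, object name, and sizes are placeholders, and error handling is omitted):

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <boost/interprocess/allocators/allocator.hpp>

using namespace boost::interprocess;

// an allocator that hands out memory from the shared segment
typedef allocator<int, managed_shared_memory::segment_manager> ShmemAllocator;
typedef vector<int, ShmemAllocator> ShmVector;

int main()
{
    shared_memory_object::remove("MySharedMemory");                    // drop any stale segment
    managed_shared_memory segment(create_only, "MySharedMemory", 65536);

    const ShmemAllocator alloc(segment.get_segment_manager());

    // construct the vector inside the shared segment; another process can
    // find it by name with segment.find<ShmVector>("MyVector")
    ShmVector* v = segment.construct<ShmVector>("MyVector")(alloc);
    v->resize(200);

    segment.destroy<ShmVector>("MyVector");
    shared_memory_object::remove("MySharedMemory");
}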
Vectors only have an advantage when you use dynamic sizing. In your case you can happily use POD arrays.
For arrays, the loop is performed 100 times more, so you have to compare 0.36 s to 19 s.
Yup, my mistake, I saw your comment about the 100x and looked at the loop counts (but I must have just breezed over them, because I thought they were the same).

Either way, as duli stated in his post, the drawback of this test case is the constructor calls for the vector in the inner loop, so this test is definitely skewed to favor arrays.

I will have only 4 different sizes, with tens to hundreds of MB each, in a shared memory mapped to /dev. The other data in the class will differ between instances. I am open to suggestions, of course.
Is the array/vector in the class truly going to hold integers only (as in the test case)? After running some sample tests myself, with the algorithm changed a little to give the vector a fair shot, I found the vector took twice as long as the array; however, I may not have all the optimization settings correct. I also believe part of the time difference comes from cache misses: the vector has additional memory to touch, and since the array and vector both hold ints, the data is easily aligned for the array. Change the type to some user-defined, less well-aligned type and you will see the gap close almost completely; in some cases the vector even outperforms the array, as in the test below... using your modified code.

#include <iostream>
#include <vector>
#include <string>
#include <ctime>
using namespace std;

struct TestObject
{
	TestObject():int_type(1), string_type("Some string"), double_type(3.4), char_type('c')
	{
	}
	int int_type;
	string string_type;
	double double_type;
	char char_type;
};

int main()
{
	long sum = 0;
	vector<TestObject> vec1;
	vec1.push_back(TestObject());
	vec1.push_back(TestObject());
	vec1.push_back(TestObject());

	vector<TestObject> vec2;
	vec2.push_back(TestObject());
	vec2.push_back(TestObject());
	vec2.push_back(TestObject());

	TestObject arr1[3] = { TestObject(), TestObject(), TestObject()};
	TestObject arr2[3] = { TestObject(), TestObject(), TestObject()};

	time_t start1, start2, end1, end2;
	const int outer = 50000, inner = 10000;

	clock_t starttime1=clock();
	time(&start1);
	for (int i=0;i<outer;i++) for (int j=0;j<inner;j++) {
		vec1[0] = vec2[0];
		vec1[1] = vec2[1];
		vec1[2] = vec2[2];
		sum+=vec1[0].int_type + i + j;
	}
	time(&end1);
	clock_t endtime1=clock();
	cout << "Elapsed time with vectors    : " << difftime(end1,start1) << endl;
	cout << "Clocks Per/sec with vectors : " << ((double)endtime1-starttime1)/CLOCKS_PER_SEC << endl; // = 19 - user	0m19.288s
	cout << sum << endl;
	sum = 0;
	
	clock_t starttime2=clock();
	time(&start2);
	for (int i=0;i<outer;i++) for (int j=0;j<inner;j++) {
		arr1[0] = arr2[0];
		arr1[1] = arr2[1];
		arr1[2] = arr2[2];
		sum+=arr1[0].int_type + i + j;
	}
	time(&end2);
	clock_t endtime2=clock();
	cout << "Elapsed time with arrays   : " << difftime(end2,start2) << endl;
	cout << "Clocks Per/Sec with arrays : " << ((double)endtime2-starttime2)/CLOCKS_PER_SEC  << endl; // = 36 - user	0m36.317s
	cout << sum << endl;
	return 0;
}



Elapsed time with vectors    : 30
Clocks Per/sec with vectors : 30.125
1974202368
Elapsed time with arrays   : 36
Clocks Per/Sec with arrays : 36.611
1974202368


Thanks all for this very interesting discussion.

clanmjc, on my machine, I obtain:

Elapsed time with vectors    : 10
Clocks Per/sec with vectors : 9.96
15000000000000
Elapsed time with arrays   : 10
Clocks Per/Sec with arrays : 9.96
15000000000000


Here are my compiler flags: -O3 -Wall -c -fmessage-length=0 -std=c++0x -std=gnu++0x

Now, if I modify TestObject to contain four integers:

struct TestObject
{
	TestObject():int_type(1), int2(2), int3(3), int4(4) {}
	int int_type;
	int int2;
	int int3;
	int int4;
};


I obtain:

Elapsed time with vectors    : 2
Clocks Per/sec with vectors : 2
15000000000000
Elapsed time with arrays   : 0
Clocks Per/Sec with arrays : 0
15000000000000


And with 100x more loops, say with outer = 500000, inner = 100000:

Elapsed time with vectors    : 127
Clocks Per/sec with vectors : 126.86
15000000000000000
Elapsed time with arrays   : 0
Clocks Per/Sec with arrays : 0
15000000000000000


And with 10000x more loops, say with outer = 5000000, inner = 1000000, it is the same. So I conclude that the compiler simply optimises away code whose results are not used.

Let's modify your code to actually use the data and avoid such optimisation:

	for (int i=0;i<outer;i++) for (int j=0;j<inner;j++) {
		vec1[0] = vec2[0];
		vec1[1] = vec2[1];
		vec1[2] = vec2[2];
		sum+=vec1[0].int_type + i + j + vec1[0].int2 + vec1[1].int3 + vec1[2].int4;
		vec2[1].int2 = vec1[2].int_type;
		vec2[1].int4 = vec1[2].int3;
		vec2[2].int2 = vec1[0].int_type;
		vec2[2].int4 = vec1[0].int3;
	}


	for (int i=0;i<outer;i++) for (int j=0;j<inner;j++) {
		arr1[0] = arr2[0];
		arr1[1] = arr2[1];
		arr1[2] = arr2[2];
		sum+=arr1[0].int_type + i + j + arr1[0].int2 + arr1[1].int3 + arr1[2].int4;
		arr2[1].int2 = arr1[2].int_type;
		arr2[1].int4 = arr1[2].int3;
		arr2[2].int2 = arr1[0].int_type;
		arr2[2].int4 = arr1[0].int3;
	} 


Elapsed time with vectors    : 5
Clocks Per/sec with vectors : 5.05
15004000000001
Elapsed time with arrays   : 1
Clocks Per/Sec with arrays : 0.54
15004000000001


So with integers only, I obtain a 10x gain with arrays vs vectors! With -O2, I have the same result. With -O1, arrays are identical and vectors are worse: 7.17. With no optimisations at all, I get 25.75 for vectors vs 5.08 for arrays, 25.75/5.08 ≈ 5.

As my tables will contain only integers, as neither the size nor the contents will change, and as my shared data are huge, I will stick to arrays.

Vectors only have an advantage when you use dynamic sizing. In your case you can happily use POD arrays.
Yep, thanks coder777
As my tables will contain only integers, as neither the size nor the contents will change, and as my shared data are huge, I will stick to arrays.


Yeah that is what I was getting at as well, definitely a fun discussion.
So with integers only, I obtain a 10x gain with arrays vs vectors!

If you look at the generated code, you will notice that gcc threw your array out and simply calculated sum in the second case, without any memory accesses (at least that's what it does for me).

Notice that if you'd actually compared apples to apples (a heap-allocated array to a vector), you'd get the same results with gcc.
If you look at the generated code, you will notice that gcc threw your array out and simply calculated sum in the second case, without any memory accesses


That's a killer argument! I am going to try to build another example where the compiler is not expected to perform such optimisation.

Notice that if you'd actually compared apples to apples (a heap-allocated array to a vector), you'd get the same results with gcc.


Could you elaborate, please? I am lost here.
Here is the new code:

#define SOMESIZE 10000u

	long sum = 0;
	time_t start1, start2, end1, end2;
	const int outer = 50000, inner = 10000;

	vector<TestObject> vec1(SOMESIZE);
	// initialise vec1
	for (int i = 0; i < SOMESIZE; i++) {
		vec1[i].int_type = rand() % 1000;
		vec1[i].int2 = rand() % 1000;
		vec1[i].int3 = rand() % 1000;
		vec1[i].int4 = rand() % 1000;
	}
	clock_t starttime1=clock();
	time(&start1);
	for (int i=0;i<outer;i++) for (int j=0;j<inner;j++) {
		sum += vec1[(i*j) % SOMESIZE].int_type + vec1[(i+j) % SOMESIZE].int2 + vec1[rand() % SOMESIZE].int3 + vec1[rand() % SOMESIZE].int4;
	}
	time(&end1);
	clock_t endtime1=clock();
	cout << "Elapsed time with vectors    : " << difftime(end1,start1) << endl;
	cout << "Clocks Per/sec with vectors : " << ((double)endtime1-starttime1)/CLOCKS_PER_SEC << endl; // = 19 - user	0m19.288s
	cout << sum << endl;
	sum = 0;

	TestObject arr1[SOMESIZE];
	// initialise arr1
	for (int i = 0; i < SOMESIZE; i++) {
		arr1[i].int_type = rand() % 1000;
		arr1[i].int2 = rand() % 1000;
		arr1[i].int3 = rand() % 1000;
		arr1[i].int4 = rand() % 1000;
	}

	clock_t starttime2=clock();
	time(&start2);
	for (int i=0;i<outer;i++) for (int j=0;j<inner;j++) {
		sum += arr1[(i*j) % SOMESIZE].int_type + arr1[(i+j) % SOMESIZE].int2 + arr1[rand() % SOMESIZE].int3 + arr1[rand() % SOMESIZE].int4;
	}
	time(&end2);
	clock_t endtime2=clock();
	cout << "Elapsed time with arrays   : " << difftime(end2,start2) << endl;
	cout << "Clocks Per/Sec with arrays : " << ((double)endtime2-starttime2)/CLOCKS_PER_SEC  << endl; // = 36 - user	0m36.317s
	cout << sum << endl;


Elapsed time with vectors    : 12
Clocks Per/sec with vectors : 12.56

Elapsed time with arrays   : 12
Clocks Per/Sec with arrays : 12.05


In the end, even with only aligned data (integers), I find the performance is very close. So I am going to change my decision and use vectors.

A big thank you to all of you.
Notice that if you'd actually compared apples to apples (a heap-allocated array to a vector), you'd get the same results with gcc.

Could you elaborate, please? I am lost here.


std::vector<int> v(size);, by default, places the array on the heap -- it uses std::allocator<int>, which uses new. That's different from using, for example, int arr[size];, which places the array on the stack. *Those* arrays are faster than vectors for some operations, and are amenable to more optimizations, as you saw. The C++ way to write them is std::array<int, N> arr; (with N a compile-time constant), although it only became popular some 10 years ago and became standard last year.
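A tiny sketch of the difference (illustrative only):

#include <array>
#include <vector>

int main()
{
    std::array<int, 200> a{};  // fixed size, no heap allocation, the size is part of the type
    std::vector<int> v(200);   // size chosen at run time, storage obtained from the heap
    a[0] = v[0] = 42;          // element access looks the same for both
}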

But since your goal is to drop the data into shared memory, you'll either manually placement-new your raw array in there, or you'll use a vector with a shared-memory allocator. I use those allocators a lot at work, but unlike vectors, they are not beginner stuff (unless you can use boost, then it's easy).
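For the raw-array route, a rough sketch with POSIX shared memory (the name "/toto" and the sizes are placeholders, and error handling is omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <new>

struct Block {               // stand-in for the fixed-size C40<200>
    int m[200];
    int other1;
    float other2[400];
    double other3[200];
};

int main()
{
    const std::size_t count = 1000;                      // placeholder element count
    const std::size_t bytes = count * sizeof(Block);

    int fd = shm_open("/toto", O_CREAT | O_RDWR, 0600);  // shows up under /dev/shm on Linux
    ftruncate(fd, bytes);
    void* mem = mmap(0, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    Block* toto = static_cast<Block*>(mem);
    for (std::size_t i = 0; i < count; ++i)
        new (&toto[i]) Block;                            // placement-new each element into the mapping

    toto[0].m[0] = 42;

    munmap(mem, bytes);
    close(fd);
    // shm_unlink("/toto") would remove the segment when it is no longer needed
}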
Topic archived. No new replies allowed.