Profiling code, 4 threads slower than 1

I must be doing something wrong, because for me 4 threads run 2 times slower than 1.
I have 2 vectors with a bunch of data to process. No data is shared between elements (they are not moved and are independent of each other), so I just need to compute some values from one vector and write the results into the other.

Relevant part
LARGE_INTEGER getFrequency()
{
	LARGE_INTEGER frequency;
	QueryPerformanceFrequency(&frequency);
	return frequency;
}

LONGLONG getCurrentTime()
{
    HANDLE currentThread   = GetCurrentThread();
    DWORD_PTR previousMask = SetThreadAffinityMask(currentThread, 1);

    static LARGE_INTEGER frequency = getFrequency();

    LARGE_INTEGER time;
    QueryPerformanceCounter(&time);

    // Restore the thread affinity
    SetThreadAffinityMask(currentThread, previousMask);

    // Return the current time as microseconds
    return 1000000 * time.QuadPart / frequency.QuadPart;
}

int main () 
{
	...

	std::vector<std::thread> threads(4);
	const int grainsize = vPointsIn.size() / 4;

	auto before = getCurrentTime();

#if 0
	transformPoints(&vPointsOut[0], &viewProj, &vPointsIn[0], vPointsIn.size());

#else
	for(std::size_t i = 0; i < 4; ++i)
	{
		threads[i] = std::thread(transformPoints, &vPointsOut[i * grainsize], &viewProj, &vPointsIn[i * grainsize], grainsize);
	}
	for(auto&& i : threads)
	{
		i.join();
	}
#endif

	auto after = getCurrentTime();
	auto lap = after - before;
	std::cout << "Time: " << lap << "mcs" << std::endl;
Well, this is not the relevant part. The relevant part is the transformPoints function, the size of your data and your hardware configuration. If vPointsIn holds something like 100 elements, the overhead of creating the threads will be higher than the benefit from parallel computation. Similarly, if your CPU has only 2 cores, 4 threads will not be faster than 2.

I would recommend using something like Intel TBB for this kind of task, as it is smart about using CPU resources: it picks a suitable number of threads and schedules work with cache locality in mind.
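
The two checks mentioned above (enough work per thread, and no more threads than cores) could be sketched roughly like this in standard C++. The helper name chooseThreadCount and the threshold kMinItemsPerThread are made up for illustration; the right threshold is an assumption here and has to be found by measuring on the target machine.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// Hypothetical threshold: below this many items per thread, thread start-up
// cost is assumed to outweigh the gain. Tune it by measuring.
const std::size_t kMinItemsPerThread = 10000;

// Returns how many worker threads to use for itemCount items (always >= 1).
std::size_t chooseThreadCount(std::size_t itemCount)
{
    // hardware_concurrency() may return 0 when the value is unknown.
    std::size_t cores = std::thread::hardware_concurrency();
    if (cores == 0)
        cores = 1;

    // Cap the thread count so that each thread gets a worthwhile chunk.
    std::size_t byWork = itemCount / kMinItemsPerThread;
    return std::max<std::size_t>(1, std::min(cores, byWork));
}
```

With a result of 1 the work would simply be done on the calling thread, without starting any std::thread at all.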
I have an AMD Phenom II X4, so 4 threads should be a reasonable choice?
The number of points in the vector is 200,000.

void transformPoints(float4* vOut, const float4x4* matrix, const float4* vIn, std::size_t N) 
{
	for(std::size_t i = 0; i < N; ++i)
	{
		float norm = 1.0f / (matrix->m[0][3] * vIn[i].x + matrix->m[1][3] * vIn[i].y + matrix->m[2][3] * vIn[i].z + matrix->m[3][3]);
		vOut[i].x  = (matrix->m[0][0] * vIn[i].x + matrix->m[1][0] * vIn[i].y + matrix->m[2][0] * vIn[i].z + matrix->m[3][0]) * norm;
		vOut[i].y  = (matrix->m[0][1] * vIn[i].x + matrix->m[1][1] * vIn[i].y + matrix->m[2][1] * vIn[i].z + matrix->m[3][1]) * norm;
		vOut[i].z  = (matrix->m[0][2] * vIn[i].x + matrix->m[1][2] * vIn[i].y + matrix->m[2][2] * vIn[i].z + matrix->m[3][2]) * norm;
		vOut[i].w  = 1.0f;				  
	}
}

Compiled with VS2012, "Release mode".
Single threaded: ~2000 microseconds
4 threads: ~2900 microseconds
How much time is actually taken? Does it change when you rerun the routine in a loop to eliminate fetch issues?

How does WaitForMultipleObjects compare with all those joins?

How much time is actually taken?

The number of points might even increase, and there is a lot more work for the CPU to do besides this, so I was looking for a way to speed things up if possible. I think no data is shared between threads here, so I don't need to mess with locks/mutexes and whatnot?


Does it change when you rerun the routine in a loop to eliminate fetch issues?

I don't understand this.


How does WaitForMultipleObjects compare with all those joins?


I tried that, but the function fails and GetLastError returns 6:

int main () 
{
	...

	std::vector<std::thread> threads(4);
	const int grainsize = vPointsIn.size() / 4;
	HANDLE handles[4];
	handles[0] = (HANDLE)threads[0].native_handle();
	handles[1] = (HANDLE)threads[1].native_handle();
	handles[2] = (HANDLE)threads[2].native_handle();
	handles[3] = (HANDLE)threads[3].native_handle();

	auto before = getCurrentTime();

#if 0
	transformPoints(&vPointsOut[0], &viewProj, &vPointsIn[0], vPointsIn.size());

#else
	for(std::size_t i = 0; i < 4; ++i)
	{
		threads[i] = std::thread(transformPoints, &vPointsOut[i * grainsize], &viewProj, &vPointsIn[i * grainsize], grainsize);
	}
	//for(auto&& i : threads)
	//{
	//	i.join();
	//}
	DWORD result = WaitForMultipleObjects(4, &handles[0], TRUE, INFINITE);
	if(WAIT_FAILED == result)
	{
		DWORD lastError = GetLastError();
		std::cout << "Last error: " << lastError << std::endl;
	}
#endif

	auto after = getCurrentTime();
	auto lap = after - before;
	std::cout << "Time: " << lap << "mcs" << std::endl;
}


I looked it up here: http://msdn.microsoft.com/en-us/library/cc231199.aspx
and it says: ERROR_INVALID_HANDLE
What am I doing wrong?
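Error 6 comes up because native_handle() is called on default-constructed std::thread objects, before any thread has been started, so the stored handles are not valid. To use native threads, you start them with _beginthreadex(). It returns the thread HANDLE (as a uintptr_t that you cast). You can pass an array of those to WaitForMultipleObjects.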

To get a proper feel for the measurement, you need to repeat the test. I'm suggesting running that test several times within the same program. Just put it in a function, call it ten times or so, and check whether all the runs are around that 2 ms / 2.9 ms mark.
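
A repeated measurement along those lines might look like the sketch below. It uses std::chrono::steady_clock rather than the QueryPerformanceCounter wrapper from the original post; that swap is an assumption made for portability, and on older Visual Studio releases the poster's getCurrentTime may well give better resolution. The names timeRuns and median are hypothetical helpers.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Runs `work` `runs` times and returns each duration in microseconds.
template <typename Work>
std::vector<long long> timeRuns(Work work, int runs)
{
    std::vector<long long> samples;
    for (int i = 0; i < runs; ++i)
    {
        auto before = std::chrono::steady_clock::now();
        work();
        auto after = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::microseconds>(after - before).count());
    }
    return samples;
}

// The median is less sensitive to one slow (for example cold-cache) run
// than the mean. Returns 0 for an empty sample set.
long long median(std::vector<long long> samples)
{
    if (samples.empty())
        return 0;
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

Passing the single-threaded call and the 4-thread version as two separate lambdas to timeRuns, and comparing the medians, gives a steadier picture than one run of each.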
I found an open source library while reading this article: http://progsch.net/wordpress/?p=81
After studying it a little, I came to the conclusion that thread creation could be the source of the slowness.
I also needed to edit it a bit, since VS2012 doesn't support variadic templates.
Now the results are satisfactory: run this way, 4 threads are consistently about 3 times faster than 1.
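
For readers wondering what "creating the threads once and reusing them" looks like, here is a minimal thread pool sketch. It is not the code of the library from the article; SimplePool, enqueue and waitIdle are names invented for this illustration, and it avoids variadic templates by taking a std::function<void()>. It assumes tasks do not throw.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal fixed-size pool: threads are created once and reused for every task.
class SimplePool
{
public:
    explicit SimplePool(std::size_t threadCount) : pending_(0), stopping_(false)
    {
        for (std::size_t i = 0; i < threadCount; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }

    ~SimplePool()
    {
        {
            std::unique_lock<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::size_t i = 0; i < workers_.size(); ++i)
            workers_[i].join();
    }

    // Queues a task; it runs on whichever worker becomes free first.
    void enqueue(std::function<void()> task)
    {
        {
            std::unique_lock<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
            ++pending_;
        }
        wake_.notify_one();
    }

    // Blocks until every task queued so far has finished.
    void waitIdle()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void workerLoop()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty())
                    return; // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // run outside the lock so other workers can proceed
            {
                std::unique_lock<std::mutex> lock(mutex_);
                if (--pending_ == 0)
                    idle_.notify_all();
            }
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable idle_;
    std::size_t pending_;
    bool stopping_;
};
```

With a pool like this, each quarter of the point array would be enqueued as a lambda that calls transformPoints on its chunk, followed by one waitIdle() call in place of the four joins. The pool itself is constructed once, outside the timed section.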


To get a proper feel for measurement, you need to repeat the test.

I was doing that by hand :D, rerunning the program a couple of times and inspecting the results.
Cool. Remember to close those handles when you're done. :)