I must be doing something wrong, because for me 4 threads run 2 times slower than 1.
I have 2 vectors with a bunch of data to process. There are no concurrency issues (elements are not moved and are independent of each other), so I just need to calculate some data from one vector and copy the result into the other.
Well, this is not the relevant part. The relevant part is the transformPoints function, the size of your data and your hardware configuration. If your vPointsIn holds something like 100 elements, the overhead of creating the threads will be higher than the benefit from parallel computation. Similarly, if your CPU has only 2 cores, 4 threads will not be faster than 2.
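The two limits mentioned above (too little data per thread, and more threads than cores) can be folded into one small helper. This is only a sketch: `chooseThreadCount` and its `minPerThread` parameter are hypothetical names, and a sensible threshold has to be found by measuring your own transformPoints.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical helper: pick a worker count for `count` items.
// `minPerThread` is an assumed tuning knob (items needed to justify one
// extra thread); find a real value by measurement.
unsigned chooseThreadCount(std::size_t count, std::size_t minPerThread)
{
    assert(minPerThread > 0);

    // hardware_concurrency() may return 0 when the value is unknown.
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0)
        hw = 1;

    // How many threads could each receive at least minPerThread items.
    std::size_t byWork = count / minPerThread;
    if (byWork == 0)
        byWork = 1;

    // Never use more threads than cores, nor more than the data justifies.
    return static_cast<unsigned>(std::min<std::size_t>(hw, byWork));
}
```

With a threshold of, say, 10000 items per thread, a 100-element input would stay on a single thread, while a large input would be capped at the number of hardware threads.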
I would recommend using something like Intel TBB for this kind of task, as it is very smart about using CPU resources: it chooses the number of threads for you and also takes cache locality into account.
The number of points might even increase, and there is a lot more work for the CPU to do besides this, so I was looking for a way to speed things up if possible. I think there are no concurrency issues here, so I don't need to mess with locks, mutexes and whatnot?
Does it change when you rerun the routine in a loop to eliminate fetch issues?
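As long as every thread reads only from the input and writes only to its own slice of the output, no locks are needed. A minimal sketch of that layout follows; the real transformPoints is not shown in this thread, so `transformOne`, `transformRange` and `transformParallel` are placeholder names and `float` stands in for the actual point type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Placeholder for the real per-element work (assumption).
static float transformOne(float x) { return x * 2.0f + 1.0f; }

// Processes in[begin, end) into out[begin, end). Ranges given to
// different threads never overlap, so no synchronization is needed.
static void transformRange(const std::vector<float>& in, std::vector<float>& out,
                           std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i < end; ++i)
        out[i] = transformOne(in[i]);
}

void transformParallel(const std::vector<float>& in, std::vector<float>& out,
                       unsigned threadCount)
{
    assert(in.size() == out.size()); // out must be sized up front
    assert(threadCount > 0);

    const std::size_t n = in.size();
    const std::size_t chunk = (n + threadCount - 1) / threadCount; // round up

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t begin = std::min<std::size_t>(n, t * chunk);
        const std::size_t end = std::min<std::size_t>(n, begin + chunk);
        if (begin == end)
            break; // fewer items than threads: nothing left to hand out
        workers.push_back(std::thread(transformRange, std::cref(in),
                                      std::ref(out), begin, end));
    }
    for (std::size_t i = 0; i < workers.size(); ++i)
        workers[i].join();
}
```

The output vector has to be resized before the threads start; calling push_back on it from several threads would be a data race and would bring the locks back.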
I don't understand this.
How does WaitForMultipleObjects compare with all those joins?
I tried that, but the function fails and GetLastError returns 6 (ERROR_INVALID_HANDLE):
To use native threads, you start them with _beginthreadex(). It returns the thread HANDLE. You can pass an array of those to WaitForMultipleObjects.
To get a proper feel for measurement, you need to repeat the test. I'm suggesting running that test several times within the same program. Just put it in a function, call it ten times or so, and check whether all the runs are around that 2ms / 2.9ms mark.
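A small harness along these lines would do the repetition. It is only a sketch: `timeRuns` is a made-up name, and the work to measure is passed in as a callable.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Runs `work` `runs` times in a row and returns the duration of each
// run in milliseconds, so first-run effects (cold cache, thread start-up)
// can be told apart from steady-state timings.
std::vector<double> timeRuns(const std::function<void()>& work, int runs)
{
    assert(runs > 0);

    std::vector<double> ms;
    ms.reserve(runs);
    for (int i = 0; i < runs; ++i) {
        const std::chrono::steady_clock::time_point start =
            std::chrono::steady_clock::now();
        work();
        const std::chrono::steady_clock::time_point stop =
            std::chrono::steady_clock::now();
        ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}
```

Call it once with the single-threaded version and once with the 4-thread version, then compare the later runs rather than only the first one.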
I found an open source library while reading this article: http://progsch.net/wordpress/?p=81
After studying it a little, I came to the conclusion that thread creation could be the source of the slowness.
I also needed to edit it a bit, since VS2012 doesn't support variadic templates.
Now the results are satisfactory: run this way, 4 threads are always about 3 times faster than 1.
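For readers who have not seen the pattern: the idea is to create the threads once and then feed them tasks, so the creation cost is not paid on every call. The sketch below is not the library from the article; `SimplePool` is a hypothetical, minimal class written without variadic templates, and it assumes tasks do not throw.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal fixed-size thread pool (illustrative sketch only).
class SimplePool {
public:
    explicit SimplePool(unsigned threadCount) : stop_(false), pending_(0)
    {
        assert(threadCount > 0);
        for (unsigned i = 0; i < threadCount; ++i)
            workers_.push_back(std::thread(&SimplePool::workerLoop, this));
    }

    ~SimplePool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        wake_.notify_all();
        for (std::size_t i = 0; i < workers_.size(); ++i)
            workers_[i].join();
    }

    // Queue a task; it runs on one of the already-running workers.
    void enqueue(const std::function<void()>& task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(task);
            ++pending_;
        }
        wake_.notify_one();
    }

    // Block until every task queued so far has finished.
    void waitAll()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        while (pending_ != 0)
            done_.wait(lock);
    }

private:
    void workerLoop()
    {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                while (!stop_ && tasks_.empty())
                    wake_.wait(lock);
                if (tasks_.empty())
                    return; // stop requested and no work left
                task = tasks_.front();
                tasks_.pop();
            }
            task(); // run outside the lock so workers do not serialize
            {
                std::lock_guard<std::mutex> lock(mutex_);
                --pending_;
            }
            done_.notify_all();
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()> > tasks_;
    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable done_;
    bool stop_;
    std::size_t pending_;
};
```

Typical use would be to keep one pool alive for the whole program, enqueue one task per chunk of the input vector, and call waitAll() before reading the results.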
To get a proper feel for measurement, you need to repeat the test.
I was doing that by hand :D, rerunning the program a couple of times and inspecting the results.