Thread much slower than the function itself?

Hi everyone!
I've made a function that runs fast, but when I use a single thread to run it the application slows down from 650 fps to 50-100 fps!

Unfortunately the code I'm using depends on the SFML library, but it's a very simple example and I'm sure you can understand it: I just want to fill the window with a color.

I assure you that the problem is not in the actual displaying of the image but in the function/thread, so what am I doing wrong?

EDIT: I know that threads are meant for parallel computing (doing multiple [independent] things at the same time). In fact my objective is to render this image using 4 threads, one for each quarter of the rows of this simple grey image, but I got a terrible performance drop, so I'm showing you this single-thread version to highlight that.

Here's the code:

#include <SFML/Graphics.hpp>
#include <thread>

using namespace std;

const int H = 720;
const int W = 1280;
sf::Uint8* pixels = new sf::Uint8[W * 800 * 4];

void render(int row_0, int row_1)
{
    for (int ir = row_0; ir < row_1; ir++)
    {
        for (int ic = 0; ic < W; ic++)
        {
            pixels[(ir * W + ic) * 4] = 100;  //red
            pixels[(ir * W + ic) * 4 + 1] = 100;  //green
            pixels[(ir * W + ic) * 4 + 2] = 100;  //blue
        }
    }
}


int main()
{
    sf::RenderWindow window(sf::VideoMode(W, H), "SFML works!");
    window.setFramerateLimit(0);
    sf::Texture texture;   texture.create(W, H);
    sf::Sprite sprite(texture);

    for (int ir = 0; ir < W * H * 4; ir++)
        pixels[ir] = 255; //set alpha to max


    while (window.isOpen())
    {
        thread T1(render, 0, 720);
        T1.join();

        //render( 0, 720);
        texture.update(pixels);
        window.draw(sprite);
        window.display();
    }

    return 0;
}



EDIT: Here is the multi-thread code:
[spoiler]
#include <SFML/Graphics.hpp>
#include <thread>


using namespace std;


const int H = 720;
const int W = 1280;

sf::Uint8* pixels = new sf::Uint8[W * 800 * 4];


void render(int row_0, int row_1)
{

    for (int ir = row_0; ir < row_1; ir++)
    {
        for (int ic = 0; ic < W; ic++)
        {
            pixels[(ir * W + ic) * 4] =       100;    //red
            pixels[(ir * W + ic) * 4 + 1] = 100;    //green
            pixels[(ir * W + ic) * 4 + 2] = 100;    //blue
        }
    }
}


int main()
{
    sf::RenderWindow window(sf::VideoMode(W, H), "SFML works!");
    window.setFramerateLimit(0);
    sf::Texture texture;   texture.create(W, H);
    sf::Sprite sprite(texture);



    for (int ir = 0; ir < W * H * 4; ir++)
    {
        pixels[ir] = 255; //set alpha to max
    }

    while (window.isOpen())
    {
        //thread T1(render, 0, 180);
        //thread T2(render, 180, 360);
        //thread T3(render, 360, 540);
        //thread T4(render, 540, 720);
        //T1.join();
        //T2.join();
        //T3.join();
        //T4.join();

        render(0, 720);
        texture.update(pixels);
        window.draw(sprite);
        window.display();
    }
    return 0;
}

But I repeat that the performance drop is present in both the 4-thread and the 1-thread case.
[/spoiler]


PS: How can I make a spoiler?
thread T1(render, 0, 720); // start a thread T1
T1.join(); // wait for T1's completion 
I'm not sure how cheap it is to create a new thread. You might want to test if reusing the same thread for all loop iterations is faster.

The performance advantage of using threads comes from being able to do multiple things at the same time but in your case you just wait for the thread to finish before doing anything else so the best you can hope for is coming close to the single-threaded program.
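
One way to check how expensive thread creation is on a given machine is to time both variants outside of SFML. The sketch below is only that: fill_grey is a stand-in for render() from the question, and average_microseconds, time_direct_call and time_thread_per_call are hypothetical helper names, not part of any library. The absolute numbers will depend on the OS, the compiler and the optimisation settings.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Stand-in for render() from the question: writes grey RGB values into an
// RGBA buffer and leaves every alpha byte untouched.
void fill_grey(std::vector<std::uint8_t>& pixels)
{
    for (std::size_t i = 0; i + 3 < pixels.size(); i += 4)
    {
        pixels[i] = pixels[i + 1] = pixels[i + 2] = 100;
    }
}

// Hypothetical helper: average wall-clock time of one call to `work`,
// in microseconds, over `iterations` calls.
double average_microseconds(const std::function<void()>& work, int iterations)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        work();
    }
    const std::chrono::duration<double, std::micro> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / iterations;
}

// Case 1: call the function directly on the calling thread.
double time_direct_call(std::vector<std::uint8_t>& pixels, int iterations)
{
    return average_microseconds([&pixels] { fill_grey(pixels); }, iterations);
}

// Case 2: create a new std::thread for every call and join it immediately,
// as the loop in the question does once per frame.
double time_thread_per_call(std::vector<std::uint8_t>& pixels, int iterations)
{
    return average_microseconds(
        [&pixels] {
            std::thread worker(fill_grey, std::ref(pixels));
            worker.join();
        },
        iterations);
}
```

Calling both functions with a buffer of 1280 * 720 * 4 bytes and, say, 100 iterations, then comparing the two averages, gives a rough idea of how much of the frame time goes into creating and joining the thread.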
The performance advantage of using threads comes from being able to do multiple things at the same time but in your case you just wait for the thread to finish before doing anything else so the best you can hope for is coming close to the single-threaded program.
+1

Moving computations to a separate thread only makes sense if you want to keep your "main" thread free for other work (e.g. processing GUI events) in the meantime; it will not "magically" speed up the computation. If all you do is make your "main" thread wait for the completion of the other thread, then using a separate thread is rather pointless. This won't even keep your GUI responsive, as join() is a blocking call!
I know that threads are meant for parallel computing (doing multiple [independent] things at the same time, which is where the speed-up of the application would come from). In fact my objective is to render this image using 4 threads, one for each quarter of the rows of this simple grey image, but I got a terrible performance drop, so I showed you the single-thread version to highlight that.

Here's the multi-thread code:
[spoiler]
#include <SFML/Graphics.hpp>
#include <thread>


using namespace std;


const int H = 720;
const int W = 1280;

sf::Uint8* pixels = new sf::Uint8[W * 800 * 4];


void render(int row_0, int row_1)
{

    for (int ir = row_0; ir < row_1; ir++)
    {
        for (int ic = 0; ic < W; ic++)
        {
            pixels[(ir * W + ic) * 4] =       100;    //red
            pixels[(ir * W + ic) * 4 + 1] = 100;    //green
            pixels[(ir * W + ic) * 4 + 2] = 100;    //blue
        }
    }
}


int main()
{
    sf::RenderWindow window(sf::VideoMode(W, H), "SFML works!");
    window.setFramerateLimit(0);
    sf::Texture texture;   texture.create(W, H);
    sf::Sprite sprite(texture);



    for (int ir = 0; ir < W * H * 4; ir++)
    {
        pixels[ir] = 255; //set alpha to max
    }

    while (window.isOpen())
    {
        //thread T1(render, 0, 180);
        //thread T2(render, 180, 360);
        //thread T3(render, 360, 540);
        //thread T4(render, 540, 720);
        //T1.join();
        //T2.join();
        //T3.join();
        //T4.join();

        render(0, 720);
        texture.update(pixels);
        window.draw(sprite);
        window.display();
    }
    return 0;
}

[/spoiler]

But I repeat that the performance drop is present in both the 4-thread and the 1-thread case.
As others have pointed out already, starting a new thread as well as joining a thread has a certain overhead. So, if the total time required for the computation is pretty short, then this overhead can be quite significant! The longer the actual computation takes, the smaller the thread start/join overhead becomes in relative terms...

(Using a thread pool could help to reduce the thread management overhead)
Mm, sure, creating a thread has a cost in computation time, but I was quite sure that using more threads would speed up the application.

My algorithm simply fills an array with 100s, so I supposed that giving each thread 1/4 of the work would get the work done faster... am I wrong?

By the way, I'm using a 6C/12T 10th-gen i5 CPU.
It is not easy to give a definitive answer. If you can split the computation evenly across 4 threads, then each thread will only have to do 1/4 of the work and therefore each thread will be able to complete its part of the computation in roughly 1/4 of the time that a single thread would take. But then again, starting 4 threads and joining with 4 threads takes some additional time! Depending on how long the actual computation takes, the overhead for starting/joining 4 threads may be either significant (prohibitive) or negligible...
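
To make that trade-off concrete, here is a rough cost model as a sketch. It is not a measurement: estimated_time_us is a hypothetical name, and the model simply assumes the work splits evenly and that each thread costs a fixed start/join overhead that the main thread pays one thread after another.

```cpp
#include <cassert>

// Rough cost model, not a measurement: the work is split evenly across
// `threads` threads, and starting plus joining each thread costs a fixed
// `overhead_per_thread_us` on the main thread.
double estimated_time_us(double work_us, int threads, double overhead_per_thread_us)
{
    assert(threads >= 1);
    return work_us / threads + threads * overhead_per_thread_us;
}
```

With made-up inputs of 4 threads and 50 microseconds of overhead per thread, a 10000 microsecond computation comes out at 2700 microseconds, which is close to the ideal 4x. A 100 microsecond computation comes out at 225 microseconds, which is slower than just calling the function directly.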

Again: You could try using a thread pool, i.e. create the threads once and then re-use them whenever you need to run the computation (instead of creating new threads every time), and see if that helps.

Another thing to consider: With 4 parallel threads the memory access pattern may be less "optimal" (e.g. with respect to caching), compared to a single thread that reads/writes the data in a strictly linear fashion!

After all, I think you should do some more measurements. Anything else will be speculation...
My function is doing very little here: 1280x720x3 array writes is a very small task..

..but still, I'm only running it in 1 thread on a fairly good CPU, so the -85% in speed (using fps as a rough measure of performance) still sounds a bit strange to me.. as a noob.

What can I do to speed up this simple code? Sigh, I really thought that dividing the screen among threads would literally double or quadruple the framerate! (It's for a future project.)
My function is doing very little here: 1280x720x3 array writes is a very small task..

The smaller the task, the less it will be able to benefit from splitting the work across multiple threads, and the more detrimental the thread creation (and join) overhead will be 🙄

See the previous posts for suggestions on how you may be able to improve things...
_________________

Maybe the loop could be slightly simplified to:
const int SIZE = W * 800 * 4;
sf::Uint8 *pixels = new sf::Uint8[SIZE];

void render()
{
    for (int i = 0; i < SIZE; i += 4)
    {
        pixels[i+2] = pixels[i+1] = pixels[i] = 100;
    }
}
I don't know anything about SFML, but it's possible that sf::RenderWindow already has some concurrency or other optimisations built in. But as others have mentioned, maybe not: some things are better left in a single thread. One reason that we have these libraries is that their routines are already well optimised; don't try to "improve" them. One of the things it probably is doing is running CUDA code on the GPU.
SFML's graphics framework is a wrapper over OpenGL, so it wouldn't be doing anything with CUDA (I know you said you didn't know much about SFML so I'm not trying to call you out, just explaining it for others).

Also, none of the posts in this thread, when taken verbatim, are correctly using threading.

I question the initial goal. Individually writing to every single pixel, every single frame/loop iteration seems awfully wasteful. It's not clear what you are actually trying to do. If you are trying to implement some animation system, I'm sure there are general tutorials and SFML-specific tutorials for animation.

Threading (or shaders, or CUDA/OpenCL parallelism) would probably show more of a benefit if the calculation of each pixel actually involved meaningful work, and wasn't just a memcpy of 100. And +1 to what kigar said about thread pools. It might be that creating those threads every single frame is, by itself, where the overhead comes from.
The smaller the task, the less it will be able to benefit from splitting the work across multiple threads, and the more detrimental the thread creation (and join) overhead will be 🙄

This is Amdahl's law:
https://en.wikipedia.org/wiki/Amdahl%27s_law
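
For reference, the formula from that article can be written as a small function. This is a sketch: amdahl_speedup is a hypothetical name, and nobody in this thread measured what fraction of a frame is actually parallelisable in the question's program.

```cpp
#include <cassert>

// Amdahl's law: upper bound on the overall speedup when a fraction
// `parallel_fraction` of the run time is spread over `workers` workers
// and the remaining part stays serial.
double amdahl_speedup(double parallel_fraction, int workers)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(workers >= 1);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers);
}
```

For example, if only half of the frame time can be split across 4 workers, the bound is 1 / (0.5 + 0.125) = 1.6x rather than 4x. In the question's loop, texture.update(), window.draw() and window.display() all stay on the main thread, so they count towards the serial part.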

BTW, frames-per-second is the reciprocal of the measure you're actually interested in. Consider working with frame time, seconds-per-frame, instead.
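
A minimal sketch of that conversion (frame_time_ms is a hypothetical helper name):

```cpp
#include <cassert>

// Convert a frame rate in frames per second into the time spent on one
// frame, in milliseconds.
double frame_time_ms(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}
```

Applied to the figures in the question, 650 fps is about 1.54 ms per frame, 100 fps is 10 ms and 50 fps is 20 ms. Seen that way, the threaded version adds roughly 8.5 to 18.5 ms to every frame.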
As already mentioned, the threads should be created only once, at the beginning. Since those threads start their work immediately, you need to introduce some waiting, which can be done with a condition_variable. See:

http://www.cplusplus.com/reference/condition_variable/condition_variable/?kw=condition_variable

Maybe something like this:
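#include <SFML/Graphics.hpp>
#include <condition_variable> // std::condition_variable
#include <mutex>              // std::mutex, std::unique_lock
#include <thread>

using namespace std;

const int H = 720;
const int W = 1280;
const int NUM_THREADS = 4;

sf::Uint8* pixels = new sf::Uint8[W * H * 4];

// Shared state, only read or written while holding mtx
std::mutex mtx;
std::condition_variable cv_start; // main -> workers: a new frame was requested
std::condition_variable cv_done;  // workers -> main: all rows are filled
int frame = 0;     // incremented by main for every frame to render
int pending = 0;   // workers that have not finished the current frame yet
bool done = false; // set by main when the workers should exit

void render(int row_0, int row_1)
{
    int seen = 0; // number of the last frame this worker has rendered
    while (true)
    {
        {
            // Sleep until main requests a frame this worker has not rendered yet
            std::unique_lock<std::mutex> lck(mtx);
            cv_start.wait(lck, [&] { return done || frame != seen; });
            if (done) return;
            seen = frame;
        }

        for (int ir = row_0; ir < row_1; ir++)
        {
            for (int ic = 0; ic < W; ic++)
            {
                pixels[(ir * W + ic) * 4] = 100;     //red
                pixels[(ir * W + ic) * 4 + 1] = 100; //green
                pixels[(ir * W + ic) * 4 + 2] = 100; //blue
            }
        }

        {
            // Report completion; the last worker to finish wakes up main
            std::unique_lock<std::mutex> lck(mtx);
            if (--pending == 0) cv_done.notify_one();
        }
    }
}

int main()
{
    sf::RenderWindow window(sf::VideoMode(W, H), "SFML works!");
    window.setFramerateLimit(0);
    sf::Texture texture;   texture.create(W, H);
    sf::Sprite sprite(texture);

    for (int ir = 0; ir < W * H * 4; ir++)
    {
        pixels[ir] = 255; //set alpha to max
    }

    // Create the worker threads once; each one handles a quarter of the rows
    thread T1(render, 0, 180);
    thread T2(render, 180, 360);
    thread T3(render, 360, 540);
    thread T4(render, 540, 720);

    while (window.isOpen())
    {
        // Handle the close button so that the loop can end
        sf::Event event;
        while (window.pollEvent(event))
        {
            if (event.type == sf::Event::Closed) window.close();
        }

        {
            // Start one frame on all workers and wait until every worker is done
            std::unique_lock<std::mutex> lck(mtx);
            pending = NUM_THREADS;
            ++frame;
            cv_start.notify_all();
            cv_done.wait(lck, [] { return pending == 0; });
        }

        texture.update(pixels);
        window.draw(sprite);
        window.display();
    }

    {
        // Tell the workers to exit
        std::unique_lock<std::mutex> lck(mtx);
        done = true;
        cv_start.notify_all();
    }

    T1.join();
    T2.join();
    T3.join();
    T4.join();

    delete[] pixels;
    return 0;
}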
Not tested!

It is not guaranteed that this is faster, since the thread handling consumes time as well.