SDL bad performance

Hey!
I’m testing out SDL and made the following test project to see the performance. I want to make an array of points randomly on the screen, then for every pixel in every frame, find the nearest point and set the pixel’s color to that point’s color.
As is probably obvious, I’m new to C++ and used to higher-level languages.
It works, but I’m getting about 1fps. Any obvious mistakes I made? Is this expected speed for the amount of operations I’m doing (50 points, 680x480 pixels)?

Also, if you have any other advice for me, I'd appreciate it <3

Thanks in advance.

main.h:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
#include <SDL2/SDL.h>
#include <stdio.h>
#include <iostream>
#include <iomanip>
#include <vector>
#include <cstring>
#include <cmath>

const int pointsN = 50;

union Color
{
    struct
    {
        uint8_t r, g, b, a;
    } color;
    uint32_t color_int;
};

struct Point
{
    int x;
    int y;
    Color color;
};

void set_pixel(SDL_Surface *surface, int x, int y, Uint32 pixel)
{
    Uint32 *const target_pixel = (Uint32 *)((Uint8 *)surface->pixels + y * surface->pitch + x * surface->format->BytesPerPixel);
    *target_pixel = pixel;
}

float dist(int x1, int y1, int x2, int y2)
{
    return sqrt(pow(x1 - x2, 2) + pow(y1 - y2, 2));
}

Point nearestPoint(struct Point (& points)[pointsN], int x, int y)
{
    float smallestSize = -1;
    float thisDist;
    int closestIndex;
    for (size_t i = 0; i < pointsN; i++)
    {
        thisDist = dist(x, y, points[i].x, points[i].y);
        if(thisDist < smallestSize || smallestSize == -1)
        {
            closestIndex = i;
            smallestSize = thisDist;
        }
            
        
    }
    return points[closestIndex];
}


main.cpp
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
#include "main.h"

int main(int argv, char **args)
{
    if (SDL_Init(SDL_INIT_EVERYTHING) != 0)
    {
        printf("Error");
        return 0;
    }

    const int w = 680;
    const int h = 480;

    SDL_Window *window = SDL_CreateWindow("SDL2 Window",
                                          SDL_WINDOWPOS_CENTERED,
                                          SDL_WINDOWPOS_CENTERED,
                                          w, h,
                                          0);
    if (!window)
    {
        std::cout << "Failed to create window\n";
        return -1;
    }

    /*SDL_Surface *window_surface = SDL_GetWindowSurface(window);

    if (!window_surface)
    {
        std::cout << "Failed to get the surface from the window\n";
        return -1;
    }
    */

    SDL_Renderer *renderer = SDL_CreateRenderer(window, -1, SDL_RENDERER_ACCELERATED);

    SDL_Texture *texture = SDL_CreateTexture(renderer, SDL_PIXELFORMAT_ABGR8888, SDL_TEXTUREACCESS_STREAMING, w, h);

    Uint32 pixels[w*h];

    Point points[pointsN];

    for (size_t i = 0; i < pointsN; i++)
    {
        Point newpoint;
        newpoint.x = rand() % w;
        newpoint.y = rand() % h;
        newpoint.color.color.r = rand() % 255;
        newpoint.color.color.g = rand() % 255;
        newpoint.color.color.b = rand() % 255;
        newpoint.color.color.a = 255;

        points[i] = newpoint;
    }

    SDL_Event e;
    bool keep_window_open = true;

    unsigned int frames = 0;
    Uint64 start = SDL_GetPerformanceCounter();

    //bool mouseDown = false;
    while (keep_window_open)
    {

        SDL_SetRenderDrawColor(renderer, 0, 0, 0, SDL_ALPHA_OPAQUE);
        SDL_RenderClear(renderer);

        while (SDL_PollEvent(&e) > 0)
        {
            switch (e.type)
            {
            case SDL_QUIT:
                keep_window_open = false;
                break;
            }
        }

        for (size_t y = 0; y < h; y++)
        {
            for (size_t x = 0; x < w; x++)
            {
                /*const unsigned int offset = (texWidth * 4 * y) + x * 4;
                pixels[offset + 0] = rand() % 256;     // b
                pixels[offset + 1] = rand() % 256;     // g
                pixels[offset + 2] = rand() % 256;     // r
                pixels[offset + 3] = SDL_ALPHA_OPAQUE; // a
                */
                pixels[(w * y) + x] = nearestPoint(points, x, y).color.color_int;
            }
        }

        for (size_t i = 0; i < pointsN; i++)
        {
            points[i].x += (rand() % 50 + 1) - 25;
            points[i].y += (rand() % 50 + 1) - 25;
        }

        SDL_UpdateTexture(
            texture,
            NULL,
            pixels,
            w * sizeof(Uint32));

        SDL_RenderCopy(renderer, texture, NULL, NULL);
        SDL_RenderPresent(renderer);


        frames++;
        const Uint64 end = SDL_GetPerformanceCounter();
        const static Uint64 freq = SDL_GetPerformanceFrequency();
        const double seconds = (end - start) / static_cast<double>(freq);

        printf("FPS: %f\n", frames / seconds);

        /*SDL_UpdateTexture(texture, NULL, pixels, w * sizeof(Uint32));

        SDL_RenderClear(renderer);
        SDL_RenderCopy(renderer, texture, NULL, NULL);
        SDL_RenderPresent(renderer);
*/
        //SDL_UpdateWindowSurface(window);
    }

    //delete[] pixels;
    SDL_DestroyTexture(texture);
    SDL_DestroyRenderer(renderer);

    SDL_Quit();
    return 0;
}
One trick you can try is to avoid using the square root and power functions, you just need to know what magnitude the functions would return rather than an explicit value of distance.

1
2
3
4
5
float dist(int x1, int y1, int x2, int y2)
{
    return ((x1 - x2)*(x1-x2) + (y1 - y2)*(y1-y2));
}


So this speeds it up to about 11 frames per second on my machine. Next thing is from line 78 to 90 in main try to un-nest some of the loops if you can... could you kill the nearestPoint function altogether and use the information from the two available loops in main()?
Last edited on
several things along the same vein.
instead of 3 calls to rand for color, make 1 call to get one int in range (0-256^3) and peel off the bytes.

nearest point ... whatever you are doing ... I didn't look closely but there is likely a way to sort the data (one time!) to get what you want rather than over and over look for what is nearest.
you can also maybe pregenerate random values -- calling rand or <random> stuff can be slow in a loop. Iterating a large shuffled array gets rid of that.
Last edited on
The most important thing is to compile it with optimizations (e.g., -O on g++) As noted, you can use integer math and avoid the floating point functions. I would also remove the extra comparison in the innermost loop (the || smallestSize == -1 part).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
int dist_squared(int x1, int y1, int x2, int y2)
{
    return (x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2);
}

Point nearestPoint(Point *points, int x, int y)
{
    int closestIndex = 0;
    int smallestSize = dist_squared(x, y, points[0].x, points[0].y);
    for (int i = 1; i < pointsN; i++)
    {
        int thisDist = dist_squared(x, y, points[i].x, points[i].y);
        if (thisDist < smallestSize)
        {
            closestIndex = i;
            smallestSize = thisDist;
        }
    }
    return points[closestIndex];
}


@jonnin,
instead of 3 calls to rand for color,..

That stuff is not in the main loop so it doesn't really matter.
Last edited on
I didn't run the code, but you're describing a voronoi diagram, right?
Calculating a distance measure at every pixel is going to be slow. There's an O(n log n) algorithm to generate a Voronoi diagram, where n is the number of points in the graph. The end result, I think, is that you are then able to draw triangles instead of individual pixels.
https://en.wikipedia.org/wiki/Voronoi_diagram#Algorithms

I have also seen people write GPU-side shaders to alleviate the CPU work.

As others have mentioned, optimizations can make a big difference when it comes to number crunching. If the code is formed well-enough, an optimizing compiler may be able to output assembly that takes advantage of SIMD instructs. Sometimes you can help the compiler a bit to encourage it to see the code in a vectorizable form.
https://easyperf.net/blog/2017/11/10/Tips_for_writing_vectorizable_code

IDK how outdated this information is, but see: https://gcc.gnu.org/projects/tree-ssa/vectorization.html
Vectorization is enabled by the flag -ftree-vectorize and by default at -O3. To allow vectorization on powerpc* platforms also use -maltivec. On i?86 and x86_64 platforms use -msse/-msse2. To enable vectorization of floating point reductions use -ffast-math or -fassociative-math.
Last edited on
I'm betting you would see another massive jump in performance if you follow Ganado's suggestion of moving the calculation into the graphics card and doing the calculations on a group (perhaps much better depending on the limits of your card and system bus). I don't think I can implement that to test today, but the code that was running at 11 fps with a default gcc compile for me became 43 fps with g++ -O3, which is an incredible boost as well.

Edit: finally, switching from float to ints bumped it up to 73 fps...
Last edited on
It's not just a matter of moving the calculation to the GPU. It's a matter of moving from something like an n-cubed algorithm to an n log n algorithm.

became 43 fps with g++ -O3

You should try -O and -O2 also. -O3 is not always fastest in my experience (although I'm not sure why).
Hmm... side quest here, do you guys think memoization could be applied in this instance? It feels like it could be since x is always between 0 and w, y is always between 0 and h... I'm getting ready for work right now, so I'm out of time to think about it...
I don't really see what you mean. Any given pixel could potentially become a different color on the next iteration.
The simplest thing would be to bake in the possible values of dist_squared(), 640x480=307,200 possible answers, hopefully small enough to always be in L1 or L2 cache. Seeing as its called about 15 million times per frame it might be nice to skip those multiplication calls (though would a fetch even from L1 be faster? I'm not sure.) I was also hoping to bake in all 50 checks against the positions of the voronoid, but as with the colors that is also randomly chosen. Not to mention changing randomly per frame. So yeah, it looks like only dist_squared() would be worth the cost, if it even helps to begin with.

I still haven't checked out that triangle algorithm yet, and haven't been able to do any further tests. Just on lunch break now.

Edit: Nope, just got back to my own computer and realized that dist_squared() takes four parameters, which squares(?) the number of possible answers in the table. And the V-diagram drifts apart as the animation goes longer, so things get worse as time goes on. Shucks.
Last edited on
a lookup table has to be faster than computing it. what is the lookup for that distance, how do you get it down to just an index? I think the answer unfortunately is as much or more work than just crunching it, and the math leaves the cache for something else. But I could be wrong.
Last edited on
This algorithm may be of interest:

a constant time algorithm on GPU to compute an approximation to the Voronoi diagram of a given set of seeds in a 2D grid. The errors due to the differences between the approximation and the actual Voronoi diagram are hardly noticeable to the naked eye in all our experiments.
https://www.comp.nus.edu.sg/~tants/jfa/i3d06.pdf
Topic archived. No new replies allowed.