I know this is a loaded question so I'm just looking for general advice on what areas of programming I need to study. I already have a good grasp of multithreading but I don't think that alone is a solution.
Let me give an example that is also the reason I'm making this thread:
I wrote a media manager/viewer/player. It had a nice GUI and everything was functional, using worker threads to keep the UI responsive. Then I downloaded Google's Picasa, and in comparison my program was so slow. Picasa displays thumbnails for multiple folders instantly, and when you start Picasa the folder view is filled immediately, all with a super fast startup time. Whereas in my program I have to sit and watch as each item (tree items and thumbnails) is added one by one, because I don't know what to do besides enumerate this information one by one.
So yeah, I will greatly appreciate any advice I can get here...
Picasa displays thumbnails for multiple folders instantly
I don't currently use Picasa, so I cannot comment too much on how it works.
However, a lot of the impression of high performance may be due to pre-caching of thumbnails, as well as other information, in a separate file or files. The alternative to caching is to regenerate the information each time: opening an image file and extracting or generating a thumbnail is fine for a single image, but doing it for large numbers of images requires a lot of processing.
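The idea, as a minimal sketch (the Thumbnail struct and GenerateThumbnail here are just placeholders for whatever format and decoder you use, and a real cache would live on disk rather than in a std::map):

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical thumbnail payload: raw pixels or an encoded JPEG blob.
struct Thumbnail { std::vector<unsigned char> bytes; };

// Cache key includes the last-write time so stale entries get regenerated.
struct CacheKey {
    std::wstring path;
    long long lastWriteTime;
    bool operator<(const CacheKey& o) const {
        if (path != o.path) return path < o.path;
        return lastWriteTime < o.lastWriteTime;
    }
};

// Stand-in for the slow path: open the file, decode it, scale it down.
Thumbnail GenerateThumbnail(const std::wstring& /*path*/) { return Thumbnail(); }

Thumbnail GetThumbnail(const std::wstring& path, long long lastWriteTime,
                       std::map<CacheKey, Thumbnail>& cache)
{
    CacheKey key{ path, lastWriteTime };
    auto it = cache.find(key);
    if (it != cache.end())
        return it->second;                 // hit: no decoding at all
    Thumbnail t = GenerateThumbnail(path); // miss: pay the cost once
    cache[key] = t;                        // then keep it for next time
    return t;
}
```

Once the cache is warm, showing a folder is just a series of lookups instead of a series of decodes.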
Picasa does have a database file that contains thumbnails, but that doesn't explain how the thumbnails always seem to be instantly ready. Even with a database file, you'd still have to process the images individually to insert them into the UI, right?
I have been wondering if memory-mapped data files are the best thing to use here. To be honest, I'm not really sure what memory-mapped files are actually good for.
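From what I can tell, mapping a file with the Win32 API looks roughly like this (thumbs.cache is just a made-up example file), but I don't see yet how it would actually speed things up:

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    // Open the (hypothetical) cache file for read-only access.
    HANDLE file = CreateFileW(L"thumbs.cache", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    // Create a read-only mapping object covering the whole file.
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!mapping) { CloseHandle(file); return 1; }

    // Map a view of the file into the process address space.
    // Pages are only read from disk when you actually touch them.
    const unsigned char* data =
        static_cast<const unsigned char*>(MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
    if (data)
    {
        std::printf("first byte: %u\n", data[0]);
        UnmapViewOfFile(data);
    }
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}
```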
@htirwin
I made the huge mistake of using GDI+ for this. I think if I revisit this project I'll most definitely switch to something that's hardware accelerated.
... the thumbnails seem to always be instantly ready.
I guarantee that this application is using some form of pre-caching. There usually isn't any magic involved with stuff like this, cheaters do in fact prosper.
I really doubt your render code is the bottleneck -- even using software blitting, you should be able to render dozens of small images in the time it takes to load just one from the hard disk. Optimizing your renderer probably won't do much (though it's a good idea to use hardware acceleration if you can).
You may be able to get some extra speed with threading. Make a stack of all the images you need to load, and create a pool of threads which pop a filename off the stack, load the image, and push a pointer to it onto a different stack. Then your render thread just polls the pointer stack and every time a new one gets pushed, it pops it off, draws the image and then goes back to polling.
I don't know if that will really make much difference though since the hard disk can only really read from one location at a time anyway.
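Something along these lines, as a rough sketch with std::thread (LoadImageFile stands in for whatever decoder you're actually using, and the render thread here just counts images instead of drawing them):

```cpp
#include <condition_variable>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct LoadedImage { std::string path; /* decoded pixels would go here */ };

// Stand-in for the real decoder (e.g. your GDI+ loading code).
std::unique_ptr<LoadedImage> LoadImageFile(const std::string& path)
{
    return std::unique_ptr<LoadedImage>(new LoadedImage{ path });
}

int main()
{
    std::vector<std::string> pending = { "a.jpg", "b.jpg", "c.jpg" }; // work stack
    std::vector<std::unique_ptr<LoadedImage>> ready;                  // results stack
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Pool of loader threads: pop a filename, decode it, push the result.
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i)
    {
        pool.emplace_back([&] {
            for (;;)
            {
                std::string path;
                {
                    std::lock_guard<std::mutex> lock(m);
                    if (pending.empty()) return;
                    path = pending.back();
                    pending.pop_back();
                }
                auto img = LoadImageFile(path); // slow part runs outside the lock
                {
                    std::lock_guard<std::mutex> lock(m);
                    ready.push_back(std::move(img));
                }
                cv.notify_one();
            }
        });
    }

    // "Render" thread stand-in: wakes up whenever a new image is ready.
    std::thread render([&] {
        size_t drawn = 0;
        std::unique_lock<std::mutex> lock(m);
        while (!done || drawn < ready.size())
        {
            cv.wait(lock, [&] { return drawn < ready.size() || done; });
            while (drawn < ready.size())
                ++drawn; // here you would draw ready[drawn] instead
        }
    });

    for (auto& t : pool) t.join();
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_all();
    render.join();
    return 0;
}
```

The point is that the decoding happens outside the lock, so the UI thread only ever touches finished images.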
My guess is that Picasa is using embedded JPEG thumbnails. I don't know if this completely accounts for the speedup, though, as I don't know how often encoders (cameras, scanners, etc.) actually embed a thumbnail.
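If you want to test that theory, GDI+'s Image::GetThumbnailImage is supposed to use the embedded thumbnail when one is present; something like this (photo.jpg is just an example path, and I haven't measured how much it actually saves):

```cpp
#include <windows.h>
#include <gdiplus.h>
#pragma comment(lib, "gdiplus.lib")

int main()
{
    // Standard GDI+ startup boilerplate.
    Gdiplus::GdiplusStartupInput startupInput;
    ULONG_PTR token = 0;
    Gdiplus::GdiplusStartup(&token, &startupInput, nullptr);
    {
        // Example image path.
        Gdiplus::Image image(L"photo.jpg");

        // Ask GDI+ for a 160x120 thumbnail; if the JPEG carries an embedded
        // EXIF thumbnail, this can avoid decoding the full-size image.
        Gdiplus::Image* thumb = image.GetThumbnailImage(160, 120, nullptr, nullptr);
        if (thumb)
        {
            // ... draw or cache the thumbnail here ...
            delete thumb;
        }
    } // Image must be destroyed before GdiplusShutdown.
    Gdiplus::GdiplusShutdown(token);
    return 0;
}
```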
I don't use Picasa, but on most users' systems Explorer is already creating thumbnails. Is it possible that it is just using the existing Thumbs.db database?
Well, for starters: C/C++. You need to know how to optimize your program, and in C/C++ that doesn't always mean "more threads". A lot of the time it means reducing the number of instructions in your algorithms, or reducing the number of variables you're using at any one time.
I adamantly stick to the rule of 'one thread per core' for every application. So on my 8 core CPU, it speeds up execution 8x. If I run code through my GPU, I get about an 1100x speed increase.
Also using 64 bit variables wherever you can will increase speed by 64x, though unfortunately it increases your program size by 64x also. See, it's always a tradeoff.
I adamantly stick to the rule of 'one thread per core' for every application. So on my 8 core CPU, it speeds up execution 8x. If I run code through my GPU, I get about an 1100x speed increase.
Waaaaat?
Also using 64 bit variables wherever you can will increase speed by 64x
Waaaaat?
though unfortunately it increases your program size by 64x also. See, it's always a tradeoff.
I adamantly stick to the rule of 'one thread per core' for every application. So on my 8 core CPU, it speeds up execution 8x. If I run code through my GPU, I get about an 1100x speed increase.
Well, this increase isn't always guaranteed since it depends on which core the OS puts it on.
Also using 64 bit variables wherever you can will increase speed by 64x
I think he meant that using variables that match the system's native word size (for lack of a better term) can speed up the execution of the program.
though unfortunately it increases your program size by 64x also. See, it's always a tradeoff.
I honestly don't know what he meant here. Maybe that the variables take up more space due to their larger size.
Even if the OS schedules it perfectly, it probably won't scale linearly.
Also, it's not guaranteed to be faster just by using a certain sized variable. The compiler generally provides types that pick the fastest size for you. For systems implementing stdint.h, those would be the [u]int_fast{8,16,32,64}_t types. However, note that in some cases they're exactly the same size as their exact-width counterparts.
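For example, a quick check of what the fast types actually are on your platform (the results differ between compilers and ABIs):

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // On glibc/x86-64, int_fast16_t and int_fast32_t are typically 8 bytes;
    // other ABIs make different choices -- same code, different answers.
    std::printf("int16_t:      %zu bytes\n", sizeof(std::int16_t));
    std::printf("int_fast16_t: %zu bytes\n", sizeof(std::int_fast16_t));
    std::printf("int32_t:      %zu bytes\n", sizeof(std::int32_t));
    std::printf("int_fast32_t: %zu bytes\n", sizeof(std::int_fast32_t));
    return 0;
}
```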
As a side note, targeting x86-64 instead of x86 can sometimes increase performance for code with heavy integer crunching, because of the extra general purpose registers. The exact results will vary, of course, but I've seen programs get 20% faster by just rebuilding.