Using multiple cores

I'm just looking for some quality information about writing code that uses multiple cores. I'm not sure where to start. Initially, I want to be able to parse data from a text file into a vector much faster than I currently am.

My understanding is that using more cores for that process will definitely make it quicker. Any help with doing so would be much appreciated.

Thanks
First, you need to figure out where the bottlenecks really are before you get any idea in your head like "I know, I'll use threads".

vector<string> lines;
lines.reserve(1000000);   // reserve capacity up front; don't pre-fill with a million empty strings
ifstream ifs("monster.txt");
string line;
while (getline(ifs, line)) lines.push_back(line);
cout << lines.size() << '\n';

Do this for a file containing many millions of lines. You want it to take at least 30 seconds of real time, so keep doubling the size of monster.txt until it does.

While it's running, watch the CPU and disk usage in your "task manager".

Unless the CPU is solidly nailed to 100%, threads are not going to help you.

If your disk is a spinning disk, then that will be your bottleneck for sure.
SSDs are better, but still slow compared to the speed at which a CPU talks to DRAM.
My intuition says threading doesn't even help with disk access, since there's only one disk to read from. Maybe that's different with SSDs, but that seems unlikely to me.

I think the fastest way to handle something like this would be to memory-map the file, so the raw bytes show up as one block of memory, and then parse that memory using threads.
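
Very roughly, a minimal sketch of that idea might look like the following, assuming POSIX mmap (Windows would use CreateFileMapping/MapViewOfFile instead); the filename and the per-thread work are only placeholders:

// Minimal sketch, assuming POSIX mmap: map the whole file read-only, then let
// worker threads scan disjoint byte ranges of it. Error handling is minimal and
// the per-thread "work" is only a placeholder for real parsing.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <thread>
#include <vector>

int main() {
    int fd = open("monster.txt", O_RDONLY);          // placeholder filename
    if (fd < 0) return 1;
    struct stat sb;
    if (fstat(fd, &sb) < 0 || sb.st_size == 0) return 1;
    std::size_t len = static_cast<std::size_t>(sb.st_size);

    void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    const char* data = static_cast<const char*>(p);

    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        std::size_t begin = len * t / n, end = len * (t + 1) / n;
        workers.emplace_back([data, begin, end] {
            // Real code would first advance 'begin' to the next '\n' so no line
            // is split between two workers, then parse data[begin, end).
            for (std::size_t i = begin; i < end; ++i) {
                volatile char c = data[i];           // placeholder: just touch each byte
                (void)c;
            }
        });
    }
    for (auto& w : workers) w.join();

    munmap(p, len);
    close(fd);
}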
Threads usually fire up additional CPU cores, and to get that to work well you need a task that can be divided across cores. A good example of this is parsing a file where each line is one entity. A bad example of this is XML, where blocks are random sizes and many lines make up a chunk that you want to examine -- there isn't a good way to send chunks off to other processors. Binary files with fixed-size records are really efficient with threads.
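
As a rough sketch of that kind of per-line split, assuming the lines are already loaded into a vector and that parse_line is only a hypothetical stand-in for the real per-record work:

// Sketch: parse lines that are already in memory, with each thread handling an
// interleaved subset of them. parse_line() is a hypothetical placeholder.
#include <algorithm>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

int parse_line(const std::string& s) { return static_cast<int>(s.size()); } // placeholder

void parse_all(const std::vector<std::string>& lines, std::vector<int>& out) {
    out.resize(lines.size());
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) {
        pool.emplace_back([&, t] {
            // Strided split: thread t handles lines t, t+n, t+2n, ...; each element
            // of 'out' is written by exactly one thread, so no locking is needed.
            for (std::size_t i = t; i < lines.size(); i += n)
                out[i] = parse_line(lines[i]);
        });
    }
    for (auto& th : pool) th.join();
}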

Large files, as stated, are usually best memory-mapped. I don't know if it makes much difference for small files.

Threading can help with file I/O if the disk system supports it. If not, you just get a lot of CPUs waiting on the disk, without any real gains. This is hardware driven.

Threads cost a little to create and destroy. Using them on too small a problem is actually a performance hit rather than a gain. E.g. if you wanted to add 8 numbers together into 4 results, adding them in threads is slower than just doing it on one CPU, even though it splits the work.
The sweet spot there is hardware dependent.
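
A hedged micro-benchmark sketch of that point (exact timings vary wildly by hardware; the work is deliberately tiny so thread start-up cost dominates):

// Sketch: add 8 numbers into 4 results, once directly and once with threads.
// On typical hardware the threaded version loses badly, because creating and
// joining threads costs far more than four additions.
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, r[4];

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 4; ++i) r[i] = a[2 * i] + a[2 * i + 1];
    auto t1 = std::chrono::steady_clock::now();

    std::thread w[4];
    for (int i = 0; i < 4; ++i)
        w[i] = std::thread([&, i] { r[i] = a[2 * i] + a[2 * i + 1]; });
    for (auto& t : w) t.join();
    auto t2 = std::chrono::steady_clock::now();

    std::printf("serial: %lld ns, threaded: %lld ns\n",
        (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count());
}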

I guess what you may be asking can be answered if you look up the terms "I/O bound" vs "CPU bound". CPU-bound problems excel with threads; I/O-bound problems need more analysis.
Thanks for the information. I wasn't sure if multiple cores would help with the parsing, since it was reading from a disk. In my mind the idea was, "oh, if core 1 parses line 1 from the text while core 2 processes line 2, etc., then it would parse the whole file so much quicker".

I wasn't entirely sure how that would look, and even if it would be effective. But it seems it's not worth the extra effort, unless I'm mistaken on that as well.

The files I'm parsing are many millions of lines, but I just wasn't sure whether the way I was doing it was the most effective.
Again, it depends. If the parsing is line by line, and it's very slow, then breaking it up has merits. Before bothering with this, you need to prove to yourself that your file is being read faster than it is being parsed. If the parser is mostly waiting on the next line of the file, threading won't help. If you prove that, you still are not ready to thread up: you should take a quick pass at making sure the file reader is going as fast as it can, too. If you can't get the file reader faster and your parsing is slower than reading, then it's time to look at the threaded approach described above.
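
One rough way to check that, assuming the parsing is line by line and parse_line stands in for the real work, is to time the same loop twice, once reading only and once reading plus parsing:

// Sketch: compare "read only" time against "read + parse" time for the same file.
// If the two are nearly equal, the parser is not the bottleneck and threads won't help.
// Run it twice (or swap the order) so the OS file cache affects both passes equally.
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>

int parse_line(const std::string& s) { return static_cast<int>(s.size()); } // placeholder

double time_pass(const char* path, bool parse) {
    std::ifstream ifs(path);
    std::string line;
    long long sink = 0;
    auto t0 = std::chrono::steady_clock::now();
    while (std::getline(ifs, line))
        if (parse) sink += parse_line(line);
    auto t1 = std::chrono::steady_clock::now();
    volatile long long keep = sink;
    (void)keep;                        // prevent the parse work from being optimized away
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    std::cout << "read only:    " << time_pass("monster.txt", false) << " s\n";
    std::cout << "read + parse: " << time_pass("monster.txt", true)  << " s\n";
}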
Some implementations of C++ streams are very slow. Consider using C I/O instead (fopen, fclose, fread, fwrite). But make sure you give it a large enough buffer. You could also use open/close/read/write, or whatever the OS's I/O interface is.
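
A rough sketch of that approach using fopen/fread with an enlarged stdio buffer via setvbuf (the buffer sizes here are only illustrative, not tuned):

// Sketch: read the file with C stdio and a large user-supplied buffer,
// splitting on '\n' ourselves. Buffer sizes are illustrative choices.
#include <cstdio>
#include <cstddef>
#include <string>
#include <vector>

std::vector<std::string> read_lines(const char* path) {
    std::vector<std::string> lines;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return lines;

    std::vector<char> iobuf(1 << 20);                        // 1 MiB stdio buffer
    std::setvbuf(f, iobuf.data(), _IOFBF, iobuf.size());

    std::string pending;                                     // partial line spanning chunks
    char chunk[1 << 16];
    std::size_t got;
    while ((got = std::fread(chunk, 1, sizeof chunk, f)) > 0) {
        std::size_t start = 0;
        for (std::size_t i = 0; i < got; ++i) {
            if (chunk[i] == '\n') {
                pending.append(chunk + start, i - start);
                lines.push_back(std::move(pending));
                pending.clear();
                start = i + 1;
            }
        }
        pending.append(chunk + start, got - start);
    }
    if (!pending.empty()) lines.push_back(std::move(pending));
    std::fclose(f);
    return lines;
}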

Your goal is to parse the file as fast as the disk system can supply it. As a baseline test, try copying the entire file to /dev/null (or equivalent) and see how long that takes. This times how fast the OS can read the file.
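
If you'd rather time it from inside a program, a minimal sketch that just reads and discards the bytes (so only OS read speed is measured) could look like this:

// Sketch: measure how fast the OS can hand us the file's bytes, doing no parsing.
#include <chrono>
#include <cstddef>
#include <cstdio>

int main(int argc, char** argv) {
    const char* path = argc > 1 ? argv[1] : "monster.txt";   // placeholder default
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return 1;

    char buf[1 << 16];
    unsigned long long total = 0;
    auto t0 = std::chrono::steady_clock::now();
    std::size_t got;
    while ((got = std::fread(buf, 1, sizeof buf, f)) > 0) total += got;
    auto t1 = std::chrono::steady_clock::now();
    std::fclose(f);

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%llu bytes in %.3f s (%.1f MB/s)\n", total, secs,
                secs > 0 ? total / (secs * 1e6) : 0.0);
}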
Some implementations of C++ streams are very slow.

You could also try turning off syncing between C++ and C streams:

// Turn off sync (call early in program, before any I/O happens)
std::ios_base::sync_with_stdio(false);

// This one may not be needed:
// untie cin from cout (so cin doesn't automatically flush cout)
std::cin.tie(nullptr);

try copying the entire file to /dev/null (or equivalent) and see how long that takes.

Granted, the operation might complete much faster when the page cache is already warm.
@dhayden's point is key here, so I'd like to underscore it a bit.

Sometimes the correct choice is a different method (algorithm, usually). When I insisted on concrete tests of this years ago, I found that on modern hardware (this side of a 3 GHz CPU), the most you could read from a file using C++ streams was about 150 MB/sec. It just never moved faster than that.

Switching to the C family (fread/fopen), the rate jumped to 600 MB/sec.

Switching to memory-mapped file I/O pulled in over 900 MB/sec.

All of that in a single thread.

Now, when reading files, consider what threading would do, and how it would do it. One method is to divide the file into sections, handing each thread its own section. Unfortunately, that would have each thread thrashing the read/write head of a hard disk across the drive, slowing the work down considerably. Even on an SSD, this tends to clog up the queue of commands sent to the drive, and you get a similar problem (highly dependent on the drive and controller).

On the other hand, a lot depends on the parser code itself. @Salem C pointed out that if there is 100% CPU usage, then the parser itself is the problem. That's true, but I'd like to add the caveat that the saturation may be on one thread, so that's 100% of the thread doing the parsing (not necessarily 100% of all CPU cores).

If, and that's a big if, the parser can be written as a set of parallel operations, then one thread reading and distributing to a number of workers doing the parsing might be a good construction.
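
To make that concrete, here's a rough sketch of that construction: one reader thread feeding a mutex-protected queue that several worker threads drain. The queue type, worker count, and filename are all placeholders, and real code would likely batch lines to cut lock contention:

// Sketch: one reader thread produces lines, N workers consume and parse them.
// The queue is a plain mutex-protected deque; a bounded or lock-free queue,
// or batching several lines per push, would reduce contention in real code.
#include <algorithm>
#include <condition_variable>
#include <deque>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

int main() {
    std::deque<std::string> queue;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::thread reader([&] {
        std::ifstream ifs("monster.txt");            // placeholder filename
        std::string line;
        while (std::getline(ifs, line)) {
            { std::lock_guard<std::mutex> lk(m); queue.push_back(std::move(line)); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    });

    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i) {
        workers.emplace_back([&] {
            for (;;) {
                std::string line;
                {
                    std::unique_lock<std::mutex> lk(m);
                    cv.wait(lk, [&] { return done || !queue.empty(); });
                    if (queue.empty()) return;       // reader finished and queue drained
                    line = std::move(queue.front());
                    queue.pop_front();
                }
                // parse 'line' here (placeholder for the real per-record work)
            }
        });
    }

    reader.join();
    for (auto& w : workers) w.join();
}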


