What I think happened is that you have 4 cores (maybe not), and when the first job finished, its thread picked up the 5th file, while the other 3 threads sat in a spinlock calling yield (you can google "yield using high cpu").
But the fact that you can't reproduce the problem is troubling. I still recommend using sleep_for instead of yield: yield is for latency-sensitive spinning, and you aren't doing anything that time-sensitive. sleep_for is inaccurate and often sleeps much longer than 1 ms, but it gives the CPU some air.
I also don't understand how the single-threaded version is inefficient. What speed did the multithreaded version get, was it 10 seconds? It really doesn't make sense because of OS caching: getline only pulls a small chunk of the file at a time, but the cache should make accessing the rest of the data about as fast as reading RAM (up to the allocation granularity size).
I may be wrong, maybe getline is much less efficient than I think, but there really is no other easy way around it, even with something more bare-metal like finding the delimiters yourself; a 2x speedup isn't that big.
Are you using more than 1 disk with RAID 0 or RAID 5?
Can you do a test like this (you can reuse the timing logic from the Dr. Dobb's source for simplicity)?
Restart your computer, then try this multiple times:

```cpp
//time start
open_file("1.txt");
//time end
```
Then restart your computer again and try this multiple times:

```cpp
//time start
std::thread t1([](){ open_file("1.txt"); });
std::thread t2([](){ open_file("2.txt"); });
t1.join();
t2.join();
//time end
```
Don't forget to tell us what timings you get from them.