I have huge text files that I would like to sort... they range from 90,000 KB to 120,000 KB. I have access to a machine with 32 GB of RAM... what would be the best way to do the sorting?
for (vector<string>::iterator it=myvector.begin(); it!=myvector.end(); ++it)
outdata << *it << endl;
indata.close();
outdata.close();
return 0;
}
It works perfectly with small text files... I'm looking for a better code with huge text files... Actually, the above code is still running on my huge text file... I really don't know if it can handle it or not but in any case, I would really appreciate if someone can give me a better way...
This is pretty much what I was going to suggest that you try, before refreshing the page and noticing your second post.
"I really don't know if it can handle it or not but in any case, I would really appreciate if someone can give me a better way..."
For performance your current code is already very good, in my opinion.
What you could do would be to reserve() memory in myvector.
This should improve performance a bit because repeatedly using push_back() won't cause memory reallocations.
vector<string> myvector;
myvector.reserve(HOW_MANY_WORDS);
// where HOW_MANY_WORDS is approximately the
// maximum number of words you expect to read
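If you want to see the effect of reserve() for yourself, here is a minimal sketch. The helper name count_reallocations is made up for this illustration and is not part of your program. It pushes n short strings into a vector and counts how often the capacity changes, which is when the vector reallocates its storage.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): pushes n short strings into a
// vector and counts how many times the vector's capacity changes, i.e. how
// many times push_back() forced a reallocation.
std::size_t count_reallocations(std::size_t n, bool reserve_first)
{
    std::vector<std::string> v;
    if (reserve_first)
        v.reserve(n);                        // one allocation up front

    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();

    for (std::size_t i = 0; i < n; ++i) {
        v.push_back("word");
        if (v.capacity() != last_capacity) { // storage was reallocated
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    return reallocations;
}
```

With reserve_first set to true the count should be zero. Without it, the exact count depends on your standard library's growth policy, but it will be greater than zero.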
Other minor suggestions: clean up your code a little, and be sure to turn on compiler optimizations:
ifstream indata("file1.txt");   // the constructors already open the files,
ofstream outdata("file2.txt");  // so separate open() calls are not needed
string line;
vector<string> myvector;
myvector.reserve(1000); // you can probably set this higher, if you have 32 GB of RAM
// ...
for (vector<string>::const_iterator it=myvector.begin(); it!=myvector.end(); ++it)
    outdata << *it << '\n'; // '\n' avoids the flush that endl forces on every line
indata.close();
outdata.close();
See your IDE's or compiler's documentation about how to turn on optimizations.
With GCC, you pass the -O3 argument to g++.
With Visual Studio, you build in Release mode.
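Just as a side note, for very large files, std::vector is not always the best choice. Consider a std::deque instead. It shouldn't cost you significant performance (if any), and it is easier on system memory: a deque grows in separate blocks, so it never needs one huge contiguous allocation and never copies the existing elements when it grows.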