My goal is to remove redundant strings from a list of hundreds of millions.
The solution I came up with is:
1. Use quick-sort on the list of strings
2. Pick the unique strings from the sorted list
Note that, during sorting, the strings are not moved around, only their pointers are.
I was wondering whether quick-sort would run significantly faster if I stored the strings contiguously in a vector, as opposed to allocating each one on the heap (see below):
```cpp
#include <string>
#include <vector>
using std::string;
using std::vector;

// Would this be more efficient...
vector<string> strings_list;
vector<string*> pStrings;
// Fill strings_list with all the strings
// Fill pStrings with pointers to the strings in strings_list
// Perform quick-sort on pStrings

// ...than this?
vector<string*> pStrings;
// For each string (read from file) do:
string* temp_pointer = new string( a_string );
pStrings.push_back( temp_pointer );
// Perform quick-sort on pStrings
```
I was wondering whether placing the strings next to each other in memory would give quick-sort better cache locality and so make it run faster. I've been running the latter method on 278,156,364 strings for about 11 hours now. I've also noticed that the latter method's run-time varies slightly on the same data (~2,000,000 strings): 136 to 144 seconds.
By the way, the strings are at most a little over 15 characters long, and are drawn from a 20-character alphabet.
Thanks in advance :)