Efficient string sorting.

My goal is to remove redundant strings from a list of hundreds of millions.

The solution I came up with is:

1. Use quick-sort on the list of strings
2. Pick the unique strings from the sorted list

Note that, during sorting, the strings are not moved around, only their pointers are.

I was wondering if quick-sort would be significantly faster if I store the strings in a vector, as opposed to allocating each one on the heap (see below):

// Would this be more efficient...
    vector<string> strings_list;
    vector<string*> pStrings;

    // Fill strings_list with all the strings
    // Fill pStrings with pointers to strings in strings_list
    // Perform quick-sort on pStrings

// Then this...
    vector<string*> pStrings;

    // For each string (read from file) do:
    string* temp_pointer = new string( a_string );
    pStrings.push_back( temp_pointer );
    // Perform quick-sort on pStrings


I was wondering if making the addresses of the strings adjacent to each other would allow quick-sort to run faster. I've been running the latter method on 278,156,364 strings for about 11 hours now. I've also noticed that the latter method has slightly varying run-times for the same data (~2,000,000 strings): 136 to 144 seconds.

By the way, the strings are at most a little over 15 characters long, and are drawn from a 20-character alphabet.

Thanks in advance :)
How about just inserting all the strings into a hash?
(using the string as the key - whatever you like as the data)

You might be able to try unordered_map from C++11.
std::set guarantees unique elements and has much faster retrieval by key search.

You know I thought of doing that, but I figured the map/hash would be too big to hold in memory. Wouldn't it?
You can use std::sort for sorting. It is probably faster than anything you write yourself. Have you tested to just sort the vector<string> directly?
std::vector<string> strings_list;
std::sort(strings_list.begin(), strings_list.end());
In C++11 I would expect this to be fast. To remove duplicates you can use erase and unique.
 
strings_list.erase(std::unique(strings_list.begin(), strings_list.end()), strings_list.end());


You could also try to use std::set<string> or std::unordered_set<string>. You just need to add all the strings to the set and you will automatically have no duplicates.
Also, if your compiler supports move semantics you don't need to use pointers to string in a container; instead just toss in a temporary string:

some_container<std::string> cs;
...
cs.push_back(std::string("one"));
cs.push_back(std::string("two"));


Watch this video: http://skillsmatter.com/podcast/home/move-semanticsperfect-forwarding-and-rvalue-references
You know I thought of doing that, but I figured the map/hash would be too big to hold in memory. Wouldn't it?

Erm yeah, talking about memory... exactly what hardware does this run on? 278 million strings can easily burn up 15 GB+, much more than that if you're allocating them via new. Does the machine have enough RAM?
Sorting 50 million strings takes 160 seconds for me (260s with pointer indirection), so 278 million should take nowhere near 11 hours.
I wonder how a trie would perform.
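For what it's worth, a trie for this data could look something like the sketch below. The 20-character alphabet string is an assumption (the thread never says which 20 characters); adjust it for the real data. Note each node holds 20 pointers, i.e. 160 bytes on 64-bit, so unless prefixes are heavily shared this can cost more memory than sorting.

```cpp
#include <array>
#include <memory>
#include <string>

// Assumed 20-character alphabet -- replace with the actual one.
const std::string kAlphabet = "ACDEFGHIKLMNPQRSTVWY";

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 20> child;  // all null initially
    bool terminal = false;  // true if some input string ends at this node
};

// Maps a character to 0..19; assumes c is actually in the alphabet.
int index_of(char c) { return static_cast<int>(kAlphabet.find(c)); }

// Inserts s, returning true if it was new (i.e. not a duplicate).
bool insert(TrieNode& root, const std::string& s) {
    TrieNode* node = &root;
    for (char c : s) {
        std::unique_ptr<TrieNode>& next = node->child[index_of(c)];
        if (!next) next.reset(new TrieNode);
        node = next.get();
    }
    bool was_new = !node->terminal;
    node->terminal = true;
    return was_new;
}
```

With strings capped at ~15 characters, lookup and insert are O(15) regardless of how many strings are stored, which is the trie's appeal here.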
You might have to resort to a disc sort.

Sort the strings in chunks of ~10 million at a time, writing each chunk to a file.

Then merge the files writing out the sorted/uniqued data as you go
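The disc sort described above might be sketched like this (modern C++; the chunk size, temp-file names, and function name are illustrative): sort fixed-size chunks in memory, spill each to a run file, then k-way merge the runs with a min-heap, emitting each string once.

```cpp
#include <algorithm>
#include <cstdio>
#include <fstream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

void external_sort_unique(const std::string& in_path, const std::string& out_path,
                          std::size_t chunk_size = 10000000) {
    std::ifstream in(in_path);
    std::vector<std::string> run_files;

    // Phase 1: produce sorted runs of at most chunk_size strings each.
    std::vector<std::string> chunk;
    std::string line;
    while (true) {
        bool more = static_cast<bool>(std::getline(in, line));
        if (more) chunk.push_back(line);
        if ((!more && !chunk.empty()) || chunk.size() == chunk_size) {
            std::sort(chunk.begin(), chunk.end());
            std::string name = "run" + std::to_string(run_files.size()) + ".tmp";
            std::ofstream run(name);
            for (const std::string& s : chunk) run << s << '\n';
            run_files.push_back(name);
            chunk.clear();
        }
        if (!more) break;
    }

    // Phase 2: k-way merge of the runs, dropping duplicates as we go.
    std::vector<std::ifstream> runs;
    for (const std::string& name : run_files) runs.emplace_back(name);

    using Entry = std::pair<std::string, std::size_t>;  // (line, run index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (std::size_t i = 0; i < runs.size(); ++i)
        if (std::getline(runs[i], line)) heap.emplace(line, i);

    std::ofstream out(out_path);
    std::string last;
    bool first = true;
    while (!heap.empty()) {
        Entry e = heap.top(); heap.pop();
        if (first || e.first != last) {  // emit only the first copy
            out << e.first << '\n';
            last = e.first;
            first = false;
        }
        if (std::getline(runs[e.second], line)) heap.emplace(line, e.second);
    }
    for (const std::string& name : run_files) std::remove(name.c_str());
}
```

Memory use is bounded by one chunk plus one line per run, so this works even when the full data set does not fit in RAM.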

11 hours => the machine is swap thrashing?
The process has 47 GB of RAM available. Why would allocating via new require more memory than 15 GB? Wouldn't it just require the same amount of memory that each string needs?

Are you using std::sort on vector<string> to get 160 seconds? I thought of using std::sort like that, but I figured it would be too costly to swap strings from one location to another. If you did use std::sort, then I'm confused as to why it took longer with pointer indirection.

I'm going to try std::sort on my end.
Are you using std::sort on vector<string> to get 160 seconds?

Yes.

I thought of using std::sort like that, but I figured it would be too costly to swap strings from one location to another.

It's not costly at all, the strings just swap their internal pointers.

If you did use std::sort, then I'm confused as to why it took longer with pointer indirection.

With the unnecessary dereferencing it takes longer, nothing surprising about that.
Wow. I thought the std::sort algorithm swapped the strings... oops

Just used std::sort, took 4 seconds.

Darn it! Wasted time :-/

Why didn't I just test std::sort first...
278 million strings can easily burn up 15 GB+
How did you get that number?
300e6*15 = 4.5e9
I think that a map will at most duplicate that value.
300e6*15 = 4.5e9

? std::string has significant overhead.
The 50 million strings (9.5 characters in average) take up 2.5 GB in a 64-bit program.
I can't believe that it triples its size.
I've got sizeof(string) == 8, so that would mean a 50% overhead (6.75e9).

What are you using to measure it? pmap is giving me crap.
so that will mean a 50% overhead (6.75e9)

std::string dynamically allocates the memory for the string (+its internals, at least in the GNU implementation), which introduces further overhead.
The penalties are smaller for 32-bit applications, though. When compiling with -m32, the same 50 million words take up just 1.6 GB.

What are you using to measure it?

Just gnome-system-monitor.
I forgot that the pointers need to be deleted, xP.
However vector<string> v( 300e6, string(15,'*') ); is using only 1.1 GiB in 32-bit?
(pmap and gnome-system-monitor report the same)
Hm yeah, the GNU implementation uses a copy-on-write mechanism for strings, so all those strings actually share the same data initially. But if you do something like this, the situation changes:

  vector<string> v( 300e6,string(15,'*') );
  for (string& s : v) s[0] = s[0];
Topic archived. No new replies allowed.