theranga wrote:
Your algorithm with the set would also require O(n log n) time to load the data, so its overall efficiency would be O(n log n)
The whole process does have to be done once per word, i.e. n times - that is what you were getting at with O(n log n) - but it's not as simple as that. I was talking about the efficiency of the find algorithm, not the overall efficiency, which is very different. The frequency of the words is unknown and variable - it depends on the file being analysed.
And I proposed using an array of 26 maps, not a set.
So consider some scenarios.
1. File size - 1 million words. Data structure - a single std::set of all the words.
The amount of time taken varies for each word added; it depends on whether the word is already in the set or not, and on how many words are already there.
The time taken to analyse a word varies because:
- The set starts with 1 word and grows up to its maximum size, so n varies;
- If the word is not in the set, this takes the longest, but the find done by the insert function is O(log n); if the word is already there, it takes somewhat less time to find it - perhaps around log(n/2) comparisons on average;
- The frequency of the words means there will be far fewer words in the set than there are in the file - say there were 500 distinct words in the file - so the maximum set size is 500, not 1,000,000. A rough sketch of this approach is below.
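Just to make it concrete, here's a minimal sketch of scenario 1. The file name ("words.txt") and reading whitespace-separated words with operator>> are my assumptions, not part of the original problem:

#include <fstream>
#include <iostream>
#include <set>
#include <string>

int main()
{
    std::set<std::string> words;           // ordered set of distinct words
    std::ifstream in("words.txt");         // hypothetical file name

    std::string word;
    while (in >> word)
        words.insert(word);                // insert does a find first: O(log n), n = current set size

    std::cout << "Distinct words: " << words.size() << '\n';
}

The point being that n here is the number of distinct words seen so far (at most 500 in the example), not the 1,000,000 words in the file.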
2. To make the math easy, say there are 500 distinct words in the file and they all have the same frequency (2,000 occurrences each). Furthermore, the distribution across the words' first letters is uniform (about 20 distinct words per letter). It's very hard to guess what these numbers might actually be, but the logic is the same. Also, we now use the data structure:
std::map<std::string, std::vector <unsigned int> > WordMapArray[26];
The ideas mentioned in item 1 still apply, but now each map has only about 20 words in it, each with a frequency of 2,000. So this is much better than a single container with 500 items.
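A rough sketch of that idea, under my own assumptions: the original post doesn't say what the std::vector<unsigned int> holds, so I'm guessing it stores each word's positions in the file, and I'm only handling words that start with an ASCII letter:

#include <cctype>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main()
{
    // one map per first letter; word -> positions where it occurs (my guess at the vector's purpose)
    std::map<std::string, std::vector<unsigned int> > WordMapArray[26];

    std::ifstream in("words.txt");         // hypothetical file name
    std::string word;
    unsigned int position = 0;

    while (in >> word)
    {
        unsigned char first = static_cast<unsigned char>(word[0]);
        if (!std::isalpha(first))
            continue;                      // skip anything not starting with a letter

        int index = std::tolower(first) - 'a';
        WordMapArray[index][word].push_back(position++);   // find runs on a much smaller map (~20 items)
    }

    std::cout << "Distinct words starting with 'a': " << WordMapArray[0].size() << '\n';
}

The first-letter index is essentially a one-character hash, so each find is O(log m) where m is the number of distinct words sharing that first letter, rather than the total number of distinct words.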
3. The same as scenario 2, but the distributions are more realistic. Now, for words whose first letter is much less frequent, the map will have far fewer items - maybe none - so this provides quite a saving, though it is offset somewhat by letters that are more common as first letters.
It would be interesting to compare this with cire's suggestion of an unordered set, which uses hashing for its lookups. I guess it depends on the hashing strategy - I would not be the least bit surprised if it was better than my little idea.
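For comparison, a sketch of the unordered version - again the file name is just illustrative. std::unordered_set gives average O(1) find/insert via hashing at the cost of losing ordering; timing both versions on the same file would be the real test:

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

int main()
{
    std::unordered_set<std::string> words;  // hash-based: average O(1) find/insert
    std::ifstream in("words.txt");          // hypothetical file name

    std::string word;
    while (in >> word)
        words.insert(word);                 // average constant time, worst case O(n) on bad hashing

    std::cout << "Distinct words: " << words.size() << '\n';
}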
Hope all goes well.