I'm working on a site (among other things) ATM, and I'm using FAROO's SymSpell [1] algorithm for correction and autocompletion. There is a thorough description on their blog, but, in short, the algorithm takes user input, generates deletes up to a certain edit distance (e.g. 'play' -> ['lay', 'pay', 'ply', 'pla'] for max_edit_dist == 1) and looks them up in a dictionary. That dictionary's keys are themselves deletes of terms, and its values are lists of suggested corrections (e.g. dictionary['sn'] == [23, 56, 89], where the numbers are the IDs of, say, 'sun', 'son' and 'sin'). I have a working JavaScript implementation (I'm using node), but it consumes too much memory (around 1 GB). I really don't want to move to disk I/O, because the search module is a fundamental part of the site and I want to keep it as fast as possible. But I don't want to pay for the extra RAM either. So, what do I do?
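To make the delete step concrete, here's roughly what the generation looks like (a minimal C++ sketch, since that's where I'm headed anyway; the names are mine, not FAROO's):

    #include <string>
    #include <unordered_set>

    // Collect every string reachable from `word` by deleting up to
    // `max_edit_dist` characters (the SymSpell "deletes").
    void generateDeletes(const std::string& word, int max_edit_dist,
                         std::unordered_set<std::string>& out) {
        if (max_edit_dist == 0 || word.size() <= 1) return;
        for (size_t i = 0; i < word.size(); ++i) {
            std::string del = word.substr(0, i) + word.substr(i + 1);
            if (out.insert(del).second)  // recurse only on first sight
                generateDeletes(del, max_edit_dist - 1, out);
        }
    }
    // generateDeletes("play", 1, s) -> s == {"lay", "pay", "ply", "pla"}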
The first thing I thought I should change was the character encoding. AFAIK, JavaScript uses UTF-16 internally, which is quite wasteful for my needs. I only need to encode about 80 - 90 distinct values, so 7 bits per character are more than enough. But 7 bits can encode 128 values, so what do I do with the spare ones? I can use them to encode frequent pairs or triples of characters. And while I'm at it, I can also apply Huffman coding afterwards. The result looks quite OK; on my data, the average encoded word size is around 0.325 times the original. Here's what the procedure looks like on part of a different dataset (a sketch of the encoder follows the table):
    word        encoding           bits per symbol   total bits  total bytes  original bytes (UTF-16)  ratio
    abra        a/b/ra             4/6/6             16          2            8                        0.25
    aerodactyl  a/e/ro/d/a/ct/y/l  4/4/7/6/4/7/7/5   44          6            20                       0.3
    alakazam    a/l/a/k/a/z/a/m    4/5/4/6/4/8/4/6   41          6            16                       0.375
    arbok       ar/b/o/k           6/6/5/6           23          3            10                       0.3
    arcanine    ar/c/an/in/e       6/5/6/7/4         28          4            16                       0.25
    articuno    ar/t/i/c/u/n/o     6/4/5/5/6/5/5     36          5            16                       0.313
    beedrill    b/ee/d/ri/l/l      6/6/6/7/5/5       35          5            16                       0.313
    bellsprout  b/el/l/s/p/ro/u/t  6/6/5/5/6/7/6/4   45          6            20                       0.3
    blastoise   b/l/as/to/i/s/e    6/5/7/7/5/5/4     39          5            18                       0.278
    bulbasaur   b/u/l/b/a/sa/u/r   6/6/5/6/4/7/6/4   44          6            18                       0.333
    butterfree  b/u/t/te/r/f/r/ee  6/6/4/6/4/6/4/6   42          6            20                       0.3
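The encoding itself is just a table lookup plus bit packing. A minimal sketch, assuming a precomputed symbol -> (code, bit length) table produced by the Huffman pass and a greedy longest-match tokenizer (the Code struct and the tokenization strategy are simplifications of mine):

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Code { uint32_t bits; uint8_t len; };  // Huffman code + bit length

    // Pack `word` into bytes using a precomputed symbol -> code table.
    // Single characters are assumed to always be present in the table;
    // pairs/triples are used greedily when available.
    std::vector<uint8_t> encode(const std::string& word,
                                const std::unordered_map<std::string, Code>& table) {
        std::vector<uint8_t> out;
        int used = 8;  // bits consumed in the current (last) byte
        for (size_t i = 0; i < word.size();) {
            size_t len = 3;  // longest match: try 3, then 2, then 1 chars
            while (len > 1 && (i + len > word.size() ||
                               !table.count(word.substr(i, len)))) --len;
            const Code& c = table.at(word.substr(i, len));
            i += len;
            for (int b = c.len - 1; b >= 0; --b) {  // append bits MSB-first
                if (used == 8) { out.push_back(0); used = 0; }
                out.back() |= ((c.bits >> b) & 1) << (7 - used);
                ++used;
            }
        }
        return out;  // e.g. "abra" packs into 2 bytes with the table above
    }

The dictionary keys would then store these packed bytes instead of UTF-16 strings.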
Another thing I can do is clever data structure work to save the MBs wasted on pointers and bookkeeping variables. Suppose I use several std::map<MyString<ByteCount>, std::vector<MyInt>>s to store the dictionary (I'll switch from JS to C++, since node supports C++ 'addons'). Once I finish building the maps, I don't need to modify them any more, so I can switch to sorted vectors of keys and vectors of values. That saves the left and right child pointers of each tree node. I can also save space by concatenating the value vectors into one. This is easier to explain visually (a sketch of the resulting structure follows the example):
Instead of...
dictionary['a'] == [1, 2] (many vectors)
dictionary['b'] == [3, 4, 5] (64 bit pointer + size etc... overhead)
dictionary['c'] == [6]
dictionary['d'] == [7, 8]
[...]
I can have...
dictionary['a'] == 0 (just an index, much less than 64 bits, BTW)
dictionary['b'] == 2
dictionary['c'] == 5
dictionary['d'] == 6
[...]
[1, 2, 3, 4, 5, 6, 7, 8, ...] (just one vector)
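In code, the frozen structure could look something like this (a minimal sketch under my assumptions; FrozenDict and the field names are made up, and I've used plain std::string/uint32_t where the real thing would use MyString<ByteCount>/MyInt):

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // Immutable dictionary: parallel sorted arrays instead of std::map nodes.
    // One shared `ids` array replaces the per-key std::vector<MyInt>s;
    // offsets[k]..offsets[k+1] delimit key k's suggestion IDs.
    struct FrozenDict {
        std::vector<std::string> keys;     // sorted deletes
        std::vector<uint32_t>    offsets;  // keys.size() + 1 entries
        std::vector<uint32_t>    ids;      // all suggestion IDs, concatenated

        // Binary search over the sorted keys; returns the ID range for `key`,
        // or {nullptr, nullptr} if the key is absent.
        std::pair<const uint32_t*, const uint32_t*>
        lookup(const std::string& key) const {
            auto it = std::lower_bound(keys.begin(), keys.end(), key);
            if (it == keys.end() || *it != key) return {nullptr, nullptr};
            size_t k = it - keys.begin();
            return {ids.data() + offsets[k], ids.data() + offsets[k + 1]};
        }
    };

Lookup becomes a binary search over the sorted keys, and each key costs one small integer in `offsets` instead of a heap-allocated vector header plus capacity slack.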
What else can I do? Or is there another, better way to approach this?
Thanks.
[1] https://github.com/wolfgarbe/symspell