simple and fast encoding scheme to massively reduce dictionary size?

I'm working on a site (among other things) ATM, and I'm using FAROO's symspell [1] algorithm for correction and autocompletion purposes. There is a thorough description on their blog, but, in short, the algorithm takes user input, generates deletes up to a certain edit distance (e.g. 'play' -> ['lay', 'pay', 'ply', 'pla'] for max_edit_dist == 1) and looks them up in a dictionary. That dictionary's keys are also deletes of terms, and its values are lists of suggested corrections (e.g. dictionary['sn'] == [23, 56, 89], where the numbers are the IDs of, say, 'sun', 'son' and 'sin'). I have a working JavaScript implementation (I'm using node), but it consumes too much memory (around 1GB). I really don't want to move to disk I/O, because the searching module is a fundamental part of the site and I want to keep it as fast as possible. But I don't want to pay for extra RAM either. So, what do I do?
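
Roughly, the delete-based lookup works like the sketch below (C++ for illustration, since I'm considering moving there anyway; the function names are mine, not symspell's actual API):

// Sketch of the delete-generation and lookup steps described above.
// generate_deletes/lookup are illustrative names, not symspell's own.
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// All strings reachable from 'word' by deleting up to max_edit_dist characters
// (the word itself included, since exact matches must also be found).
std::unordered_set<std::string> generate_deletes(const std::string& word, int max_edit_dist) {
    std::unordered_set<std::string> result{word};
    std::unordered_set<std::string> frontier{word};
    for (int d = 0; d < max_edit_dist; ++d) {
        std::unordered_set<std::string> next;
        for (const std::string& w : frontier)
            for (std::size_t i = 0; i < w.size(); ++i)
                next.insert(w.substr(0, i) + w.substr(i + 1));  // drop character i
        result.insert(next.begin(), next.end());
        frontier = std::move(next);
    }
    return result;
}

// The deletes of the (possibly misspelled) input are looked up in the dictionary,
// whose keys are deletes of dictionary words and whose values are word IDs.
std::unordered_set<int> lookup(const std::string& input, int max_edit_dist,
                               const std::unordered_map<std::string, std::vector<int>>& dict) {
    std::unordered_set<int> candidate_ids;
    for (const std::string& del : generate_deletes(input, max_edit_dist)) {
        auto it = dict.find(del);
        if (it != dict.end())
            candidate_ids.insert(it->second.begin(), it->second.end());
    }
    return candidate_ids;
}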

The first thing I thought I should change was the character encoding. AFAIK, JavaScript uses UTF-16, which is quite wasteful for my needs. I just want to encode about 80 - 90 different values, so 7 bits per character are more than enough. But 7 bits can encode 128 values, so what happens to the unused ones? I can use them to encode frequent pairs or triples of characters. And while I'm at it, I can also apply Huffman coding afterwards. The result looks quite OK; on my data, the average encoded word is around 0.325 times the original size. Here's what the procedure looks like on part of a different dataset:

word         encoding symbols        bits per symbol     total  total   original   ratio
                                                          bits  bytes    bytes

abra         a/b/ra                  4/6/6                 16     2   /    8   ==   0.25 
aerodactyl   a/e/ro/d/a/ct/y/l       4/4/7/6/4/7/7/5       44     6   /   20   ==   0.3  
alakazam     a/l/a/k/a/z/a/m         4/5/4/6/4/8/4/6       41     6   /   16   ==   0.375
arbok        ar/b/o/k                6/6/5/6               23     3   /   10   ==   0.3  
arcanine     ar/c/an/in/e            6/5/6/7/4             28     4   /   16   ==   0.25 
articuno     ar/t/i/c/u/n/o          6/4/5/5/6/5/5         36     5   /   16   ==   0.313
beedrill     b/ee/d/ri/l/l           6/6/6/7/5/5           35     5   /   16   ==   0.313
bellsprout   b/el/l/s/p/ro/u/t       6/6/5/5/6/7/6/4       45     6   /   20   ==   0.3  
blastoise    b/l/as/to/i/s/e         6/5/7/7/5/5/4         39     5   /   18   ==   0.278
bulbasaur    b/u/l/b/a/sa/u/r        6/6/5/6/4/7/6/4       44     6   /   18   ==   0.333
butterfree   b/u/t/te/r/f/r/ee       6/6/4/6/4/6/4/6       42     6   /   20   ==   0.3
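
The encoder itself is basically a greedy longest-match tokenizer followed by a table lookup. A minimal sketch of that idea follows (the symbol/code table is an assumption here; the real one is built from pair/triple frequencies on my data, with Huffman assigning the prefix-free codes):

// Sketch of the encoder: greedy longest-match tokenization into 1-3 character
// symbols, then concatenation of each symbol's prefix-free (Huffman) code.
// Packing the resulting bit vector into bytes is left out for brevity.
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Code = std::vector<bool>;  // one symbol's Huffman code, bit by bit

std::vector<bool> encode(const std::string& word,
                         const std::map<std::string, Code>& table) {
    std::vector<bool> out;
    std::size_t pos = 0;
    while (pos < word.size()) {
        // Try the longest symbol first; single characters are assumed to
        // always be present in the table.
        std::size_t len = std::min<std::size_t>(3, word.size() - pos);
        for (; len > 0; --len) {
            auto it = table.find(word.substr(pos, len));
            if (it != table.end()) {
                out.insert(out.end(), it->second.begin(), it->second.end());
                break;
            }
        }
        if (len == 0) break;  // unknown character: shouldn't happen if the table covers the alphabet
        pos += len;
    }
    return out;
}

For 'abra' and the codes in the table above, this produces a/b/ra -> 4 + 6 + 6 = 16 bits, i.e. 2 bytes once packed (assuming none of 'abr', 'ab', 'bra' or 'br' made it into the table).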

Another thing I can do is clever data-structure work to save the MBs wasted on pointers and bookkeeping variables. Suppose I use several std::map<MyString<ByteCount>, std::vector<MyInt>>s to store the dictionary (I'll switch from JS to C++, since node supports C++ 'addons'). Once I finish building my maps, I don't need to modify them anymore, so I can switch to sorted vectors of keys and vectors of values. This saves me the left and right child pointers of every tree node. I can also save some space by concatenating the value vectors. This is easier to explain visually; a code sketch follows the illustration below:

Instead of...

dictionary['a'] == [1, 2]  (many vectors)
dictionary['b'] == [3, 4, 5] (64 bit pointer + size etc... overhead)
dictionary['c'] == [6]
dictionary['d'] == [7, 8]
[...]

I can have...

dictionary['a'] == 0 (just an index,  much less than 64 bits, BTW)
dictionary['b'] == 2
dictionary['c'] == 5
dictionary['d'] == 6
[...]

[1, 2, 3, 4, 5, 6, 7, 8, ...] (just one vector)
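
In C++ terms, the end result could look something like this (a sketch under my own naming, not finished code):

// Sketch of the flattened, read-only dictionary described above: sorted keys,
// one big concatenated value vector, and per-key offsets into it.  A binary
// search over the sorted keys replaces the tree traversal of std::map.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct FlatDictionary {
    std::vector<std::string>   keys;     // sorted delete strings
    std::vector<std::uint32_t> offsets;  // keys.size() + 1 entries; values of keys[i]
                                         // live at ids[offsets[i]] .. ids[offsets[i+1]-1]
    std::vector<std::uint32_t> ids;      // all suggestion IDs, concatenated

    // Returns a [begin, end) range into 'ids', empty if the key isn't present.
    std::pair<const std::uint32_t*, const std::uint32_t*>
    find(const std::string& key) const {
        auto it = std::lower_bound(keys.begin(), keys.end(), key);
        if (it == keys.end() || *it != key)
            return {nullptr, nullptr};
        std::size_t i = it - keys.begin();
        return {ids.data() + offsets[i], ids.data() + offsets[i + 1]};
    }
};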

What else can I do? Or is there maybe another, better way to approach this?

Thanks.


[1] https://github.com/wolfgarbe/symspell
The costs are:

1) generating entropy for potential word matches
2) managing a dictionary

Since there isn't much you can do to offload the entropy matching, I suggest you reduce your dictionary's size.

I would thin the dictionary by removing all deletes not at the beginning of the word.
That is, for 'play', the only delete you should have in the dictionary is 'lay'.

I would also put something of a pre-filter on choices so that you need not choose words that couldn't possibly be matches (like 'play' and 'pancake').

Finally, just run through the list, seeking the proper match.

Does your algorithm account for things like 'paly'? Swapped letters are a huge source of misspellings/mismatches.

Hope this helps.

Thanks for the reply.

> Does your algorithm account for things like 'paly'?

Of course. 'paly' generates 'pay' and 'ply', which are also generated by 'play', which implies that 'play' will be a suggestion for 'paly'. A detailed explanation can be found here -> [1].

> I would thin the dictionary by removing all deletes not at the beginning of the word.
> That is, for 'play', the only delete you should have in the dictionary is 'lay'.

If I did that, I wouldn't be able to account for e.g. swapped letters :P Well, I could, but I would have to generate more than just deletes from the user's entry in order to match something in the dictionary. That starts to sound like Peter Norvig's approach, which is orders of magnitude slower than symspell. If I went down that road, I might as well do disk I/O (which I want to avoid).

I believe the algorithm itself is fine as it is. It was designed to be fast at the cost of consuming a lot of memory. I guess what I'm really looking for is some bleeding-edge-blazingly-fast-ultra-compression thing. I discovered Finite State Entropy yesterday [2, 3]; I'll give it a try once I manage to digest the paper. What worries me is that it seems to require augmenting the compressed string with the state of an automaton, which can be a problem for me, since I'm working with short strings.

However, the thinning suggestion just made me notice that I have more data duplication than I thought. Apart from the main dictionary, whose keys are words, prefixes of words (required for autocompletion) and deletes of both, I also need to maintain a wordlist that simply lists the words and their prefixes (no deletes here), a word2index mapping (the inverse of the wordlist) and an abbreviation mapping (which maps IDs of prefixes to IDs of whole words). It is entirely possible to store the words and their prefixes once instead of twice. I think that might just be enough.
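
For example, if the wordlist is kept sorted, the word2index mapping doesn't need its own copy of every string at all; something along these lines (a sketch, with word IDs becoming positions in the sorted wordlist):

// Sketch: keep a single sorted wordlist as the only copy of each word/prefix,
// and derive the word2index mapping by binary search instead of storing a
// second copy of every string.
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// wordlist must be sorted; returns the word's ID, or -1 if it isn't listed.
std::int64_t word_to_index(const std::vector<std::string>& wordlist,
                           const std::string& word) {
    auto it = std::lower_bound(wordlist.begin(), wordlist.end(), word);
    if (it == wordlist.end() || *it != word)
        return -1;
    return it - wordlist.begin();
}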

Thanks again!


[1] http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/
[2] https://github.com/Cyan4973/FiniteStateEntropy
[3] http://arxiv.org/pdf/1311.2540.pdf
> It was designed to be fast at the cost of consuming a lot of memory.
But you don't want to use that much memory.
Perhaps we could help better if we knew the conditions of your problem, like average word length, dictionary size and how the program is used.
> But you don't want to use that much memory.

I was under the impression I was clear about this, but apparently I wasn't. If I have to pay for the extra RAM, I will. But if I don't have to, why do it? And even if I end up paying, why not put it to better use if I can? To sum up, my priority here is speed. Now, if I manage to significantly reduce memory consumption without severely compromising speed, that's a bonus.

> Perhaps we could help better if we knew the conditions of your problem, like average word length, dictionary size and how the program is used.

Sure.

Suppose I feed the program a file with one term per line, looking like this:

bulbasaur grass poison
ivysaur grass poison
venusaur grass poison
charmander fire
charmeleon fire
charizard fire flying
squirtle water
wartortle water
blastoise water
[...] 

i.e. the first 151 pokemon with their respective types, taken from:
http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number

Then, I can interact with it like:

enter something: pikatsu
did you mean...

pikachu electric 

???

enter something: pi
did you mean...

pidgeot normal flying
pidgey normal flying
pidgeotto normal flying
pikachu electric
pinsir bug 

???

enter something: pisnir
did you mean...

pinsir bug 

???

enter something: pirsin
did you mean...

persian normal
pinsir bug 

???

enter something: n pirsin
did you mean...

persian normal

???

enter something: cha
did you mean...

charmeleon fire
charizard fire flying
charmander fire 
chansey normal 

???

enter something: fly char
did you mean...

charizard fire flying 

???

enter something: fir fly
did you mean...

charizard fire flying
moltres fire flying 

???

enter something: fir fl
did you mean...

flareon fire
charizard fire flying
moltres fire flying 

???

enter something: posin watre
did you mean...

tentacruel water poison
tentacool water poison 

???

enter something: psy watr
did you mean...

slowpoke water psychic
starmie water psychic
slowbro water psychic
psyduck water 

???

enter something:
[...]

Notice that I apply the algorithm on two different levels: (a) on individual words and (b) on whole terms. When resolving 'fir fl', for example, each word is examined separately at first: 'fir' resolves to ['fire'] and 'fl' to ['flareon', 'flying']. Then, ['fire'] maps to the set of fire pokemon, and ['flareon', 'flying'] to the set of flying pokemon plus flareon. Finally, the suggestions are the pokemon that appear in the largest number of these sets (two, in this example). Initially I had used set intersection, but that would return no suggestions for 'pika ice', while I wanted to get pikachu and the ice pokemon.
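
In code, the term-level step could be a simple voting pass over the per-word suggestion sets; a sketch (names and containers are mine):

// Sketch of the term-level voting described above: every input word's
// suggestion set votes for the terms that contain any of its suggestions,
// and the terms with the most votes become the final suggestions.
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

std::vector<std::uint32_t> suggest_terms(
    const std::vector<std::vector<std::uint32_t>>& word_suggestions,   // per input word
    const std::unordered_map<std::uint32_t,
                             std::vector<std::uint32_t>>& word_to_terms) {
    std::unordered_map<std::uint32_t, int> votes;  // term ID -> number of input words hitting it
    for (const auto& suggestions : word_suggestions) {
        std::unordered_set<std::uint32_t> seen;    // one vote per input word per term
        for (std::uint32_t word_id : suggestions) {
            auto it = word_to_terms.find(word_id);
            if (it == word_to_terms.end()) continue;
            for (std::uint32_t term_id : it->second)
                if (seen.insert(term_id).second)
                    ++votes[term_id];
        }
    }
    int best = 0;
    for (const auto& kv : votes) best = std::max(best, kv.second);   // highest vote count
    std::vector<std::uint32_t> result;
    for (const auto& kv : votes)
        if (kv.second == best) result.push_back(kv.first);           // keep only the winners
    return result;
}

For 'fir fl', 'flareon fire' (hit via 'fire' and 'flareon') and the fire/flying pokemon charizard and moltres (hit via 'fire' and 'flying') each collect two votes, which matches the output shown above.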

Some stats now.

For the pokemon dataset used in the example:

alphabet size: 27

[word stats]

unique words: 168

              absolute   relative
     length  frequency  frequency

       3         4       0.0238
       4         6       0.0357
       5        15       0.0893
       6        34       0.2024
       7        43       0.256
       8        28       0.1667
       9        24       0.1429
      10        14       0.0833

wordlist (words and prefixes) size: 1196
dictionary (words, prefixes and deletes) size: 17470

[term stats]

unique terms: 151

              absolute   relative
     length  frequency  frequency

       2        84       0.5563
       3        67       0.4437
       
termlist (just terms) size: 151
dictionary (just abbreviations) size: 168

For my real dataset:

alphabet size: 89

[word stats]

unique words: 4052

              absolute   relative
     length  frequency  frequency

       1        35       0.0086
       2        49       0.0121
       3       157       0.0387
       4       322       0.0795
       5       447       0.1103
       6       565       0.1394
       7       557       0.1375
       8       524       0.1293
       9       484       0.1194
      10       312       0.077
      11       234       0.0577
      12       162       0.04 
      13        70       0.0173
      14        53       0.0131
      15        38       0.0094
      16         8       0.002
      17        13       0.0032
      18        13       0.0032
      19         2       0.0005
      20         2       0.0005
      21         1       0.0002
      22         1       0.0002
      23         1       0.0002
      24         2       0.0005
      
wordlist (words and prefixes) size: 30599
dictionary (words, prefixes and deletes) size: 779544

[term stats]

unique terms: 5736

              absolute   relative
     length  frequency  frequency

       1      1067       0.186
       2      1411       0.246
       3      1237       0.2157
       4       960       0.1674
       5       522       0.091
       6       284       0.0495
       7       128       0.0223
       8        62       0.0108
       9        37       0.0065
      10        14       0.0024
      11         7       0.0012
      12         3       0.0005
      13         1       0.0002
      14         3       0.0005
      
termlist (just terms) size: 5736
dictionary (just abbreviations) size: 4052

Note that I don't use prefixes as abbreviations for terms.
Instead, each word acts as an abbreviation for every containing term.

EDIT: Mmm, it turns out that I don't really need a full-blown application of the algorithm on the term level. There's no need to store deletes of terms. Just mapping each word to every containing term is enough.