All compression works by exploiting probabilities: the more likely something is, the fewer bits it should take to write down.
For example, take the text "Hello world!". That's twelve ASCII characters, each stored in an eight-bit byte (ASCII itself needs only seven bits). The first observation is that we are using far more bits per character than we need. There are only nine distinct characters in the message, so each one fits in just four bits. That alone halves the space used.
But we can do better. The trick is that some characters appear more often than others. In this example, there are three 'l's and two 'o's. Since an 'l' or an 'o' is more likely to appear in the message than any of the other characters, we can redistribute the number of bits needed to store each character so that characters that appear more often use fewer bits.
'l' 00
'o' 01
'H' 100
'e' 101
' ' 1100
'w' 1101
'r' 1110
'd' 11110
'!' 11111
Our original message of 48 bits (using four bits per character) is now only 38 bits long. That's a pretty good start.
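To make that concrete, here is a quick C++ sketch (with the code table above hard-coded) that performs the encoding and lets you count the bits yourself:

```cpp
#include <map>
#include <string>

// Encode a message using the variable-length code table from above.
std::string encode(const std::string& message) {
    static const std::map<char, std::string> codes = {
        {'l', "00"},   {'o', "01"},   {'H', "100"},  {'e', "101"},
        {' ', "1100"}, {'w', "1101"}, {'r', "1110"}, {'d', "11110"},
        {'!', "11111"}
    };
    std::string bits;
    for (char c : message)
        bits += codes.at(c);  // append each character's bit pattern
    return bits;
}
```

Calling encode("Hello world!") produces a 38-character string of '0's and '1's, matching the count above.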
This is the basis of all data compression.
(Notice that the bit patterns are specifically designed so that it is impossible to have ambiguities. You can see this by drawing them as a binary tree, where a 0 leads to a left branch and a 1 leads to a right branch; every code ends at a leaf, so no code is a prefix of another. The exact methods of coming up with these binary trees are not important in this particular discussion, but you may want to google around "Huffman coding" for some examples.)
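You can see the no-ambiguity property in action with a sketch of a decoder that simply accumulates bits until they match a code in the table — because no code is a prefix of another, the first match is always the right one:

```cpp
#include <map>
#include <string>

// Decode a prefix-free bit string using the code table from the example.
std::string decode(const std::string& bits) {
    static const std::map<std::string, char> codes = {
        {"00", 'l'},   {"01", 'o'},   {"100", 'H'},  {"101", 'e'},
        {"1100", ' '}, {"1101", 'w'}, {"1110", 'r'}, {"11110", 'd'},
        {"11111", '!'}
    };
    std::string out, buf;
    for (char bit : bits) {
        buf += bit;                   // accumulate bits...
        auto it = codes.find(buf);
        if (it != codes.end()) {      // ...until they match a code exactly
            out += it->second;
            buf.clear();
        }
    }
    return out;
}
```

Feeding it the 38-bit encoding from before recovers "Hello world!" with no backtracking needed.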
LZW works by finding common patterns ("strings" of bits) and representing them with shorter strings. For example, the pattern "he" appears often in English: "When the dog followed them to the store they bought it licorice." (Count them: "when", "the", "them", "the", "they".) Notice that the "he" pattern is also part of a couple of larger patterns: "the" and "he ".
In order to keep track of all these fancy bit-patterns, we must keep a dictionary. (In C++ that would be a std::map.) Also, the decoder must have a copy of the dictionary the encoder used in order to understand the message.
LZW uses a trick to embed the dictionary in the data stream itself. When the encoder finds a new pattern and adds it to its dictionary, it does not transmit the pattern explicitly; the decoder can deduce every new entry from the codes it has already seen, so it makes the identical additions to its own dictionary, staying exactly one step behind the encoder.
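Here is a minimal sketch of the core encoding loop. It is byte-oriented and deliberately simplified — no clear codes, no variable-width output, and a std::map where a real GIF encoder would use something faster — but it shows how the dictionary grows as a side effect of encoding:

```cpp
#include <map>
#include <string>
#include <vector>

// Simplified byte-oriented LZW encoder: just the core dictionary loop.
std::vector<int> lzw_encode(const std::string& input) {
    std::map<std::string, int> dict;
    for (int i = 0; i < 256; ++i)            // seed with all single bytes
        dict[std::string(1, static_cast<char>(i))] = i;
    int next_code = 256;

    std::vector<int> out;
    std::string w;                           // longest known pattern so far
    for (char c : input) {
        if (dict.count(w + c)) {
            w += c;                          // pattern still in dictionary
        } else {
            out.push_back(dict[w]);          // emit code for known pattern
            dict[w + c] = next_code++;       // add the new, longer pattern
            w = std::string(1, c);
        }
    }
    if (!w.empty()) out.push_back(dict[w]);  // flush the final pattern
    return out;
}
```

The decoder runs the same logic in reverse: each code it reads, plus the first byte of the next decoded string, tells it exactly which entry the encoder just added.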
Also, LZW starts out with a minimum code word size (number of bits per code), and the code word size may grow as the dictionary fills. Both the encoder and the decoder track the dictionary size, so both know exactly when to widen the codes, with no extra signaling needed.
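One way to see the rule (a sketch of the idea, not GIF's exact bookkeeping): the number of bits per code is just the number of bits needed to represent every code handed out so far, so it grows exactly when the next dictionary code would no longer fit.

```cpp
// Bits needed to represent codes 0 .. next_code-1.
// Both encoder and decoder know next_code at every step, so both
// widen their codes at the same moment.
int code_width(int next_code) {
    int width = 1;
    while ((1 << width) < next_code)
        ++width;
    return width;
}
```

For example, 256 codes fit in 8 bits, but the moment a 257th code exists, both sides switch to 9-bit codes.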
There are also a few special codes used to signal that the decoder should behave in a certain way (for example, GIF's "clear" code tells it to throw away the dictionary and start over).
Finally, a GIF LZW stream is packed into data sub-blocks of up to 255 bytes each, which your decoder must first unpack before feeding the raw bitstream to the actual LZW decoder. Oh, and did I mention that you must be able to read a single bit at a time?
For more help, google around "wilhite" and check out the links at the bottom of the Wikipedia page.
http://en.wikipedia.org/wiki/Graphics_Interchange_Format
Good luck!