file compression

Dec 2, 2011 at 5:26pm

I am trying to write the Huffman code for file compression. Although I understand the basic idea of enumerating the characters and assigning the new bit codes, I am confused as to how to actually compress a file, say for example a text file containing some characters eg. aaabb. Do I write the output as another text file containing 0's and 1's or do I create a binary? Because in a text file 0's and 1's are 1 byte and as big as a's and b's.

Thanks

Dec 2, 2011 at 5:27pm

Peter87 (11251)

I think you should use binary encoding

Dec 2, 2011 at 5:43pm

targt123 (30)

I am confused on this, lets say I have an example.txt which contains aaabb and I assign code 0 to a and code 1 to be, the resulting compressed file would be 00011. If I write a text file with 00011, it is the same size as the original aaabb (each 5 bytes), so where is the compression?

Dec 2, 2011 at 5:43pm

LB (13399)

00011 is 5 bits, not 5 bytes. A byte is 8 bits.

Dec 2, 2011 at 5:59pm

targt123 (30)

I have two text files, one with text www and other with int 111, when I click on the properties of the files I see the same size:

Size : 3 (3 bytes)
Size on Disk: 4.00 KB

code:

ofstream myfile;
  myfile.open ("example.txt");
  myfile << "www";
  myfile.close();

  ofstream myfile1;
  int i = 111;
  myfile1.open ("example1.txt");
  myfile1 << i;
  myfile1.close();

Dec 2, 2011 at 6:11pm

LB (13399)

You need to save in binary format, you are currently saving that integer in string format.

Dec 2, 2011 at 6:18pm

targt123 (30)

How do I do that? Thanks

Dec 2, 2011 at 7:26pm

Catfish (666)

Use base 16, aka hexadecimal, and bytes.

On most computers, a byte (aka char) holds 8 bits.
A bit can have a value of either 0, or 1.

Here's the code table:


binary  hex
--------------
0000    0
0001    1
0010    2
0011    3
0100    4
0101    5
0110    6
0111    7
1000    8
1001    9
1010    A
1011    B
1100    C
1101    D
1110    E
1111    F

Using the table above, 8D will look like 10001101. Notice the eight bits.

And here's some code:

1
2
3

char c = '\x8D';
char c_array[]  = {'\x00', '\x3F'};
char c_array2[] = "\x00\x3F"; // caution an extra '\0' gets added to the end

And then you use the write() and read() functions which deal with char data.

http://www.cplusplus.com/reference/iostream/ostream/write/
http://www.cplusplus.com/reference/iostream/istream/read/
http://www.cplusplus.com/doc/hex/

Dec 2, 2011 at 9:20pm

Duthomhas (13276)

@Catfish
Your example uses vocabulary that will confuse the OP -- your "code table" is not the same thing as a Huffman code table.

@targt123
The smallest you can write to disk is eight bits at a time (or one byte). However, when dealing with compression, you must deal with individual bits.

What you must do is create a cache to store bits until you get enough (eight) to write.

[ _ _ _ _ _ _ _ _ ] -- A cache with no bits used.

If I put a bit in, then the cache still has seven empty spaces.

[ _ _ _ _ _ _ _ 1 ] -- A cache with one bit used.

If I put another bit in, it has fewer.

[ _ _ _ _ _ _ 0 1 ] -- A cache with one bit used.

I keep adding bits:

[ _ 1 1 0 1 0 0 1 ] -- A cache with all but one bit used.

When I add the final bit, the cache is full:

[ 1 1 1 0 1 0 0 1 ] -- A full cache (with all bits used).

Now that the cache is full, I can write its value (in this example, 0xE9 or 233 decimal) to file, and reset my cache to empty.

[ _ _ _ _ _ _ _ _ ] -- A cache with no bits used.

The simplest way to implement a cache is to keep track of how many bits are currently in it. A simple class to store the cache, the number of bits used, and methods to accept a bit to add to the cache and to flush should suffice. If you don't want to use a class, you can use functions with either global or static 'cache' and 'number of bits used' values.

-----
Likewise, when you are compressing values, you also need to know how many bits are used in a code (in addition to what those bit values are). For example, a simple Huffman table:
111 -three bits 010 -three bits 000 -three bits 1101 -four bits 1010 -four bits ...I yoinked the table from here: http://en.wikipedia.org/wiki/Huffman_coding

To output the fourth code, I send them to the output cache, one bit at a time: 1, 1, 0, 1. If I next output the first code (1, 1, 1), and then the fifth code (1, 0, 1, 0), I am doing my job -- I know that the data is being written to disk whenever the cache gets full.

When you are done writing all bits, flush the cache to make sure that all bits are written to disk -- that is, if it is not empty, just write it (and all the extra, random bits it contains) to disk.

-----
Pay attention to the order in which you put bits into the cache. Notice how I wrote them so that your decoder can read a bit and follow the bit pattern from the root of the tree down to a leaf. This is important.

That should be enough to get you started. Good luck!

Topic archived. No new replies allowed.