Lossless data compression algorithm ques

Forum

Forum
Lounge
Lossless data compression algorithm ques

Lossless data compression algorithm question

Mar 5, 2022 at 5:55am

Hello, I have the following question about LZ77, a lossless data compression algorithm, wondering if anyone here would know how to solve this:

Given the initial codebook (symbol is left column, code is right column):
_______________________
Symbol | Code
(space) - | 110
B | 1110
H | 11110
I | 10
L | 0
_______________________

We are told that the length of both the search and lookahead buffer are 8, meaning 3 bits for coding the distance and the match length. The coding window is positioned as follows: ILL_B which is in the search buffer, and ILLI which is in the lookahead buffer:
_________________________
...ILL_B | ILLI...
_________________________

We are asked to provide the binary code for this coding step, can someone tell me what it is and how they arrived at it?

Last edited on Mar 5, 2022 at 5:56am

Mar 7, 2022 at 9:49am

kigar64551 (837)

TTBOMK, LZ77 (in its "pure" form) is not concerned with how symbols are encoded binary! Instead it just deals with "symbols", regardless of how they are represented. It uses a "sliding window" approach, which means that all input symbols that are processed also are appended to the "window" buffer. Once the "window" has reached its maximum size, the oldest symbols are dropped, so that the size limit won't be exceeded.

Every sequence of input symbols is stored as a triple of:
1. The number of symbols to copy (i.e re-use) from the current "window" buffer
2. The offset within the "window" buffer where to start copying
3. The next symbol (literal) after the symbols that were copied from the buffer

Note: Values for (1) and (2) are chosen by the compressor in such a way that the longest possible sequence is copied from the "window" buffer, before the literal symbol is stored. This maximizes the compression. Still, the sequence to be copied may be of length zero, e.g. if the next input symbol is not found in the buffer at all.

In practice, we usually use a combination of LZ77 (or one of its variants) and entropy coding. It's the entropy coding that is concerned with how symbols (literal) are encoded in binary. And not just the symbols, also the offsets and lengths! Common methods of entropy coding include Huffmann-Trees and Arithmetic-Coding.

For example, Deflate, the compression algorithm used in ZIP and also in many other applications (probably the most widely used compression algorithm), is a combination of LZSS (a variant of LZ77) and Huffmann-Trees.

https://en.wikipedia.org/wiki/Deflate

Last edited on Mar 7, 2022 at 10:21am

Topic archived. No new replies allowed.

C++

Forum

Lossless data compression algorithm question