I need a code review for LZW archiver

(I decided to let the other LZW thread die, as it features a broken program.)

This is an excruciatingly slow implementation, but it's simple, and it seems to be working. Let me know if it corrupts your files, if you dare actually test it on your computer.

Suggestions for improvement most welcome.
I consider writing an "advanced" version, optimized and maybe even Boost'ed, just to give readers a serious utility if they want it.

One problem that I found: using Nuwen's MinGW 4.7.2, if you compile it with optimization turned on, it breaks the end program. The lambda functions, perhaps?

Edit: removed old code, see post below:
http://cplusplus.com/forum/lounge/92318/#msg496447
Last edited on
You should refactor your duplicated "named" lambda function.

Also, shouldn't the error be printed before the usage?
Last edited on
You should refactor your duplicated "named" lambda function.

I wouldn't say they're duplicate code because each fills up a different data structure.
What do you have in mind?

About the error message, you're right.
They can be factored using templates (no specialization either). In both cases, the data structures you are dealing with have an operator[], and the parameter it takes is convertible from type char.
They can be factored using templates (no specialization either). In both cases, the data structures you are dealing with have an operator[], and the parameter it takes is convertible from type char.


1
2
3
4
5
6
7
8
9
10
11
12
13
template <typename CodeTable>
void reset(CodeTable &ct)
{
    ct.clear();

    char c = std::numeric_limits<char>::min();

    do
    {
        ct[{c}] = {c};
    }
    while (c++ != std::numeric_limits<char>::max());
}


Honestly, I don't like it.

Before I can call that function template in decompress(), I'd have to vector::resize() the code table to a hard-coded size. And what is the benefit, anyway? The named lambdas contain code that has no place outside the functions where they're defined. Maybe you have a more elegant approach in mind?

Edit: hmm... clear() would go outside as well...
Last edited on
Ah, you're right about the resize thing - I had it in my head that resize and reserve were the same - the two code segments weren't close enough for me to easily compare them and I was too lazy to side-by-side them.

It just seems like a waste to duplicate the same basic structure of code like that, but I can't think of a good way to deal with that right now.
Because of the replies I received in this thread:
http://cplusplus.com/forum/general/92334/

I decided to use std::vector<char> instead of std::string.

However, on my machine, the program still freezes when trying to compress (output file size zero) if it was compiled with -O2 or -O3 options.
Anybody have any ideas as to why?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
///
/// @file
/// @author Julius Pettersson
/// @copyright MIT/Expat License.
/// @brief LZW file archiver
///
/// This is the C++11 implementation of a Lempel-Ziv-Welch single-file command-line archiver.
/// It uses the simpler fixed-width code compression method.
/// It was written with Doxygen comments.
///
/// http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
/// http://marknelson.us/2011/11/08/lzw-revisited/
/// http://www.cs.duke.edu/csed/curious/compression/lzw.html
/// http://en.cppreference.com/
/// http://www.doxygen.org/
///

#include <cstdint>
#include <cstdlib>
#include <exception>
#include <fstream>
#include <ios>
#include <iostream>
#include <limits>
#include <map>
#include <ostream>
#include <stdexcept>
#include <string>
#include <vector>

namespace {

typedef std::uint16_t CodeType; ///< Type used to store and retrieve codes.

namespace globals {

/// Code Table Maximum Size (when reached, the code table will be reset)
const CodeType ctms = std::numeric_limits<CodeType>::max();

} // namespace globals

///
/// @brief Helper operator intended to simplify code.
/// @param vc   original vector
/// @param c    element to be appended
/// @returns vector resulting from appending `c` to `vc`
///
std::vector<char> operator + (std::vector<char> vc, char c)
{
    vc.push_back(c);
    return vc;
}

///
/// @brief Compresses the contents of `is` and writes the result to `os`.
/// @param [in] is      input stream
/// @param [out] os     output stream
///
void compress(std::istream &is, std::ostream &os)
{
    // the Code Table, aka the Dictionary
    std::map<std::vector<char>, CodeType> code_table;

    // "named" lambda function, used to reset the code table to its initial contents
    auto reset_code_table = [&code_table] {
        code_table.clear();

        char c = std::numeric_limits<char>::min();

        do
        {
            code_table.insert({{c}, code_table.size()});
        }
        while (c++ != std::numeric_limits<char>::max());
    };

    reset_code_table();

    std::vector<char> s; // String
    char c;

    while (is.get(c))
    {
        // code table's maximum size was reached
        if (code_table.size() == globals::ctms)
            reset_code_table();

        s.push_back(c);

        if (code_table.count(s) == 0)
        {
            code_table.insert({s, code_table.size()});
            s.pop_back();
            os.write(reinterpret_cast<const char *> (&code_table.at(s)), sizeof (CodeType));
            s = {c};
        }
    }

    if (!s.empty())
        os.write(reinterpret_cast<const char *> (&code_table.at(s)), sizeof (CodeType));
}

///
/// @brief Decompresses the contents of `is` and writes the result to `os`.
/// @param [in] is      input stream
/// @param [out] os     output stream
///
void decompress(std::istream &is, std::ostream &os)
{
    // the Code Table, aka the Dictionary
    std::vector<std::vector<char>> code_table;

    // "named" lambda function, used to reset the code table to its initial contents
    auto reset_code_table = [&code_table] {
        code_table.clear();
        code_table.reserve(globals::ctms);

        char c = std::numeric_limits<char>::min();

        do
        {
            code_table.push_back({c});
        }
        while (c++ != std::numeric_limits<char>::max());
    };

    reset_code_table();

    std::vector<char> s; // String
    CodeType k; // Key

    while (is.read(reinterpret_cast<char *> (&k), sizeof (CodeType)))
    {
        // code table's maximum size was reached
        if (code_table.size() == globals::ctms)
            reset_code_table();

        if (k > code_table.size())
            throw std::runtime_error("invalid compressed code");
        else
        if (k == code_table.size())
            code_table.push_back(s + s.front());
        else
        if (!s.empty())
            code_table.push_back(s + code_table.at(k).front());

        os.write(&code_table.at(k).front(), code_table.at(k).size());
        s = code_table.at(k);
    }
}

///
/// @brief Prints usage information and a custom error message.
/// @param s    custom error message to be printed
/// @param su   Show Usage information
///
void print_usage(const std::string &s = "", bool su = true)
{
    if (!s.empty())
        std::cerr << "\nERROR: " << s << '\n';

    if (su)
    {
        std::cerr << "\nUsage:\n";
        std::cerr << "\tprogram -flag input_file output_file\n\n";
        std::cerr << "Where `flag' is either `c' for compressing, or `d' for decompressing, and\n";
        std::cerr << "`input_file' and `output_file' are distinct files.\n\n";
        std::cerr << "Examples:\n";
        std::cerr << "\tlzw.exe -c license.txt license.lzw\n";
        std::cerr << "\tlzw.exe -d license.lzw new_license.txt\n";
    }

    std::cerr << std::endl;
}

} // namespace

///
/// @brief Actual program entry point.
/// @param argc             number of command line arguments
/// @param [in] argv        array of command line arguments
/// @retval EXIT_FAILURE    for failed operation
/// @retval EXIT_SUCCESS    for successful operation
///
int main(int argc, char *argv[])
{
    if (argc != 4)
    {
        print_usage("Wrong number of arguments.");
        return EXIT_FAILURE;
    }

    enum class Mode {
        Compress,
        Decompress
    };

    Mode m;

    if (std::string(argv[1]) == "-c")
        m = Mode::Compress;
    else
    if (std::string(argv[1]) == "-d")
        m = Mode::Decompress;
    else
    {
        print_usage(std::string("flag `") + argv[1] + "' is not recognized.");
        return EXIT_FAILURE;
    }

    std::ifstream input_file(argv[2], std::ios_base::binary);

    if (!input_file.is_open())
    {
        print_usage(std::string("input_file `") + argv[2] + "' could not be opened.");
        return EXIT_FAILURE;
    }

    std::ofstream output_file(argv[3], std::ios_base::binary);

    if (!output_file.is_open())
    {
        print_usage(std::string("output_file `") + argv[3] + "' could not be opened.");
        return EXIT_FAILURE;
    }

    try
    {
        input_file.exceptions(std::ios_base::badbit);
        output_file.exceptions(std::ios_base::badbit | std::ios_base::failbit);

        if (m == Mode::Compress)
            compress(input_file, output_file);
        else
        if (m == Mode::Decompress)
            decompress(input_file, output_file);
    }
    catch (const std::ios_base::failure &f)
    {
        print_usage(std::string("File input/output failure: ") + f.what() + '.', false);
        return EXIT_FAILURE;
    }
    catch (const std::exception &e)
    {
        print_usage(std::string("Caught exception: ") + e.what() + '.', false);
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}
I made a couple changes to make it compile on VC++ (mostly changing {} notations to emplace/emplace_back) and it worked flawlessly on a 46k file with miscellaneous text and a 1048k file with primes in it. I was hoping it would provide some information about where things were failing, but it didn't cooperate.

I was going to try gcc with the same changes made, but I need to install a newer version to do that. Maybe down the road a bit.

Search for CHANGE in the following to see the changes I made to make it work on VC++.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
///
/// @file
/// @author Julius Pettersson
/// @copyright MIT/Expat License.
/// @brief LZW file archiver
///
/// This is the C++11 implementation of a Lempel-Ziv-Welch single-file command-line archiver.
/// It uses the simpler fixed-width code compression method.
/// It was written with Doxygen comments.
///
/// http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
/// http://marknelson.us/2011/11/08/lzw-revisited/
/// http://www.cs.duke.edu/csed/curious/compression/lzw.html
/// http://en.cppreference.com/
/// http://www.doxygen.org/
///

#include <cstdint>
#include <cstdlib>
#include <exception>
#include <fstream>
#include <ios>
#include <iostream>
#include <limits>
#include <map>
#include <ostream>
#include <stdexcept>
#include <string>
#include <vector>

namespace {

typedef std::uint16_t CodeType; ///< Type used to store and retrieve codes.

namespace globals {

/// Code Table Maximum Size (when reached, the code table will be reset)
const CodeType ctms = std::numeric_limits<CodeType>::max();

} // namespace globals

///
/// @brief Helper operator intended to simplify code.
/// @param vc   original vector
/// @param c    element to be appended
/// @returns vector resulting from appending `c` to `vc`
///
std::vector<char> operator + (std::vector<char> vc, char c)
{
    vc.push_back(c);
    return vc;
}

///
/// @brief Compresses the contents of `is` and writes the result to `os`.
/// @param [in] is      input stream
/// @param [out] os     output stream
///
void compress(std::istream &is, std::ostream &os)
{
    // the Code Table, aka the Dictionary
    std::map<std::vector<char>, CodeType> code_table;

    // "named" lambda function, used to reset the code table to its initial contents
    auto reset_code_table = [&code_table] {
        code_table.clear();

        char c = std::numeric_limits<char>::min();

        do
        {
            code_table.emplace( std::vector<char>(1,c), code_table.size()) ;
            // code_table.insert({{c}, code_table.size()}); // CHANGE
        }
        while (c++ != std::numeric_limits<char>::max());
    };

    reset_code_table();

    std::vector<char> s; // String
    char c;

    while (is.get(c))
    {
        // code table's maximum size was reached
        if (code_table.size() == globals::ctms)
            reset_code_table();

        s.push_back(c);

        if (code_table.count(s) == 0)
        {
            code_table.emplace(s, code_table.size()) ;
            // code_table.insert({s, code_table.size()}); // CHANGE
            s.pop_back();
            os.write(reinterpret_cast<const char *> (&code_table.at(s)), sizeof (CodeType));
            s = std::vector<char>(1,c);
            // s = {c};  // CHANGE
        }
    }

    if (!s.empty())
        os.write(reinterpret_cast<const char *> (&code_table.at(s)), sizeof (CodeType));
}

///
/// @brief Decompresses the contents of `is` and writes the result to `os`.
/// @param [in] is      input stream
/// @param [out] os     output stream
///
void decompress(std::istream &is, std::ostream &os)
{
    // the Code Table, aka the Dictionary
    std::vector<std::vector<char>> code_table;

    // "named" lambda function, used to reset the code table to its initial contents
    auto reset_code_table = [&code_table] {
        code_table.clear();
        code_table.reserve(globals::ctms);

        char c = std::numeric_limits<char>::min();

        do
        {
            code_table.emplace_back(std::vector<char>(1,c)); 
            // code_table.push_back({c}); // CHANGE
        }
        while (c++ != std::numeric_limits<char>::max());
    };

    reset_code_table();

    std::vector<char> s; // String
    CodeType k; // Key

    while (is.read(reinterpret_cast<char *> (&k), sizeof (CodeType)))
    {
        // code table's maximum size was reached
        if (code_table.size() == globals::ctms)
            reset_code_table();

        if (k > code_table.size())
            throw std::runtime_error("invalid compressed code");
        else
        if (k == code_table.size())
            code_table.push_back(s + s.front());
        else
        if (!s.empty())
            code_table.push_back(s + code_table.at(k).front());

        os.write(&code_table.at(k).front(), code_table.at(k).size());
        s = code_table.at(k);
    }
}

///
/// @brief Prints usage information and a custom error message.
/// @param s    custom error message to be printed
/// @param su   Show Usage information
///
void print_usage(const std::string &s = "", bool su = true)
{
    if (!s.empty())
        std::cerr << "\nERROR: " << s << '\n';

    if (su)
    {
        std::cerr << "\nUsage:\n";
        std::cerr << "\tprogram -flag input_file output_file\n\n";
        std::cerr << "Where `flag' is either `c' for compressing, or `d' for decompressing, and\n";
        std::cerr << "`input_file' and `output_file' are distinct files.\n\n";
        std::cerr << "Examples:\n";
        std::cerr << "\tlzw.exe -c license.txt license.lzw\n";
        std::cerr << "\tlzw.exe -d license.lzw new_license.txt\n";
    }

    std::cerr << std::endl;
}

} // namespace

///
/// @brief Actual program entry point.
/// @param argc             number of command line arguments
/// @param [in] argv        array of command line arguments
/// @retval EXIT_FAILURE    for failed operation
/// @retval EXIT_SUCCESS    for successful operation
///
int main(int argc, char *argv[])
{
    if (argc != 4)
    {
        print_usage("Wrong number of arguments.");
        return EXIT_FAILURE;
    }

    enum class Mode {
        Compress,
        Decompress
    };

    Mode m;

    if (std::string(argv[1]) == "-c")
        m = Mode::Compress;
    else
    if (std::string(argv[1]) == "-d")
        m = Mode::Decompress;
    else
    {
        print_usage(std::string("flag `") + argv[1] + "' is not recognized.");
        return EXIT_FAILURE;
    }

    std::ifstream input_file(argv[2], std::ios_base::binary);

    if (!input_file.is_open())
    {
        print_usage(std::string("input_file `") + argv[2] + "' could not be opened.");
        return EXIT_FAILURE;
    }

    std::ofstream output_file(argv[3], std::ios_base::binary);

    if (!output_file.is_open())
    {
        print_usage(std::string("output_file `") + argv[3] + "' could not be opened.");
        return EXIT_FAILURE;
    }

    try
    {
        input_file.exceptions(std::ios_base::badbit);
        output_file.exceptions(std::ios_base::badbit | std::ios_base::failbit);

        if (m == Mode::Compress)
            compress(input_file, output_file);
        else
        if (m == Mode::Decompress)
            decompress(input_file, output_file);
    }
    catch (const std::ios_base::failure &f)
    {
        print_usage(std::string("File input/output failure: ") + f.what() + '.', false);
        return EXIT_FAILURE;
    }
    catch (const std::exception &e)
    {
        print_usage(std::string("Caught exception: ") + e.what() + '.', false);
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}
Thank you, cire.
Unfortunately, your code doesn't compile with Nuwen's MinGW 4.7.2 (if you tried an earlier version):

lzw.cire.cpp: In lambda function:
lzw.cire.cpp:72:24: error: 'class std::map<std::vector<char>, short unsigned int>' has no member named 'emplace'
lzw.cire.cpp: In function 'void {anonymous}::compress(std::istream&, std::ostream&)':
lzw.cire.cpp:93:24: error: 'class std::map<std::vector<char>, short unsigned int>' has no member named 'emplace'

Last edited on
compiles and runs with clang-3.2/libc++-svn and with gcc-4.8-20121216
@ Cubbi: do both versions (mine and cire's) run correctly if you compile them with -O3? Thanks.
I used -O3 with both compilers. The version I tried was the last one posted (cire's, with map::emplace). I'll give the other one a try in the evening.
Anybody have any more suggestions, or see some bugs?
Suggestion: compile under a C++11 compiler.
Topic archived. No new replies allowed.