Reading huffman tree from file

In the code below, in the decode function, I try rebuild the huffman tree generated in the encode function, and pass this tree to the getCode function and get the character's codes to decode the file.

It should be the same code, but I notice when I visualize the codes (either in the encode function and in the decode function), that this 2 lists have the first and last characters misplaced (they are in switched places).

Everything else about the list of nodes read from file are the same as the one that was written: same size, same order for the other characters, etc. Only the first and last that are being switched.

Anyone can tell me what I am missing here?

I am using this text file as example: https://pastebin.com/zd5hJSgk. the complete code for this project is here: https://github.com/klebermo/huffman

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
#include <map>
#include <string>
#include <fstream>
#include <bitset>
#include <algorithm>

void getCode(std::map<char,std::string> * table, huffmanNode * node, std::string code = "") {
  if(node->getLeft() == nullptr && node->getRight() == nullptr) {
    table->insert({node->getData(), code});
    std::cout << node->getData() << ": " << code << std::endl;
    return;
  }

  if(node->getLeft() != nullptr) getCode(table, node->getLeft(), code+"0");
  if(node->getRight() != nullptr) getCode(table, node->getRight(), code+"1");
}

std::string getSubstring(int length) {
  std::string result = "";
  for(int i=0; i<length; i++)
    result = result + "0";
  return result;
}

void encode(std::string input_file, std::string output_file) {
  std::list<huffmanNode> priorityQueue;
  huffmanTree toEncode;
  std::map<char, std::string> encodeTable;

  std::ifstream input(input_file, std::ios_base::in);
  std::ofstream output(output_file, std::ios_base::out | std::ios_base::binary);

  if(input.is_open()) {
    char c;
    while(input.get(c)) {
      huffmanNode node(c, 1);

      std::list<huffmanNode>::iterator pos = find(begin(priorityQueue), end(priorityQueue), node);
      if(pos != end(priorityQueue)) {
        (*pos)++;
      } else {
        priorityQueue.push_back(node);
      }
    }
  }

  priorityQueue.sort();

  while(priorityQueue.size() > 1) {
    huffmanNode * left = new huffmanNode(priorityQueue.front());
    priorityQueue.pop_front();

    huffmanNode * right = new huffmanNode(priorityQueue.front());
    priorityQueue.pop_front();

    huffmanNode * z = new huffmanNode();
    z->setData(0x0);
    z->setFrequency(left->getFrequency() + right->getFrequency());
    z->setLeft(left);
    z->setRight(right);

    priorityQueue.push_back(*z);
  }

  toEncode.setRoot(&priorityQueue.front());
  getCode(&encodeTable, toEncode.getRoot());

  if (output.is_open()) {
    std::list<huffmanNode> pre = toEncode.preOrder();
    std::vector<huffmanNode> pre_vector(pre.begin(), pre.end());

    std::vector<huffmanNode>::size_type size = 0;
    for(std::vector<huffmanNode>::size_type i=0; i<pre.size(); i++)
      if(pre_vector[i].getLeft() == nullptr && pre_vector[i].getRight() == nullptr) size++;

    output.write(reinterpret_cast<char*>(&size), sizeof(size));

    for(std::vector<huffmanNode>::size_type i=0; i<pre.size(); i++) {
      if(pre_vector[i].getLeft() == nullptr && pre_vector[i].getRight() == nullptr) {
        char c = pre_vector[i].getData();
        output.write(&c, sizeof(c));
        int f = pre_vector[i].getFrequency();
        output.write(reinterpret_cast<char*>(&f), sizeof(f));
      }
    }

    input.clear();
    input.seekg(0, std::ios::beg);

    std::string encoded_string = "";
    char c;
    while(input.get(c)) {
      std::string code = encodeTable[c];
      encoded_string = encoded_string + code;
    }
    encoded_string = encoded_string + getSubstring(encoded_string.size() % 8);

    for(long unsigned int i=0; i<encoded_string.size(); i+=8) {
      std::string sub = encoded_string.substr(i, 8);
      std::bitset<8> bits(sub);
      unsigned long x = bits.to_ulong();
      unsigned char c = static_cast<unsigned char>(x);
      output.write(reinterpret_cast<char*>(&c), sizeof(c));
    }
  }

  input.close();
  output.close();
}

void decode(std::string input_file, std::string output_file) {
  std::list<huffmanNode> priorityQueue;
  huffmanTree toDecode;
  std::map<char, std::string> decodeTable;

  std::ifstream input(input_file, std::ios_base::in | std::ios_base::binary);
  std::ofstream output(output_file, std::ios_base::out);

  std::string encoded_string;
  if (input.is_open()) {
    std::vector<huffmanNode>::size_type size;
    input.read(reinterpret_cast<char*>(&size), sizeof(size));

    for(std::vector<huffmanNode>::size_type i=0; i<size; i++) {
      char c;
      input.read(&c, sizeof(c));
      int f;
      input.read(reinterpret_cast<char*>(&f), sizeof(f));

      huffmanNode node(c, f);
      priorityQueue.push_back(node);
    }

    char c;
    while(input.read(reinterpret_cast<char*>(&c), sizeof(c))) {
      std::bitset<8> bits(c);
      encoded_string = encoded_string + bits.to_string();
    }
  }

  priorityQueue.sort();

  while(priorityQueue.size() > 1) {
    huffmanNode * left = new huffmanNode(priorityQueue.front());
    priorityQueue.pop_front();

    huffmanNode * right = new huffmanNode(priorityQueue.front());
    priorityQueue.pop_front();

    huffmanNode * z = new huffmanNode();
    z->setData(0x0);
    z->setFrequency(left->getFrequency() + right->getFrequency());
    z->setLeft(left);
    z->setRight(right);

    priorityQueue.push_back(*z);
  }

  toDecode.setRoot(&priorityQueue.front());
  getCode(&decodeTable, toDecode.getRoot());

  if(output.is_open()) {
    std::string decoded_string;

    for(long unsigned int i=0; i<encoded_string.size(); i++) {
      for(auto it = decodeTable.begin(); it != decodeTable.end(); it++) {
        std::string code = it->second;
        std::string temp = encoded_string.substr(0, code.length());

        if(temp.compare(code) == 0) {
          decoded_string = decoded_string + it->first;
          encoded_string.erase(0, code.length());
        }
      }
    }

    output << decoded_string;
  }

  input.close();
  output.close();
}

int main(int argc, char ** argv) {
  if (argc < 4) {
    std::cout << "Usage: huffman (encode|decode) <input file> <output file>" << std::endl;
    return 0;
  }

  std::string action(argv[1]), input(argv[2]), output(argv[3]);

  if(action == "encode") encode(input, output);

  if(action == "decode") decode(input, output);

  return 0;
}
Last edited on
A problem could be that when you sort the elements the order of the equal elements are not maintained. stable_sort(...) might help. See:

https://cplusplus.com/reference/algorithm/stable_sort/
But I don't have equal elements in the tree (it contains the characters of the text, when reading the characters the frequency is added if they are equals, not a duplicate added to the list).
Well, that code is too complex to just skim through. I suggest that you use a debugger to go step by step through the code and see what happens. Can you reproduce the effect with another (smaller) text?

Is it relevant where within the tree that character appears?

What I mean with 'not maintained' is:
On line 35 you read all characters into the list (not a priority queue). On line 47 you sort the elements according to their frequency but the order of the same frequencies is not maintained.

How do you ensure at all that the correct order of the characters are restored after decode?
Is it relevant where within the tree that character appears?


Yes, it is.

How do you ensure at all that the correct order of the characters are restored after decode?


In the loop of lines 78-85, I save the frequencies for each character alongside them, and in the decode function I read both of them and instantiate a huffmanNode with this 2 values to add to the list.
Topic archived. No new replies allowed.