trouble with reading words from a file into a hash_map

Hello,

I am completely new to C++ and am trying to learn it. I am trying to write a program which, given some word and a filename, keeps a tally of each word that comes immediately after each instance of the specified word. So, for a test file (called test_file) that looks like this:


test one
test two test two
test three test three test three
test four test four test four test four
test one


and given "test" as the specified word, I want to create a mapping that looks like this:

one -> 2
two -> 2
three -> 3
four -> 4


This is what I tried:

#include <iostream>
#include <string>
#include <stdio.h>
#include <cstring>  // strcmp
#include <fstream>
#include <ext/hash_map>  
#include <vector>

//g++ posterior_word_debug.cpp -o posterior_word_debug

using namespace std;
using namespace __gnu_cxx;

//string equality for hash_map  
struct eqstr
{
  bool operator()(const char* s1, const char* s2) const
  {
    return strcmp(s1, s2) == 0;
  }
};

int main( int argc, char *argv[] ) {

    string word = "test";
    string filename = "test_file";
    hash_map<const char*, int, hash<const char*>, eqstr> words;
    vector<string> unique_words;
    ifstream read_file(filename.c_str()); 
    string curr_word = "";
    string prev_word = "";
    while (read_file >> curr_word) {

        if (prev_word == word) {
            if (words[curr_word.c_str()] == 0) {
                unique_words.push_back(curr_word.c_str());
            }
            words[curr_word.c_str()]++;  
        }
        prev_word = curr_word;
        
//         // [A] This is a hack that makes it work.
//         int a = words["one"];
//         a=words["two"];
//         a=words["three"];
//         a=words["four"];  


        // [B] This does NOT make it work.
        //This is the one I would want, though, because I won't
        //always know what the literal strings will be.
//     int a = 0;
//         for (int i=0; i<unique_words.size(); i++) {
//             a = words[unique_words[i].c_str()];
//         }

        
    }

    //show unique words and their counts
    for (int i=0; i<unique_words.size(); i++) {
        cout << unique_words[i] << " -> " << words[unique_words[i].c_str()] << endl;
    }
  
    return 0;

}


When it is run as is, it outputs:

one -> 1
two -> 1
two -> 1
three -> 2
three -> 2
four -> 3
four -> 3
one -> 1

This is the first thing that I don't really understand: it looks like the vector of unique words isn't doing what it should. Also, the counts are all wrong. It looks like it's making duplicate entries in the hash_map instead of incrementing the old ones, which, given the equality function, it shouldn't be doing. But while trying to debug this, I added the section labelled A (originally using "cout"s to look at what it was doing). If you run it with that part uncommented, it outputs:

one -> 2
two -> 2
three -> 3
four -> 4

which is perfect! But it is also the second thing I don't understand: why should assigning the numbers in the mapping to a, and then not using that variable at all, change the results? Also, I don't want to have to write "a=words[foo];" for each and every unique word foo, because for inputs other than test_file, I won't know what those will be. Something more like the section labelled B would be better.

However, when A is commented and B is uncommented, the output is this again:

one -> 1
two -> 1
two -> 1
three -> 2
three -> 2
four -> 3
four -> 3
one -> 1

which is not what I want it to do.

Thinking that perhaps there was something weird about reading the words directly from a file, I tried, instead, to load all the words from a file into a vector<string>, and then, when populating the hash_map, to read from the vector instead of directly from the file, like this:

int main( int argc, char *argv[] ) {
    
    //try without directly reading from file....
    vector<string> words_from_file;
    string filename = "test_file";
    ifstream read_test_file(filename.c_str());
    string file_word = "";
    while (read_test_file >> file_word) {
        words_from_file.push_back(file_word.c_str());
    }

    string word = "test";
    hash_map<const char*, int, hash<const char*>, eqstr> words;    
    vector<string> unique_words;
    string curr_word = "";
    string prev_word = "";
    for (int i=0; i<words_from_file.size(); i++) {
        curr_word = words_from_file[i];
        if (prev_word == word) {
            if (words[curr_word.c_str()] == 0) {
                unique_words.push_back(curr_word.c_str());
            }
            words[curr_word.c_str()]++;  
        }
        prev_word = curr_word;

    }

    //unique_word -> count
    for (int i=0; i<unique_words.size(); i++) {
        cout << unique_words[i] << " -> " << words[unique_words[i].c_str()] << endl;
    }

    return 0;

}


which outputs this:

one -> 2
two -> 2
three -> 3
four -> 4

which is perfect, and without even using the dummy variable a! However, reading the entire file into a vector in memory is likely to be very expensive for larger files, so I would rather not do it this way.

Does anyone know why it behaves so weirdly when you read directly from the file (i.e. why you need the dummy variable for it to work)? Any suggestions for how to fix this without reading the entire file into a vector? I'm not even really sure how to classify the problem, so any advice on how to fix this or where to look would be greatly appreciated.

thanks,
emu