Review my code? Bigrams Probabilities

The following is the code I wrote for this assignment: https://drive.google.com/file/d/0B-t5ghDb_TCqVzRvNFkycTRjUzQ/view?pref=2&pli=1

Here were the files we were given to read from: https://drive.google.com/folderview?id=0B-t5ghDb_TCqcGJlR2lWb0pmdHM&usp=drive_web

A simple request: critique my code.

Tell me all the things I've done wrong, all the bad practices that a professional programmer would cringe at (beside including namespace std which I was too lazy to change). Anything I can improve on?

Also, my professor mentioned that this assignment could be done in 20 lines or less. Was she BS-ing or is that really possible? Keep in mind the using maps for this assignment was mandatory.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
#include <iostream>
#include <fstream>
#include <map>
#include <utility> // for pair

using namespace std;

typedef map<pair<string, string>, int> StrPairIntMap;
typedef map<string, int>  StrIntMap;
typedef map<pair<string, string>, double> StrPairDoubleMap;

StrPairDoubleMap CalcBigrams(ifstream&);
void PrintBigrams(StrPairDoubleMap&, string, string);
int main(int argc, char* argv[])
{
    if(argc < 2) // checking to see if 2 arguments were passed from command line
    {
        cout << "TOO FEW ARGUMENTS\n";
        return 0;
    }

    ifstream file(argv[1]) // delcare and open file

    if(!file) // check if file opened
    {
        cout << "FILE NOT OPENED";
        return 0;
    }

    string w1, w2;
    StrPairDoubleMap m3 = CalcBigrams(file);
    while(true)
    {
        cout << "Enter two words (q to quit): ";

        cin >> w1;
        if(w1 == "q") break;

        cin >> w2;
        if (w2 == "q")break;
        PrintBigrams(m3, w1, w2);
    }

    file.close();

    return 0;
}

// This function calculates the probabilities of a pair of words appearing
// using the formula freq(w1, w2)/freq(w1)
StrPairDoubleMap CalcBigrams(ifstream& file)
{

    string word1, word2;
    StrPairIntMap m1;
    StrIntMap m2;
    StrPairDoubleMap m3;

    // while loop reads in words until end of file
    file >> word1;
    while(file)
    {
        file >> word2;
        m1[make_pair(word1, word2)]++; // insert pair and increment int value
        m2[word1]++; // insert word and increment int value
        word1 = word2;
    }

    StrPairIntMap::iterator it1 = m1.begin();
    double prob;

    // loop until end of m1
    while(it1 != m1.end())
    {
        prob = (1.0*it1->second)/m2[it1->first.first]; // int value / int value
        m3[it1->first] = prob; // insert word1 to m3, set double value to prob
        it1++;
    }

    return m3;
}

//This function simply prrints the probability of pair of words
void PrintBigrams(StrPairDoubleMap& m3, string w1, string w2)
{
    cout << m3[make_pair(w1, w2)] << endl;
}
Last edited on
beside including namespace std which I was too lazy to change


You really have to bust out of this bad habit, sooner rather than later :+)

Pedantically, consider having using instead of typedef, even though they have exactly the same meaning in this context, using is considered more "modern".

Also pedantically, I am not a fan of declaring multiple variables on one line. And always initialise non container types, even when you get input straight after, as in the double on line 70.


69
70
71
72
73
74
75
76
77
78
 StrPairIntMap::iterator it1 = m1.begin();
    double prob;

    // loop until end of m1
    while(it1 != m1.end())
    {
        prob = (1.0*it1->second)/m2[it1->first.first]; // int value / int value
        m3[it1->first] = prob; // insert word1 to m3, set double value to prob
        it1++;
    }


Could you use a range based for loop there? Seen as you are iterating through the whole container :+) It will make the code cleaner IMO.

1
2
3
4
void PrintBigrams(StrPairDoubleMap& m3, string w1, string w2)
{
    cout << m3[make_pair(w1, w2)] << endl;
}


Pass the parameters by const reference, especially for containers or classes like std::string.

Consider using "\n" rather than std::endl.

Missing a ; on line 22 - compiler told me that :+)
Last edited on
> They can be interpreted as "how likely is it that word 2 will follow word 1?"

Shouldn't we then be ignoring case and trailing punctuation?
"be" follows "to" in "To be, ...", "... to be ...", "... to be."

Instead of storing each unique pair, wouldn't a map storing the successors of each word suffice?
May be something along these lines?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
// remove punctuation at the end, convert to lower case
std::string& clean( std::string& token )
{
    while( !token.empty() && std::ispunct( token.back() ) ) token.pop_back() ;
    for( char& c : token ) c = std::tolower(c) ;
    return token ;
}

// map word to ordered list of successors of the word
using successor_map = std::map< std::string, std::multiset<std::string> > ;

successor_map make_successor_map( std::istream& stm )
{
    successor_map map ;

    std::string first ;
    if( stm >> first )
    {
        clean(first) ;

        std::string second ;
        while( stm >> second )
        {
            map[first].insert( clean(second) ) ;
            first = second ;
        }
        map[first].insert( "\0" ) ; // the last word has no successor
    }

    return map ;
}

double bigram_probability( const successor_map& map, std::string w1, std::string w2 )
{
    const auto iter = map.find( clean(w1) ) ;
    if( iter == map.end() ) return NAN ;
    else return iter->second.count( clean(w2) ) / double( iter->second.size() ) ;
}

http://coliru.stacked-crooked.com/a/1cf1c1482a928910
@TheIdeadMan:As for using range-based for-loops, all my assignments have to be done in c++98. So Im not really used to c++11 anyway, but of course I do intend to gradually transition myself into it.

Everything else has been noted and will be implemented in my future programs.

@JLBorges: Regarding punctuation, I had actually asked my professor about this and she said to leave it. Thats why I dont have a clean function. However, I know it makes more sense to remove the punctuation.

I'll have to look at your code in detail later.
Last edited on
On a related note, what do you guys say about making a list of all the things you will be using from std namespace in the beginning of your code, so that you wont have to include the scope and qualifier in the code afterwards?

So for example, if I do the following:
1
2
3
4
5
6
7
8
9
10
using std::map;
using std::pair;
using std::ifstream;
using std::cout;
using std::endl;
using std::string;
using std::cin;


// rest of my code will not need std:: 


Is this also considered a bad practice? I've done this for a couple of programs where I know I'll be using cout, cin, endl and string a lot.

Last edited on
It's better than using namespace std;, but personally I don't like either, at least for std. For non-standard namespaces, I prefer to put the using in the scope it will be used:
1
2
3
4
int f(T x){
    using foo::bar::make_snafucated;
    return make_snafucated(x);
}
Last edited on
Hi,

I used to do that, now I find it easier to just use std:: everywhere. I guess I was tired of putting the same using statements at the top of each file every time, and it is easy to accumulate lots of them. All the experienced C++ coders all seem to use std:: everywhere.

With std::endl , it flushes the buffer each time, so it could be a little inefficient. That's why using "\n" is often used instead. Apparently std::cout will still flush automatically if required. Have a look at these:

http://en.cppreference.com/w/cpp/io/cerr
http://en.cppreference.com/w/cpp/io/clog
Something weird is going on.....

If I run this code on CodeBlocks, I get .0361174 as the output for the words "with me".
I run the same code on linux (using command line arguments) with the words "with me" and I get .0181488

The second one is correct as it matches the number the professor got. But why am I getting a different value when I run it on codeblocks? Both codes are exactly the same. I even pasted the one on codeblocks into a file on linux and compared the files using diff command, and they were exactly the same.

Am I missing something?
Last edited on
> Am I missing something?

Yes.

1
2
3
4
5
6
7
8
while(file)
{
    file >> word2; // *** what would happen if this attempted input fails?
                   // *** if we blithely proceed as if word2 was read in successfully?

    m1[make_pair(word1, word2)]++; // insert pair and increment int value
    // ...
}
Im still not understanding. Doesn't the while condition guarantee that word2 was read in?

And if there is indeed something going on here, why is it only happening on codeblocks? Wouldn't the problem persist on linux terminal?

Forgive me if I'm being thick-headed here. You'll have to elaborate as I am still lost as to what is wrong with the code.
Also, if it makes a difference, Im using g++ compiler on the linux terminal and my compiler in codeblocks is set to GNU GCC. There is no g++ compiler option in the codeblocks version I have.
Im still not understanding. Doesn't the while condition guarantee that word2 was read in?

No. All it guarantees is that file was not in an error state when the loop begin. A more likely loop condition would be: while (file >> word2)
Just tried that. Still gives me the same result.
Topic archived. No new replies allowed.