program to find most common word

So covid19 has struck, I have no excuses in the world not to code so I wrote a simple program to count the number of characters and number of words in a text file, not only that but I decided to find the most common word in that text file.

so I want some feedback, to me the code looks messy and complicated and also lacks modularity, so should I break some of the loops in the mostCommonWord function into possibly helper functions?

(also yes I know.... what happens if 2 or more words are as equally common, I have yet to implement that.)

thanks!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157

#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>

using namespace std;

void sort(vector<string>& words){

    bool fixed = false;

    while(!fixed){

        fixed = true;

        for(int i = 0; i < words.size()-1; ++i){

            if(words.at(i) > words.at(i+1)){

                string temp = words.at(i);
                words.at(i) = words.at(i+1);
                words.at(i+1) = temp;
                fixed = false;
            }
        }
    }
}

string mostCommonWord(vector<string>& words){

     string mostCommon;

     vector<string> paragraph;
     vector<string> mostCommonWords;
     stringstream ss;


     for(int i = 0; i < words.size(); ++i){

        vector<string> allWords;
        string word;
        ss << words.at(i);

        while(ss >> word){

            if(word.at(word.size()-1) == ',' || word.at(word.size()-1) == '.')
                word.erase(word.size()-1,1);

            allWords.push_back(word);
        }

        sort(allWords);
        int biggestCount = 0;
        int biggestIndex = 0;
        int count = 0;

        for(int i = 0; i < allWords.size()-1; ++i){

            if(allWords.at(i) == allWords.at(i+1)){

                count++;
            }
            else
                count = 0;

            if(count > biggestCount){
                biggestCount = count;
                biggestIndex = i;
            }
        }
        mostCommonWords.push_back(allWords.at(biggestIndex));


        ss.str(string() );
        ss.clear();

     }

    for(int i = 0; i < words.size(); ++i){

        vector<string> allWords;
        string word;
        ss << words.at(i);

        while(ss >> word){

            if(word.at(word.size()-1) == ',' || word.at(word.size()-1) == '.')
                word.erase(word.size()-1,1);

            allWords.push_back(word);
        }

        int highestCount = 0;
        int count = 0;
        int biggestIndex = 0;

        for(int j = 0; j < mostCommonWords.size(); ++j){

            count = 0;

            for(int k = 0; k < allWords.size(); ++k){

                if(mostCommonWords.at(j) == allWords.at(k)){

                    ++count;
                    if(count > highestCount){

                        highestCount = count;
                        biggestIndex = j;
                    }
                }
            }
        }
        return mostCommonWords.at(biggestIndex);
    }
}

int main()
{

    ifstream in;
    in.open("words.txt");
    string word;
    stringstream ss;
    int wordCount = 0;
    int charCount = 0;

    while(in >> word){

        charCount+= word.size();
        wordCount++;
    }

    in.clear();
    in.seekg(0,ios::beg);

    cout << "word count == " << wordCount << endl;
    cout << "char count == " << charCount << endl;

    vector<string> allWords;
    string tempString;

    while(getline(in,tempString)){

        if(tempString == "")
            continue;
        else
            allWords.push_back(tempString);
    }

    string mostCommon = mostCommonWord(allWords);
    cout << "most common word = " << mostCommon << endl;

    return 0;
}
Just at a glance, it seems like you're doing unnecessary work.
First, you're parsing the file more than once. Second, it seems like you convert things from and into stringstreams redundantly.

Also, from a readability standpoint, you have a vector<string> called "allWords" but this appears to be "lines" instead from main. But then within mostCommonWord, you have "allWords" and "words" as variable names, with "allWords" in main becoming "words" in mostCommonWord. A bit confusing.

Here's what I would do:
- use >> to get each word, delimited by whitespace
- parse out the commas or periods (your line 88)
- add to a map<string, int>, where int is the number of occurrences of that word.
- Then, after you're done iterating, I think you'll be able to leverage the standard library's max_element function to find the most common word instead.

Edit: You actually can also avoid using max_element if you keep track of the biggest {word + count} pair as you go.
Last edited on
so should I break some of the loops in the mostCommonWord function into possibly helper functions?


This is probably one of the biggest issues I see with the code that I'm working with. Functions get so long that you cannot figure out what the function is supposed to be doing without a lot of time poring over it.

So: If you are asking the question, then YES! break it into separate functions. With experience you will see when you are overdoing it, but for for your toy project, err on the side of breaking things into pieces that are too small.
I think this is a valid simplification of your code.
Note what Ganado said about using a map. It's a lot easier.
And of course, if you really need to sort a vector, there's std::sort.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
#include <iostream>
#include <fstream>
#include <vector>
#include <cctype>
using namespace std;

void sort(vector<string>& words)
{
    for (bool sorted = false; !sorted; )
    {
        sorted = true;
        for (size_t i = 0; i < words.size() - 1; ++i)
            if (words.at(i) > words.at(i + 1))
            {
                swap(words.at(i), words.at(i + 1));
                sorted = false;
            }
    }
}

void lowercase(string& s)
{
    for (auto& ch: s) ch = tolower(ch);
}

string mostCommonWord(vector<string>& words)
{
    for (auto& w: words)
    {
        if (ispunct(w.back())) w.pop_back();
        lowercase(w);
    }

    sort(words);

    int biggestCount = 1, biggestIndex = 1, count = 1;
    for (size_t i = 0; i < words.size(); ++i)
    {
        if (i < words.size() - 1 && words.at(i) == words.at(i + 1))
            ++count;
        else
        {
            if (count > biggestCount)
            {
                biggestCount = count;
                biggestIndex = i;
            }
            count = 1;
        }
    }

    return words.at(biggestIndex);
}

int main()
{
    vector<string> allWords;
    int wordCount = 0, charCount = 0;
    {
        ifstream in("words.txt");
        string word;
        while (in >> word)
        {
            allWords.push_back( word );
            charCount += word.size();
            ++wordCount;
        }
    }
    cout << "word count == " << wordCount << '\n';
    cout << "char count == " << charCount << '\n';

    string mostCommon = mostCommonWord(allWords);
    cout << "most common word = " << mostCommon << '\n';
}

I am with Ganado here. I think you can find a better way rather than try to break up what you have. Breaking it up may be a great exercise, though, just to say you did it and to practice the idea of small functions doing small things working together.

I would add 'upper or lower case the words' to his list.
Ooh ooh! Me too!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
// Find the most common word

#include <algorithm>
#include <cctype>
#include <ciso646>
#include <functional>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

using std::string;

// All this stuff is to make our histogram case-INsensitive

string tolower( const string& s )
{
  auto t{ s };
  for (char& c : t) c = (char)std::tolower( c );
  return t;
}

struct ci_string_equal_to
{
  bool operator () ( const string& a, const string& b ) const
  {
    return tolower( a ) == tolower( b );
  }
};

struct ci_string_hash : public std::hash <string>
{
  bool operator () ( const string& s ) const
  {
    return std::hash <string> ::operator () ( tolower( s ) );
  }
};

// The histogram

using histogram = std::unordered_map <string, std::size_t, ci_string_hash, ci_string_equal_to> ;

// Main program

int main()
{
  histogram h;
  string    s;
  while (getline( std::cin, s ))
  {
    // Replace ALL punctuation except ' (words like 'cause and ain't and trippin' are all valid)
    for (char& c : s) if (!std::isalnum( c ) and (c != '\'')) c = ' ';

    // Extract words
    std::istringstream iss( s );
    while (iss >> s)
    {
      // If word both begins and ends with ' then we can safely discount both apostrophes
      if ((s.front() == '\'') and (s.back() == '\''))
      {
        h[ s.substr( 1, s.size() - 2 ) ] += 1;
        continue;
      }

      // Otherwise we'll add both with and without leading and trailing ' to the histogram
      if (s.front() == '\'') h[ s.substr( 1 )               ] += 1;
      if (s.back()  == '\'') h[ s.substr( 0, s.size() - 1 ) ] += 1;

      // Count the word
      h[ s ] += 1;
    }
  }
  
  // Find the largest repeat count
  std::size_t n = 0;
  for (auto [_, count] : h)
    n = std::max( n, count );
  
  // Special case: no words
  if (n == 0)
  {
    std::cout << "There are no words.\n";
    return 0;
  }
  
  // Special case: no repeats
  if (n == 1)
  {
    std::cout << "There are " << h.size() << " distinct words (with no repeats).\n";
    return 0;
  }
  
  // Extract the word(s) with the largest repeat count
  std::vector <string> most_common_words;
  for (auto [word, count] : h)
    if (count == n)
      most_common_words.emplace_back( word );
    
  // Display the most common word(s)
  std::cout << "The most common word" << ((most_common_words.size() > 1) ? "s are:\n" : " is:\n");
  for (auto word : most_common_words)
    std::cout << " " << word << "\n";
}

:O)
Compound adjectives?
End-of-line hyphenation?

And it's a bit embarassing that "C++" becomes ... "C".
Last edited on
Meh. Compound adjectives are two (or more) words.
Screw up a word with EOL hyphenation it’s yer own fault.
“C++” isn’t a word. (Sorry.)

The only way to even come close to getting this right would be to have an international dictionary... Alas.

:O)
Ah, just realised that your code works for compound adjectives, since it replaces the hyphens by a blank and hence (correctly) separates the words.
Thanks guys,

great points, that is another flaw in my code words with the first letter capitalised are counted as separate words , this shouldn't be the case.

I've been told this on numerous occasions, to change my code to the newer versions of C++,

right now my code is very C++98 orientated, it's crazy how different C++11+ looks to C++98, it almost looks like a new language with the amount of added features, one may argue though that with all these additions the language may start to suffer from feature creep?
suffer from feature creep?

a little. a lot of C and C++98 and C++ pre-98 are there to allow you to compile old code without messing with it, or minimally. So the 'excess' old features are not intended to be used for new code, though people do because copy from internet which itself is over 30 now and has old info all over, + old books, + some schools are behind, etc.

If you stick to the newer stuff, you write a lot less code to do the same work. The downside is, there are a lot of tools that take a while to learn.

At the end of the day, the language is meant to make expressing your design/idea easier. The new stuff is doing a good job of that, at the cost of having to learn more things.
The problem is that C++ is steered by a committee. 'Nuff said.

...but Imma gonna say more anyway...

I recently read a not-too-old article about this very design problem:

    std::visit is everything wrong with modern C++
    https://bitbashing.io/std-visit.html

And it is absolutely right. Current C++ design goals (for the past N years now) have favored technical creep over end-user features. Hell, they’ve even gone out of their way to outlaw or obviate writing code that does common, useful things.

/end rant
Topic archived. No new replies allowed.