Reading in a file and counting

Sep 7, 2011 at 5:46am
I'm looking to write a program that reads in a text file. It needs to count the instances of each word, then print out the most used words in the file. I know how to read in the file but can't figure out how to read and count the words. Thanks for the help.
Sep 7, 2011 at 6:10am
You could use a map (or a similar structure). The keys are of type string (the words) and the values are of type int (how many times each word has been counted).

Try something like this:
#include <string>
#include <map>
...

int main() {
	map<string, int> counts;
	//open file stream
	...
	//LOOP (through file)
	{
		//read a word from the file
		counts[word]++; //increase that word's count
	}
	string maxWord; //the most used word found so far (a key of the map, so a string)
	...
	//LOOP (through the map "counts")
	{
		if( counts[word] > counts[maxWord] )
			maxWord = word;
	}
	...
}

For more info: http://www.cplusplus.com/reference/stl/map
Sep 7, 2011 at 11:17pm
So this counts every instance of every word? Periods, commas, etc. should all be ignored. Also, a word with punctuation inside it, e.g. "they're", should count as two words: "they" and "re".
Sep 7, 2011 at 11:29pm
Also, how would I read each word from the file without counting spaces and punctuation?
Sep 8, 2011 at 1:24am
That is part of the trick when you "read a word from the file".
Do you know of a way to read a word?
Do you know how to strip punctuation and non-alphabetic characters from a word?

The std::string class can help you, as can some simple <algorithm>s. At the very worst, you can just do it in a loop yourself.

Don't forget to #include <cctype>.

Good luck!
Sep 8, 2011 at 8:54pm
Yeah, as Duoas said, I'm afraid you'll have to parse the file in the manner you want (need). My algorithm was meant to show you a way of counting words, not identifying them. For that I would go with something like:
...
while( !file.eof() ) {
	...
	string word = //read until a whitespace (or possibly control) character: ' ', '\n', '\t', '\r', ...
	if( /*word has no letters*/ )
		continue;
	//else, we have a word
}
...
Sep 8, 2011 at 11:39pm
Don't loop on file.eof().

Words cannot (by definition) span lines, so read your file one line at a time, then break each line down into words. I recommend a simple transformation to convert everything that isn't an alphanumeric character into a space, then just use a stringstream and the >> string operator to get words:

  string s;
  while (getline( f, s ))
    {
    // Convert the string to lowercase (so "Foo" and "foo" are not distinct words)
    transform( s.begin(), s.end(), s.begin(),
               []( unsigned char c ) { return static_cast <char> ( tolower( c ) ); } );
    // Replace everything that is not an alphanumeric with a space
    replace_if( s.begin(), s.end(),
                []( unsigned char c ) { return !isalnum( c ); }, ' ' );
    // Now we'll use the modified line to parse "words"
    istringstream ss( s );
    // For every "word", add/increment it to/in our associative array
    while (ss >> s)
      counts[ s ]++;
    }
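The fragment above relies on an already open stream `f` and the `counts` map from earlier in the thread. Here is a self-contained sketch of the same idea, written as a plain loop instead of the <algorithm> calls. The function name `count_clean_words` is invented for this example, and it assumes plain single-byte (ASCII-style) text:

```cpp
#include <cctype>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Lowercase each line, turn every non-alphanumeric character into a space,
// then count the whitespace-separated words that remain.
std::map<std::string, int> count_clean_words(std::istream& in) {
    std::map<std::string, int> counts;
    std::string line;
    while (std::getline(in, line)) {
        for (char& c : line) {
            // The <cctype> functions expect a value representable as unsigned char.
            unsigned char u = static_cast<unsigned char>(c);
            c = std::isalnum(u) ? static_cast<char>(std::tolower(u)) : ' ';
        }
        std::istringstream ss(line);
        std::string word;
        while (ss >> word)
            ++counts[word];
    }
    return counts;
}
```

With this approach "they're" ends up as the two words "they" and "re", which matches what was asked for earlier in the thread.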
Sep 13, 2011 at 7:27pm
OK, so can this be done without a map structure, or do I have to use a map?
Sep 13, 2011 at 8:13pm
No, I guess you don't have to use a map; have you thought of an easier/better way? That is the way I would do it; it's straightforward. However, if another way makes sense to you, do it that way instead.
Sep 13, 2011 at 9:38pm
Well no, I'm just rusty and can't figure out how to even start. Is there a more basic way to do it?
Sep 14, 2011 at 3:58am
Think about how you would do the given task "by hand." Explain it to me (I want to follow how you think), then do your best to translate that into C++, I can help you fill in the gaps.
Sep 14, 2011 at 4:22am
OK, well, I figure I'd open the file, then start reading it line by line. As I read the file I look at and count every word (without punctuation and spaces) and keep track of the count of each word. I figure I'll have to do some comparing algorithm to leave out punctuation and spaces as well as to check each unique word. I don't know if I have to store each count or not. Finally, once I have run through the whole file, I would print out the top 10 most used words.
Last edited on Sep 14, 2011 at 4:23am
Sep 14, 2011 at 5:32am
chanandler wrote:
...[I'd] keep track of the counts of each word.
How? Remember, you are no longer in C++ world. Pretend you have a sheet of paper in front of you. How do you "keep track of the counts of each word?"

If it helps, start with this example file and actually do the project as if you were the program. Then explain the steps:
As far as the laws of mathematics refer to reality, they are not certain,
and as far as they are certain, they do not refer to reality.
	- Albert Einstein
Sep 14, 2011 at 5:53am
Well, I'm making a list. Read the first word and put it in the list. Read the second one, and if it's the same word, make the total count for that word two; if not, then that word is another unique word. Read the third word and repeat.
Sep 14, 2011 at 7:55pm
That's a pretty good algorithm.

How will you store your list of (word,count)s?
How will you check to see if a word is in the list?
How will you add a (word,1) to the list?

These are the steps you will take to make the program work.
Sep 14, 2011 at 8:04pm
Well, to store it I guess I need some sort of structure. I just want a basic structure.

To check the word I would use a boolean word-compare algorithm or something.

This is where I'm having problems. I guess it just depends on what structure I end up using.
Sep 15, 2011 at 3:52am
Now read up on maps: http://www.cplusplus.com/reference/stl/map
;)

...or, another fairly doable solution would be to use vectors, if you're more familiar with them. (If you noticed, in your "by hand" algorithm you said you needed to add a new word to your list, i.e. you need something with a dynamic length.)
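For illustration only, here is roughly what the "by hand" list could look like with a vector. This is a sketch, not the only way to do it; the names `WordCount` and `tally` are made up:

```cpp
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordCount;  // (word, count)

// Add one occurrence of `word` to the list, the way you would on paper:
// look for the word; if it is already there bump its count, otherwise append (word, 1).
void tally(std::vector<WordCount>& list, const std::string& word) {
    for (WordCount& entry : list) {
        if (entry.first == word) {
            ++entry.second;
            return;
        }
    }
    list.push_back(WordCount(word, 1));
}
```

The search walks the whole list for every word, so it gets slow on large files; a map does the same lookup faster, but the result is the same.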
Last edited on Sep 15, 2011 at 4:51am
Sep 15, 2011 at 4:49pm
At this point I'm learning it too slowly. Could you help me with the code? I am already late in turning it in, and I have plenty more problems to deal with, so I just need to get this one done now. I do understand the usefulness of maps; I just can't put it into code right now.
Sep 15, 2011 at 5:45pm
This is what I have so far. It doesn't remove spaces and symbols yet, and it prints out everything, not just the top 10.

#include <iostream>
#include <fstream>
#include <map>
#include <set>
#include <string>
using namespace std;

int main() {

    map<string, int> wordCounts; // Map of words and their frequencies
    string word;           // Used to hold input word.
    
    while (cin >> word) {
        wordCounts[word]++; 
    }

    //-- Write count/word.  Iterator returns key/value pair.
    map<string, int>::const_iterator iter;
    for (iter=wordCounts.begin(); iter != wordCounts.end(); ++iter) {
        cout << iter->second << " " << iter->first << endl;
    }

	return 0;
}//end main
Sep 15, 2011 at 9:06pm
The loop over wordCounts near the end of main uses a binary search to insert each word into the list of "top words", so that the list stays in highest-count-to-lowest-count order. A word is only inserted if its position falls within the top ten, and after an insertion the last word is dropped if the list has grown past ten. There is no point in storing more than 10 "highest words."
Please let me know if you don't understand something...
#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <vector>

using namespace std;

string ask(string msg) {
	string ans;
	cout << msg;
	getline(cin, ans);
	return ans;
}

int main() {

	ifstream fin( ask("Enter file name: ").c_str() ); //open an input stream on the given file
	if( fin.fail() ) {
		cerr << "An error occurred trying to open a stream to the file!\n";
		return 1;
	}
	
	map<string, int> wordCounts; //word |-> number of appearances
	string entity; //an almost-word that may contain non-alphabetical characters
	while( fin >> entity ) {
		vector<string> words;
		//split up the entity if it contains non-alphabetical characters
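		for( size_t i = 0, a = 0; i <= entity.length(); i++ ) {
			//the end of the entity also ends a word, so the last (or only) word is not dropped
			char c = ( i < entity.length() ) ? entity[i] : ' ';
			if( c < 'A' || (c > 'Z' && c < 'a') || c > 'z' ) {
				string word = entity.substr(a, i - a);
				a = i + 1;
				if( word.length() > 0 )
					words.push_back(word);
			}
		}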
		//tally all words in this entity
		for( int i = 0; i < words.size(); i++ )
			wordCounts[words[i]]++;
	}

	fin.close();
	
	vector<string> topWords; //a list of the most used words
	const size_t MAX_WORDS = 10;
	for( map<string, int>::iterator iter = wordCounts.begin(); iter != wordCounts.end(); iter++ ) {
		//"iter->first" is a word from the file
		//"iter->second" is that word's count
		int pos = 0, lim = topWords.size();
		while( pos < lim ) {
			int i = (pos + lim) / 2;
			int count = wordCounts[topWords[i]];
			if( iter->second > count )
				lim = i;
			else if( iter->second < count )
				pos = i + 1;
			else
				pos = lim = i;
		}
		if( pos < MAX_WORDS ) {
			topWords.insert( topWords.begin() + pos, iter->first );
			if( topWords.size() > MAX_WORDS )
				topWords.pop_back();
		}
	}
	
	//print the top 10 words
	for( int i = 0; i < topWords.size(); i++ )
		cout << '(' << wordCounts[topWords[i]] << ")\t" << topWords[i] << endl;
	
	return 0;
}
Use this code only as a template and tool to help complete your project. Please do not call it your own.
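If the binary-search insertion is hard to follow, a simpler (though less economical) alternative is to copy the whole map into a vector, sort it by count, and keep the first few entries. This is only a sketch of that swapped-in technique, and the function name `top_words` is hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Return up to `n` (word, count) pairs, highest count first.
// Words with equal counts stay in alphabetical order because the sort is stable.
std::vector<std::pair<std::string, int>> top_words(const std::map<std::string, int>& counts,
                                                   std::size_t n) {
    std::vector<std::pair<std::string, int>> sorted(counts.begin(), counts.end());
    std::stable_sort(sorted.begin(), sorted.end(),
        [](const std::pair<std::string, int>& a, const std::pair<std::string, int>& b) {
            return a.second > b.second;  // larger counts come first
        });
    if (sorted.size() > n)
        sorted.resize(n);
    return sorted;
}
```

Calling it as top_words(wordCounts, 10) would give the ten entries to print; it sorts every word rather than tracking only ten, which is fine for files of ordinary size.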