I'm looking to write a program that reads in a text file. It needs to count each instance of every word, then print out the most used words in the file. I know how to read in the file but can't figure out how to read and count the words. Thanks for the help.
You could use a map (or similar structure.) The keys are of type string (i.e. the words) and the values are of type int (i.e. the count, how many of that word has been counted.)
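To make the map idea concrete, here is a minimal sketch. The helper name countWords is made up for illustration, and it assumes words are simply separated by whitespace (punctuation is not handled here):

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: tally the whitespace-separated words in `text`.
// Punctuation is NOT stripped here; "cat" and "cat," count as different keys.
std::map<std::string, int> countWords(const std::string& text) {
    std::map<std::string, int> counts;   // word -> number of appearances
    std::istringstream in(text);
    std::string word;
    while (in >> word)
        ++counts[word];   // operator[] inserts a missing key with value 0 first
    return counts;
}
```

Calling countWords("the cat and the hat") should give a map where "the" maps to 2 and the other three words map to 1.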
So this counts every instance of every word? Periods, commas, etc. should all be ignored. Also, words with punctuation inside them (e.g. "they're") should count as two words: "they" and "re".
That is part of the trick when you "read a word from the file".
Do you know of a way to read a word?
Do you know how to strip punctuation and non-alphabetic characters from a word?
The std::string class can help you, as can some simple <algorithm>s. At the very worst, you can just do it in a loop yourself.
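As one possible reading of that hint, the sketch below strips non-letters with the erase/remove_if idiom. The name stripNonAlpha is hypothetical, and the lambda assumes a C++11 compiler. Note that this glues "they're" into "theyre" rather than splitting it into two words, so it only covers the "strip" part of the question:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: return a copy of `word` with every non-letter removed.
std::string stripNonAlpha(std::string word) {
    // remove_if shifts the kept characters to the front and returns the new
    // logical end; erase then chops off the leftover tail.
    word.erase(std::remove_if(word.begin(), word.end(),
                              [](unsigned char c) { return !std::isalpha(c); }),
               word.end());
    return word;
}
```

The lambda takes unsigned char because passing a negative char value to std::isalpha is undefined behavior.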
Yeah, as Duoas said, I'm afraid you'll have to parse the file in the manner you want (need). My algorithm was meant to show you a way of counting words, not identifying them. For that I would go with something like:
...
string word;
while( file >> word ) { //operator>> reads up to the next whitespace character: ' ', '\n', '\t', '\r', ...
    ...
    if( /*word has no letters*/ )
        continue;
    //else, we have a word
}
...
Words cannot (by definition) span lines, so read your file one line at a time, then break each line down into words. I recommend a simple transformation to convert everything that isn't an alphanumeric character into a space, then just use a stringstream and the >> string operator to get words:
// Assumes f is an open ifstream and counts is a map<string, int>
string s;
while (getline( f, s ))
{
    // Convert the string to lowercase (so "Foo" and "foo" are not distinct words)
    transform( s.begin(), s.end(), s.begin(), []( unsigned char c ) { return (char)tolower( c ); } );
    // Replace everything that is not an alphanumeric with a space
    replace_if( s.begin(), s.end(), []( unsigned char c ) { return !isalnum( c ); }, ' ' );
    // Now we'll use the modified line to parse "words"
    istringstream ss( s );
    // For every "word", add/increment it to/in our associative array
    while (ss >> s)
        counts[ s ]++;
}
No, I guess you don't have to use a map; have you thought of an easier/better way? That is the way I would do it; it's straightforward. However, if another way makes sense to you, do it that way instead.
Think about how you would do the given task "by hand." Explain it to me (I want to follow how you think), then do your best to translate that into C++. I can help you fill in the gaps.
OK, well, I figure I'd open the file, then start reading it line by line. As I read the file I look at and count every word (without punctuation and spaces) and keep track of the count of each word. I figure I'll need some comparison algorithm to leave out punctuation and spaces, as well as to check for each unique word. I don't know if I have to store each count or not. Finally, once I have run through the whole file, I would print out the top 10 most used words.
How? Remember, you are no longer in C++ world. Pretend you have a sheet of paper in front of you. How do you "keep track of the counts of each word?"
If it helps, start with this example file and actually do the project as if you were the program. Then explain the steps:
As far as the laws of mathematics refer to reality, they are not certain,
and as far as they are certain, they do not refer to reality.
- Albert Einstein
Well, I'm making a list. Read the first word and put it in the list. Read the second one, and if it's the same word, make the total count for that word two; if not, that word becomes another unique word. Read the third word and repeat.
...or, another fairly doable solution would be to use vectors, if you're more familiar with them. (Notice that in your "by hand" algorithm you said you needed to add a new word to your list, i.e. you need something with a dynamic length.)
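Here is a rough sketch of that "by hand" list using a vector. The names WordCount and tally are made up for illustration; this mirrors the paper procedure (look the word up, bump its count, or append it) rather than being the fastest approach:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordCount;  // hypothetical name: (word, count)

// Record one occurrence of `word` in `list`, appending the word if it is new.
void tally(std::vector<WordCount>& list, const std::string& word) {
    for (std::size_t i = 0; i < list.size(); ++i) {
        if (list[i].first == word) {   // already on the list: bump its count
            ++list[i].second;
            return;
        }
    }
    list.push_back(WordCount(word, 1));  // first time we see this word
}
```

Each call scans the whole list, so this gets slow for large files; a map does the same lookup much faster, which is the main reason it was suggested first.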
At this point I'm learning it too slowly; could you help me with the code? I am already late in turning it in, and I have plenty more problems to deal with. I just need to get this one done now. I do understand the usefulness of maps, I just can't put it into code right now.
The loop over wordCounts near the end of main uses a binary search to insert each word into the list of "top words", so that the list stays in "highest count"-to-"lowest count" order. A word is only inserted if its position falls within the top ten, and if the insertion makes the list longer than ten, the last word is removed. No point in storing more than 10 "highest words."
Please let me know if you don't understand something...
#include <iostream>
#include <fstream>
#include <string>
#include <map>
#include <vector>
using namespace std;

string ask(string msg) {
    string ans;
    cout << msg;
    getline(cin, ans);
    return ans;
}

int main() {
    ifstream fin( ask("Enter file name: ").c_str() ); //open an input stream on the given file
    if( fin.fail() ) {
        cerr << "An error occurred trying to open a stream to the file!\n";
        return 1;
    }

    map<string, int> wordCounts; //word |-> number of appearances
    string entity; //an almost-word that may contain non-alphabetical characters
    while( fin >> entity ) {
        vector<string> words;
        //split up the entity if it contains non-alphabetical characters;
        //the end of the entity is treated as a delimiter so the last word is not dropped
        for( size_t i = 0, a = 0; i <= entity.length(); i++ ) {
            char c = (i < entity.length()) ? entity[i] : ' ';
            if( c < 'A' || (c > 'Z' && c < 'a') || c > 'z' ) {
                string word = entity.substr(a, i - a);
                a = i + 1;
                if( word.length() > 0 )
                    words.push_back(word);
            }
        }
        //tally all words in this entity
        for( size_t i = 0; i < words.size(); i++ )
            wordCounts[words[i]]++;
    }
    fin.close();

    vector<string> topWords; //a list of the most used words, highest count first
    const size_t MAX_WORDS = 10;
    for( map<string, int>::iterator iter = wordCounts.begin(); iter != wordCounts.end(); iter++ ) {
        //"iter->first" is a word from the file
        //"iter->second" is that word's count
        //binary search for the position that keeps topWords sorted by descending count
        size_t pos = 0, lim = topWords.size();
        while( pos < lim ) {
            size_t i = (pos + lim) / 2;
            int count = wordCounts[topWords[i]];
            if( iter->second > count )
                lim = i;
            else if( iter->second < count )
                pos = i + 1;
            else
                pos = lim = i;
        }
        if( pos < MAX_WORDS ) {
            topWords.insert( topWords.begin() + pos, iter->first );
            if( topWords.size() > MAX_WORDS )
                topWords.pop_back();
        }
    }

    //print the top 10 words
    for( size_t i = 0; i < topWords.size(); i++ )
        cout << '(' << wordCounts[topWords[i]] << ")\t" << topWords[i] << endl;
    return 0;
}
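If the hand-rolled binary search feels like too much, one alternative (not what the program above does) is to copy the map into a vector and let std::partial_sort pick out the top entries. This is only a sketch; the names WordCount, higherCount, and topN are made up for illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordCount;  // hypothetical name: (word, count)

// Order by descending count; break ties alphabetically so the result is stable.
static bool higherCount(const WordCount& a, const WordCount& b) {
    if (a.second != b.second)
        return a.second > b.second;
    return a.first < b.first;
}

// Hypothetical helper: return up to `n` entries of `counts`, most frequent first.
std::vector<WordCount> topN(const std::map<std::string, int>& counts, std::size_t n) {
    std::vector<WordCount> all(counts.begin(), counts.end());
    if (n > all.size())
        n = all.size();
    // Only the first n positions are fully sorted; the rest are left in
    // unspecified order and then discarded.
    std::partial_sort(all.begin(), all.begin() + n, all.end(), higherCount);
    all.erase(all.begin() + n, all.end());
    return all;
}
```

With the program above, something like topN(wordCounts, 10) would stand in for the whole topWords loop, and each returned pair carries both the word and its count.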