Map and word count help.
Mar 26, 2014 at 2:44am UTC
I am writing a program to search through a .txt file and count how many times each word occurs in the file. My code works, but it's not to my liking: it counts words, but it creates separate counts for e.g. Your, Your's, your, and your's.
I want to streamline the code so that it strictly finds the word I'm looking for.
Is there any way to accomplish this?
On a side note, once I have my count, I want to manipulate the .txt file further to remove words with high counts. I am thinking of exporting the count list to a .txt file, then running through it separately, referencing it back to the original text file and putting the words back in order (without the high-count words).
Any ideas?
Sorry for the rambling.
#include <iostream>
#include <string>
#include <fstream>
#include <cassert>
#include <map>
#include <vector>
#include <conio.h> // non-standard, Windows-only; provides _getch()
using namespace std;

int main() // main must return int; void main() is not standard C++
{
    ifstream fin;
    string filepath, filename, FILE, aWord, OFILE, ofilename;
    map<string, int> WordCount;
    filepath = "C:\\Users\\Connor\\Desktop\\JARVIS\\";
    filename = "VoiceData1.txt";
    ofilename = "VoiceFilter1.txt";
    FILE = (filepath + filename);
    OFILE = (filepath + ofilename);
    fin.open(FILE);
    if (!fin.is_open())
    {
        cout << "File import failed.";
        return 1; // nothing to count without the input file
    }
    ofstream fout;
    fout.open(OFILE);
    assert(fout.is_open());
    int count = 0; // total number of words read
    while (fin >> aWord)
    {
        // operator[] inserts a zero count the first time a word is seen
        ++WordCount[aWord];
        count++;
    }
    cout << count;
    _getch();
    // collect iterators to the high-count words, report the rest
    vector<map<string, int>::iterator> wutwut;
    for (auto word = WordCount.begin(); word != WordCount.end(); word++) {
        if (word->second >= 10)
        {
            wutwut.push_back(word);
        } else {
            cout << "The word " << word->first << " has count " << word->second << endl;
            fout << word->first << ' ' << word->second << endl; // separator keeps word and count apart
        }
    }
    // erasing one map element leaves iterators to the other elements valid
    for (size_t i = 0; i < wutwut.size(); i++) {
        count -= wutwut.at(i)->second; // drop every occurrence of the removed word from the total
        WordCount.erase(wutwut.at(i));
    }
    cout << count;
    _getch();
}
Mar 26, 2014 at 2:56am UTC
Mar 26, 2014 at 3:39am UTC
> It counts words but will create separate counts for e.g. Your, Your's, your, and your's.
Some kind of stemming has to be applied to reduce the word to its root form.
http://www.comp.lancs.ac.uk/computing/research/stemming/general/
http://en.wikipedia.org/wiki/Stemming
This is the general idea; needless to say, the suffix-stripping stemmer here is only a toy implementation.
#include <iostream>
#include <string>
#include <cctype>
#include <map>
#include <fstream>

bool ends_with( const std::string& word, const std::string& suffix )
{
    return word.size() > suffix.size() &&
           word.substr( word.size() - suffix.size() ) == suffix ;
}

// crude stemmer
// rules:
// if the word ends in 'ed', remove the 'ed'
// if the word ends in 'ing', remove the 'ing' // wing ???
// if the word ends in ''s', remove the ''s'
// add more rules as appropriate
std::string strip_suffix( const std::string& word )
{
    static const std::string suffixes[] = { "ed" , "ing" , "'s" } ;
    for ( const std::string& s : suffixes ) if ( ends_with( word, s ) )
        return word.substr( 0, word.size() - s.size() ) ;
    return word ;
}

// make all lower case
std::string to_lower( std::string word )
{
    // cast to unsigned char: std::tolower is undefined for negative char values
    for ( char & c : word ) c = static_cast<char>( std::tolower( static_cast<unsigned char>(c) ) ) ;
    return word ;
}

// make all lower case, strip suffix, remove punctuation
std::string sanitize( std::string word )
{
    std::string result ;
    word = strip_suffix( to_lower(word) ) ;
    // same unsigned char cast as above, for std::isalnum
    for ( char c : word ) if ( std::isalnum( static_cast<unsigned char>(c) ) ) result += c ;
    return result ;
}

int main()
{
    const std::string filepath = "C:\\Users\\Connor\\Desktop\\JARVIS\\" ;
    const std::string filename = "VoiceData1.txt" ;
    std::map< std::string, int > wc ;
    // reads this source file itself, so the test cases in the comment below are counted
    std::ifstream file( __FILE__ /*filepath + filename*/ ) ;
    std::string word ;
    while ( file >> word )
    {
        const std::string key = sanitize(word) ;
        if ( !key.empty() ) ++wc[key] ;
    }
    // std::cout << "frequency count:\n-----------\n" ;
    // for( const auto& p : wc ) std::cout << p.first << " - " << p.second << '\n' ;
    /*
    test cases:
    test Test tested testing
    case Case cases Case's
    Your Your's your your's
    walk Walking Walked
    sleep slept Sleeping sleeps
    */
    std::cout << "test - " << wc[ "test" ] << '\n'
              << "case - " << wc[ "case" ] << '\n'
              << "your - " << wc[ "your" ] << '\n'
              << "walk - " << wc[ "walk" ] << '\n'
              << "sleep - " << wc[ "sleep" ] << '\n'
              << "slept - " << wc[ "slept" ] << '\n'
              << "sleeps - " << wc[ "sleeps" ] << '\n' ;
}
http://coliru.stacked-crooked.com/a/94593828d262fa91
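The "wing ???" remark in the stemmer comments points at a real weakness: a bare suffix rule turns "wing" into "w". One possible mitigation, sketched here as an assumption rather than as part of the code above, is to strip a suffix only when a minimum number of characters would remain. The name strip_suffix_guarded and the default min_stem = 3 are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical variant of strip_suffix: a suffix is removed only if at least
// min_stem characters would be left. min_stem = 3 is an arbitrary choice.
std::string strip_suffix_guarded( const std::string& word, std::size_t min_stem = 3 )
{
    static const std::string suffixes[] = { "ed", "ing", "'s" } ;
    for ( const std::string& s : suffixes )
    {
        // the length check also guarantees that word.size() - s.size() cannot underflow
        const bool long_enough = word.size() >= s.size() + min_stem ;
        if ( long_enough && word.compare( word.size() - s.size(), s.size(), s ) == 0 )
            return word.substr( 0, word.size() - s.size() ) ;
    }
    return word ; // no rule applied
}
```

With this guard "wing" and "red" are left alone while "walking" and "tested" are still reduced. It is still a heuristic ("string" would become "str"), so a real stemmer from the links above remains the better option.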
Last edited on Mar 26, 2014 at 3:41am UTC
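On the side note about removing high-count words while keeping the text in order: exporting the counts to a second file is probably not needed. A minimal sketch, assuming the text can simply be read twice, is to count on the first pass and copy every word whose count is below a threshold on the second pass. The function names count_words, copy_without_frequent and remove_frequent_words are hypothetical, and the sketch does not preserve the original line breaks.

```cpp
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Pass 1: count every whitespace-separated word.
std::map<std::string, int> count_words( std::istream& in )
{
    std::map<std::string, int> wc ;
    std::string word ;
    while ( in >> word ) ++wc[word] ;
    return wc ;
}

// Pass 2: copy the words in their original order, skipping every word
// whose count is at or above the threshold.
void copy_without_frequent( std::istream& in, std::ostream& out,
                            const std::map<std::string, int>& wc, int threshold )
{
    std::string word ;
    bool first = true ;
    while ( in >> word )
    {
        const auto it = wc.find(word) ;
        if ( it != wc.end() && it->second >= threshold ) continue ;
        if ( !first ) out << ' ' ; // single space between kept words
        out << word ;
        first = false ;
    }
}

// Convenience wrapper for text that is already in memory.
std::string remove_frequent_words( const std::string& text, int threshold )
{
    std::istringstream first_pass(text), second_pass(text) ;
    const std::map<std::string, int> wc = count_words(first_pass) ;
    std::ostringstream out ;
    copy_without_frequent( second_pass, out, wc, threshold ) ;
    return out.str() ;
}
```

For example, remove_frequent_words( "a b a c a d", 3 ) yields "b c d". With files, the same std::ifstream can be reused for the second pass by calling file.clear() followed by file.seekg(0). To treat Your and your's as the same word, the counting and the lookup could both use sanitize(word) as the key while the original spelling is written out.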