Map and word count help.

I am writing a program that searches through a .txt file and counts how many times each word occurs. My code works, but it's not to my liking: it creates separate counts for variants of the same word, e.g. Your, Your's, your, and your's.
I want to streamline the code so that it counts only the word I'm looking for.
Is there any way to accomplish this?

On a side note, once I have my counts, I want to manipulate the .txt file further to remove words with high counts. I am thinking of exporting the count list to a .txt file, then running through it separately, referencing it back to the original text file and putting the words back in order (without the high-count words).
Any ideas?

Sorry for the rambling.

#include <iostream>
#include <string>
#include <iomanip>
#include <fstream>
#include <cassert> 
#include <map>
#include <vector>
#include <conio.h>

using namespace std;

int main()
{
	ifstream fin;
	string filepath, filename, FILE, aWord, OFILE, ofilename;
	map<string,int> WordCount;


	filepath = "C:\\Users\\Connor\\Desktop\\JARVIS\\";
	filename = "VoiceData1.txt";
	ofilename = "VoiceFilter1.txt";
	FILE = (filepath + filename);
	OFILE = (filepath + ofilename);

	fin.open(FILE);
	if (!fin.is_open())
	{
		cout << "File import failed.";
	}
	ofstream fout;
	fout.open(OFILE);
	assert(fout.is_open());

	int count=0;
	while (fin>>(aWord))
	{
		//iterator<map<string,int>,> a;
		if(WordCount.find(aWord) == WordCount.end()){
			WordCount.insert(WordCount.end(), pair<string,int>(aWord,1));
		}else{
			WordCount.at(aWord) += 1;
		}
		count++;

	}
	
	cout << count;
	_getch();
	vector<map<string,int>::iterator> wutwut;
	for(auto word = WordCount.begin(); word != WordCount.end(); word++){
		if(word->second >= 10)
		{
			wutwut.push_back(word);
		}else{
			cout << "The word " << word->first << " has count " << word->second << endl;
			fout << word->first << ' ' << word->second << endl; // separate the word from its count
		}
	
	}
	for(size_t i = 0; i < wutwut.size(); i++){
		WordCount.erase(wutwut.at(i));
		count -=1;
	}
	cout << count;
	_getch();

}
For starters, you can try converting each string you read from the file to lowercase.

http://stackoverflow.com/a/313990/1959975
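As a rough sketch of that suggestion (the helper name to_lower_copy is made up for illustration and does not come from the linked answer), each word could be lower-cased before it is used as a map key, so that "Your" and "your" end up sharing one count:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns a lower-cased copy of the word.
std::string to_lower_copy(std::string word)
{
    // Take each character as unsigned char: passing a negative char
    // value to std::tolower is undefined behaviour.
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return word;
}
```

In the read loop of the original program this would amount to something like ++WordCount[to_lower_copy(aWord)]. Note that it only folds case; Your's and your's would still be counted separately from your.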
> It counts words but will create separate counts for eg.) Your, Your's, your, and your's, etc...

Some kind of stemming has to be applied to reduce the word to its root form.
http://www.comp.lancs.ac.uk/computing/research/stemming/general/
http://en.wikipedia.org/wiki/Stemming

This is the general idea; needless to say, the suffix-stripping stemmer here is only a toy implementation.

#include <iostream>
#include <string>
#include <cctype>
#include <map>
#include <fstream>

bool ends_with( const std::string& word, const std::string& suffix )
{
    return word.size() > suffix.size() &&
            word.substr( word.size() - suffix.size() ) == suffix ;
}

// crude stemmer
// rules:
// if the word ends in 'ed', remove the 'ed'
// if the word ends in 'ing', remove the 'ing' // wing ???
// if the word ends in ''s', remove the ''s'
// add more rules as appropriate
std::string strip_suffix( const std::string& word )
{
    static const std::string suffixes[] = { "ed", "ing", "'s" } ;

    for( const std::string& s : suffixes ) if( ends_with( word, s ) )
        return word.substr( 0, word.size() - s.size() ) ;

    return word ;
}

// make all lower case
std::string to_lower( std::string word )
{
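    // cast to unsigned char: std::tolower is undefined for negative char values
    for( char& c : word ) c = static_cast<char>( std::tolower( static_cast<unsigned char>(c) ) ) ;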
    return word ;
}

// make all lower case, strip suffix, remove punctuation
std::string sanitize( std::string word )
{
    std::string result ;
    word = strip_suffix( to_lower(word) ) ;
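    // cast to unsigned char: std::isalnum is undefined for negative char values
    for( char c : word ) if( std::isalnum( static_cast<unsigned char>(c) ) ) result += c ;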
    return result ;
}

int main()
{
    const std::string filepath = "C:\\Users\\Connor\\Desktop\\JARVIS\\";
    const std::string filename = "VoiceData1.txt";
    std::map< std::string, int > wc ;

    std::ifstream file( __FILE__ /*filepath + filename*/ ) ;
    std::string word ;
    while( file >> word )
    {
        const std::string key = sanitize(word) ;
        if( !key.empty() ) ++wc[key] ;
    }

    // std::cout << "frequency count:\n-----------\n" ;
    // for( const auto& p : wc ) std::cout << p.first << " - " << p.second << '\n' ;

    /*
	test cases:
	test Test tested testing
	case Case cases Case's
	Your Your's your your's
	walk Walking Walked
	sleep slept Sleeping sleeps
    */

    std::cout << "test - " << wc[ "test" ] << '\n'
              << "case - " << wc[ "case" ] << '\n'
              << "your - " << wc[ "your" ] << '\n'
              << "walk - " << wc[ "walk" ] << '\n'
              << "sleep - " << wc[ "sleep" ] << '\n'
              << "slept - " << wc[ "slept" ] << '\n'
              << "sleeps - " << wc[ "sleeps" ] << '\n' ;
}

http://coliru.stacked-crooked.com/a/94593828d262fa91
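
As for the side question about removing high-count words while keeping the rest in order, exporting the counts to a second file may not be necessary. One possible approach, sketched here under assumptions, is to make two passes over the same text: count first, then copy the words back out and skip the frequent ones. The function names count_words and drop_frequent_words and the threshold parameter are invented for this illustration; the original program used a cutoff of 10.

```cpp
#include <map>
#include <sstream>
#include <string>

// First pass: count every whitespace-separated word in the text.
std::map<std::string, int> count_words(const std::string& text)
{
    std::map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) ++counts[word];
    return counts;
}

// Second pass: copy the words back out in their original order,
// skipping any word whose count is at or above the threshold.
// Runs of whitespace (including line breaks) collapse to single spaces.
std::string drop_frequent_words(const std::string& text, int threshold)
{
    const std::map<std::string, int> counts = count_words(text);
    std::istringstream in(text);
    std::string word, result;
    while (in >> word)
    {
        if (counts.at(word) >= threshold) continue;
        if (!result.empty()) result += ' ';
        result += word;
    }
    return result;
}
```

The file contents can be read into a single string first (for example through a std::stringstream) and passed to drop_frequent_words. This sketch counts the raw words; to treat Your and your's as the same word, the key would have to be normalized in both passes, for instance with a function like sanitize above.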