Function to find Absolute Frequency of words in paragraphs

Hi my friends, I come with another challenge. I have a function that takes a paragraph and, line by line, counts the frequency of a word "X". The problem I have with the function is that it finds ALL occurrences of the string. For example: I search for the word "data" but the function also counts "datas", "primary-data", "primary_datas", etc., wherever the string "data" is present :/ I just need the standalone word "data" to be counted.

double calcAbsoluteFreq(string &paragraph, string &word) {

	istringstream tempStr(paragraph);
	string lineParagraph;
	double frecuAbs=0;

	while (getline( tempStr, lineParagraph)) {
		
		// Find the absolute frequency of the word across the line
		string::size_type word_pos( 0 );
		while ( word_pos!=string::npos ) {
			word_pos = lineParagraph.find(word, word_pos);
			if ( word_pos != string::npos ) {
				frecuAbs++;
				// do the next search after this word in the same line
				word_pos += word.length();
			
			}
		}
	}

return frecuAbs;
}


Any suggestions?

Cheers!!!!...

Mac
Perhaps add in a condition to check that there is whitespace before the word data and that there is either whitespace or a full stop after it.
Hi Muckle ewe, thanks.. I tried to do that, but without success.

while (getline( tempStr, lineParagraph)) {
	string::size_type word_pos( 0 );
	while ( word_pos!=string::npos ) {
	word_pos = lineParagraph.find(word, word_pos);
		if ( word_pos != string::npos ) {
			if ((lineParagraph.substr(word_pos-1)==SPACE) &&
 (lineParagraph.substr(word_pos+word.length())==SPACE)) { // I also tried with word_pos+word.length()+1.
				frecuAbs++;
				}
			word_pos += word.length();
		}
	}
}


I get a std::out_of_range there. Probably because word_pos += word.length(); attempts to add the length of the word again.
Any idea how to control that?

Cheers
I'm really stuck with this function, somebody?? please..
You're using substr incorrectly. Look at this: http://www.cplusplus.com/reference/string/string/substr/
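
To make the substr point concrete, here is a small sketch. `substr(pos)` returns everything from `pos` to the end of the string, so comparing it with a single space is almost never true, and when the match is at position 0, `word_pos-1` wraps around to a huge unsigned value, which is a likely source of the std::out_of_range. The helper names `charBefore` and `tailFrom` are made up for illustration and are not from the thread.

```cpp
#include <string>

// Hypothetical helper (not from the thread): returns the single character in
// front of position pos as a one-character string, or an empty string when
// pos is 0, because nothing precedes the first character.
std::string charBefore(const std::string& line, std::string::size_type pos) {
    if (pos == 0 || pos > line.size()) {
        return "";                  // avoids pos - 1 wrapping around to a huge value
    }
    return line.substr(pos - 1, 1); // second argument is the length: exactly one character
}

// What the original condition actually compared against: the rest of the line.
std::string tailFrom(const std::string& line, std::string::size_type pos) {
    return line.substr(pos);        // throws std::out_of_range if pos > line.size()
}
```

So the one-argument call gives the whole tail of the line, while the two-argument call with a length of 1 gives the single neighbouring character that the condition was meant to test.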

If you really want a word that starts and ends with a space, why not do it like this:

word_pos = lineParagraph.find(' ' + word + ' ', word_pos);
Hi coder777, thanks!! Yes, you're right, that works!! But I have another problem :/

' '+word+' ' is OK, but if I have "data," or "data-" or "data/" or "data.", these are "data" too :/

Maybe before the loop I have to tokenize the line with delimiters=" ,.-/\\\t". Is there an easy way to do that?

Thanks!!...

Cheers
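
One way the tokenizing idea could look, as a minimal sketch rather than a definitive answer: split the text on the delimiter set from the post and count only the tokens that are exactly equal to the word. The function name `countTokenMatches` is made up for illustration, and '\n' has been added to the delimiters as an assumption so that a multi-line paragraph can be passed in one go.

```cpp
#include <string>

// Hypothetical function: counts the tokens in text that are exactly equal to
// word. Tokens are the runs of characters between any of the delimiters.
// The default delimiter set is the one from the post plus '\n' (an assumption).
int countTokenMatches(const std::string& text, const std::string& word,
                      const std::string& delims = " ,.-/\\\t\n") {
    int count = 0;
    // Skip leading delimiters to find the start of the first token.
    std::string::size_type start = text.find_first_not_of(delims);
    while (start != std::string::npos) {
        // The token ends at the next delimiter, or at the end of the text.
        const std::string::size_type end = text.find_first_of(delims, start);
        const std::string::size_type len =
            (end == std::string::npos) ? text.size() - start : end - start;
        if (text.compare(start, len, word) == 0) {
            ++count;
        }
        if (end == std::string::npos) {
            break;                  // that was the last token
        }
        start = text.find_first_not_of(delims, end);
    }
    return count;
}
```

With this, "datas" and "primary_datas" are skipped, while "primary-data" contributes one "data" because '-' is in the delimiter set.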
What about using regex?
Although I am not an expert, it could be something like:

#include <iostream>
#include <iterator>
#include <limits>
#include <regex>
#include <string>

int main()
{
    std::string strLine = "That's one small step for a man, one giant leap for mankind.";
    std::string word = "one";
    // \b matches a word boundary, so "one" is not matched inside a longer word.
    std::regex pattern("\\b" + word + "\\b");

    // regex_search only reports the first match, so iterate over all matches to count them.
    auto first = std::sregex_iterator(strLine.begin(), strLine.end(), pattern);
    auto last = std::sregex_iterator();
    auto freq = std::distance(first, last);

    if (freq > 0)
    {
        std::cout << "word \"" << word << "\" freq " << freq << std::endl;
    }

    std::cout << "Press enter to exit..." << std::endl;
    std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    return 0;
}


But you must write a function that checks whether "word" contains regex metacharacters and backslash-escapes them before passing it to the regex pattern.
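
A possible shape for such an escaping function, as a sketch only. It assumes the default ECMAScript grammar of std::regex, and the names `escapeRegex` and `countWholeWord` are made up for illustration. Note that `\b` treats the underscore as a word character, so "primary_data" would not count as containing the word "data", while "primary-data" would.

```cpp
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: puts a backslash in front of every character that has
// a special meaning in the (default) ECMAScript regex grammar.
std::string escapeRegex(const std::string& text) {
    static const std::string meta = "\\^$.|?*+()[]{}";
    std::string escaped;
    for (char c : text) {
        if (meta.find(c) != std::string::npos) {
            escaped += '\\';
        }
        escaped += c;
    }
    return escaped;
}

// Hypothetical function: counts the matches of word between \b word boundaries.
long countWholeWord(const std::string& text, const std::string& word) {
    const std::regex pattern("\\b" + escapeRegex(word) + "\\b");
    const auto first = std::sregex_iterator(text.begin(), text.end(), pattern);
    return static_cast<long>(std::distance(first, std::sregex_iterator()));
}
```

One caveat: `\b` only makes sense next to letters, digits or underscores, so a search word that starts or ends with punctuation would need different boundary handling.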
Hi morando, thanks. The truth is I do not know regex and never did anything with it, so it sounds a little like Chinese to me haha. Well, I'll try...

Cheers!!...
You may try that regex. For learning purposes it's certainly not the worst thing to do, or...

"but I have another problem"
I suspected that while I was writing.

I guess "/word/" is also a word?

if ( word_pos != string::npos ) {
	bool begin = (word_pos == 0);
	if(! begin)
	{
		static const string begin_indicator = " ,.-/\\\t"; // Maybe there's a more appropriate place?
		begin = (begin_indicator.find(word[word_pos - 1]) != string::npos); // if there's such a delimiter the word begins
	}
	bool end = ((word_pos + word.size()) >=  lineParagraph.size());
	if(! end)
	{
		static const string begin_indicator = " ,.-/\\\t";  // Maybe there's a more appropriate place?
		begin = (begin_indicator.find(word[word_pos + word.size() + 1]) != string::npos); // if there's such a delimiter the word ends
	}
	if(begin && end)
		frecuAbs++;
}
Hi coder777, yes, "/word/" is also a "word", but not "words" or "mywords". I tried your code, but begin && end is never true.

double calcAbsoluteFreq(string &paragraph, string &word) {


	istringstream tempStr(paragraph);
	string lineParagraph;
	double frecuAbs=0;
	//const string SPACE(" ");
	//const string NUL("\0");
	const string begin_indicator = " ,.-/\\\t"; // Maybe there's a more appropriate place?

	while (getline( tempStr, lineParagraph)) {
		
		string::size_type word_pos( 0 );
		while ( word_pos!=string::npos ) {
			//word_pos = lineParagraph.find(' ' + word+ ' ', word_pos);
			word_pos = lineParagraph.find(word, word_pos);
			 
			 if ( word_pos != string::npos ) {
				bool begin = (word_pos == 0);
				if(!begin) {
					//cout << "inside begin" << endl;
					
					begin = (begin_indicator.find(word[word_pos - 1]) != string::npos); // if there's such a delimiter the word begins
				}
				bool end = ((word_pos + word.size()) >=  lineParagraph.size());
				if(!end) {
					//cout << "inside end" << endl;
					
					begin = (begin_indicator.find(word[word_pos + word.size() + 1]) != string::npos); // if there's such a delimiter the word ends
				}

				if(begin && end) { // Never gets in here: begin=0 and end=0
					frecuAbs++;
					//cout << "inside frecuAbs++" << endl;
				}
			}
			 
			word_pos += word.length();

			 /*if (word_pos != string::npos) {

					frecuAbs++;
				//word_pos += word.length();
				word_pos = lineParagraph.find(word, word_pos + 1);
			} */
		}
	}

return frecuAbs;
}


Thanks
Yep, sorry, I didn't test it. On line 12 (the assignment inside the if(! end) block) it must be

end = (begin_indicator.find(word[word_pos + word.size() + 1]) != string::npos);

Hope it works now (maybe you can find a better name than 'begin_indicator', since it is used for both).

Do you understand that code?
Yes, I understand your code, coder777. I also thought that line was wrong and changed it, but it still does not work.

(begin && end) is always false, so frecuAbs++ is never reached.

Traumatic problem haha...


Thanks coder777... I will keep trying...
"Traumatic problem haha..."
Um, no. It's a matter for the debugger (I can really recommend this thing). And voila:
if ( word_pos != string::npos ) {
	bool begin = (word_pos == 0);
	if(! begin)
	{
		begin = (begin_indicator.find(lineParagraph[word_pos - 1]) != string::npos); // a delimiter in front means the word begins here
	}
	bool end = ((word_pos + word.size()) >= lineParagraph.size());
	if(! end)
	{
		end = (begin_indicator.find(lineParagraph[word_pos + word.size()]) != string::npos); // a delimiter right after means the word ends here
	}
	if(begin && end)
		frecuAbs++;
}
Thanks coder777, that works fine for a few lines or paragraphs where I know the delimiters " ,.-/\\\t". But it does not solve the problem for a big text, because when I try to find a word, let's say "data", the delimiters are out of control and depend on the encoding of the text (UTF-8, ISO...). If I use, for example, Notepad++ to find "data" in the big text (18Mb), it has an option to match whole words only. It's very difficult to find the complete word "data" while ignoring all the delimiters that may surround it (depending on the encoding).
I know, for example, that the word "data" has a frequency of 93, "system" 158, "time" 134, etc., because I counted them manually beforehand with Notepad++.

Any idea?

Thanks again!!!....

Cheers

"Any idea?"


You can use this library:
http://utfcpp.sourceforge.net/
to convert from UTF-8 to UTF-16LE (if you are on Windows?) and then search for words with the corresponding wchar_t regex types, i.e. the ones with the 'w' prefix such as "wregex".

regex == Regular expression:
http://en.wikipedia.org/wiki/Regular_expression
Hi morando, I'm on Linux, Ubuntu 10.4. I think converting from ISO to UTF-8 is no problem, I do this with "iconv"; the problem is only finding the frequency of a complete word. Maybe the solution is regex, unfortunately I don't know how.
Thanks my friend.
So, first check this:
http://www.cplusplus.com/reference/std/locale/
and
http://www.cplusplus.com/reference/std/locale/isalpha/

With the function above you can determine whether a character is alphabetic, depending on the character set.

So OK, if you don't want delimiters, you can check whether a word continues rather than whether it ends:

begin = (! isalpha(lineParagraph[word_pos - 1]));

So you define what a word is instead of what a word is not.

If you want "word_1" to count as a word, you need to write (! isalnum(lineParagraph[word_pos - 1])) && (lineParagraph[word_pos - 1] != '_'); and so on.
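
A sketch of how that check could be wrapped up with the locale-aware `std::isalnum` from `<locale>`. The helper names `isWordChar` and `isWholeWordAt` are made up for illustration, and the classic "C" locale is used as the default purely as an assumption. It still works on single bytes, so multi-byte UTF-8 characters are not handled.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: true if c can be part of a word, i.e. it is a letter
// or digit in the given locale, or an underscore.
bool isWordChar(char c, const std::locale& loc = std::locale::classic()) {
    return std::isalnum(c, loc) || c == '_';
}

// Hypothetical helper: true if the match of length len at position pos in
// line is not glued to another word character on either side.
bool isWholeWordAt(const std::string& line, std::string::size_type pos,
                   std::string::size_type len,
                   const std::locale& loc = std::locale::classic()) {
    // At the very start or end of the line there is no neighbour to check.
    const bool begin = (pos == 0) || !isWordChar(line[pos - 1], loc);
    const bool end = (pos + len >= line.size()) || !isWordChar(line[pos + len], loc);
    return begin && end;
}
```

Defining the word characters in one place means the begin and end checks cannot drift apart, and a different locale can be passed in without touching the search loop.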
Hi coder777, thanks again!!! Yes, I will try that right now. Yesterday I was reading and learning about regex; if all this fails, maybe I will have to redo everything with regex... Well, nobody said it was easy :)

Thanks very much!! coder777 and morando..

Hands On
Well, finally the "isalpha" function solved the problem, not perfectly but it works. For example, if I search for "system", it is also part of "systems", where the character "s" at the end is alphabetical. Anyway, I do not care about this case. Probably the perfect solution is regex; when I have more time I'll try to implement it with regex. Well, here's how I did it:

if ( word_pos != string::npos ) {
	bool begin = (word_pos == 0);
	if(! begin) {
		begin = (!isalpha(lineParagraph[word_pos - 1])); // if the character before the word is not alphabetical
		//cout << "This is in BEGIN->" << lineParagraph[word_pos - 1] << "<-" << endl;
	}
	bool end = ((word_pos + word.size()) >=  lineParagraph.size());
	if(! end) {
		end = (!isalpha(lineParagraph[word_pos + word.size()])); // if the character after the word is not alphabetical
		//cout << "This is in END->" << lineParagraph[word_pos + word.size()] << "<-" << endl;
	}	
	if(begin && end) // If the characters before and after are not alphabetical, it is my word
		frecuAbs++;
} 


Thanks very very much coder777 and morando...

Cheers!!

Mac
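
For reference, a sketch of how the snippet above might fit into the complete function. A few things here are changes of my own rather than the poster's code: the parameters are const references, the count is an int instead of a double, an empty word returns 0, and each character is cast to unsigned char before calling `isalpha`, since passing a negative char value is undefined behaviour (which matters for ISO or UTF-8 text).

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Counts the occurrences of word in paragraph that are not directly preceded
// or followed by an alphabetic character.
int calcAbsoluteFreq(const std::string& paragraph, const std::string& word) {
    if (word.empty()) {
        return 0;                   // find("") would match at every position
    }
    std::istringstream tempStr(paragraph);
    std::string lineParagraph;
    int frecuAbs = 0;

    while (std::getline(tempStr, lineParagraph)) {
        std::string::size_type word_pos = lineParagraph.find(word);
        while (word_pos != std::string::npos) {
            const std::string::size_type after = word_pos + word.size();
            // Start or end of the line counts as a boundary; otherwise the
            // neighbouring character must not be alphabetic.
            const bool begin = (word_pos == 0) ||
                !std::isalpha(static_cast<unsigned char>(lineParagraph[word_pos - 1]));
            const bool end = (after >= lineParagraph.size()) ||
                !std::isalpha(static_cast<unsigned char>(lineParagraph[after]));
            if (begin && end) {
                ++frecuAbs;
            }
            // Continue the search right after this occurrence in the same line.
            word_pos = lineParagraph.find(word, after);
        }
    }
    return frecuAbs;
}
```

Because only alphabetic neighbours block a match, "primary-data" and "/data/" each count as one "data", while "datas" and "metadata" do not.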