Hi my friends, I come with another challenge. I have a function that takes a paragraph and, line by line, counts the frequency of a given word. The problem with the function is that it finds ALL occurrences. For example, I search for the word "data", but the function also counts "datas", "primary-data", "primary_datas", etc., wherever the string "data" is present :/ I just need the word "data" itself to be counted.
#include <sstream>
#include <string>
using namespace std;

double calcAbsoluteFreq(string &paragraph, string &word) {
    istringstream tempStr(paragraph);
    string lineParagraph;
    double frecuAbs=0;
    while (getline( tempStr, lineParagraph)) {
        // Find the absolute frequency of the word across the line
        string::size_type word_pos( 0 );
        while ( word_pos!=string::npos ) {
            word_pos = lineParagraph.find(word, word_pos);
            if ( word_pos != string::npos ) {
                frecuAbs++;
                // do the next search after this word in the same line
                word_pos += word.length();
            }
        }
    }
    return frecuAbs;
}
Hi Muckle ewe, thanks.. I tried to do that, but without success.
while (getline( tempStr, lineParagraph)) {
    string::size_type word_pos( 0 );
    while ( word_pos!=string::npos ) {
        word_pos = lineParagraph.find(word, word_pos);
        if ( word_pos != string::npos ) {
            if ((lineParagraph.substr(word_pos-1)==SPACE) &&
                (lineParagraph.substr(word_pos+word.length())==SPACE)) { // I also tried with word_pos+word.length()+1.
                frecuAbs++;
            }
            word_pos += word.length();
        }
    }
}
I get a std::out_of_range there. Probably because word_pos += word.length(); attempts to add the length of the word again..
Any idea how to control that?
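One possible reading of the std::out_of_range, offered as a hedged guess rather than a confirmed diagnosis: when the match starts at position 0, word_pos-1 wraps around to a huge unsigned value and substr throws. In addition, substr called with a single argument returns the whole rest of the line, not one character, so the comparison with SPACE rarely holds. A minimal sketch of a bounds-checked test follows, assuming a plain space is the only delimiter; isSpaceDelimited is a hypothetical helper name, not something from the thread.

```cpp
#include <string>

// Hypothetical helper: true if the match of length len starting at pos
// in line is surrounded by spaces or by the line boundaries.
bool isSpaceDelimited(const std::string& line,
                      std::string::size_type pos,
                      std::string::size_type len) {
    // Only look at the previous character when there is one;
    // pos - 1 would wrap around for pos == 0.
    const bool startOk = (pos == 0) || (line[pos - 1] == ' ');

    // Only look at the following character when the match does not
    // already reach the end of the line.
    const std::string::size_type after = pos + len;
    const bool endOk = (after >= line.size()) || (line[after] == ' ');

    return startOk && endOk;
}
```

Inside the search loop this would be called as isSpaceDelimited(lineParagraph, word_pos, word.length()) before incrementing the counter.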
You may try that regex. For learning purpose it's certainly not the worst thing to do or...
but i have another problem
I suspected that while i wrote
I guess "/word/" is also a word?
if ( word_pos != string::npos ) {
    bool begin = (word_pos == 0);
    if(! begin)
    {
        static const string begin_indicator = " ,.-/\\\t"; // Maybe there's a more appropriate place?
        begin = (begin_indicator.find(word[word_pos - 1]) != string::npos); // if there's such a delimiter the word begins
    }
    bool end = ((word_pos + word.size()) >= lineParagraph.size());
    if(! end)
    {
        static const string begin_indicator = " ,.-/\\\t"; // Maybe there's a more appropriate place?
        begin = (begin_indicator.find(word[word_pos + word.size() + 1]) != string::npos); // if there's such a delimiter the word ends
    }
    if(begin && end)
        frecuAbs++;
}
Um, no. It's a matter for the debugger (I can really recommend this thing). And voila:
if ( word_pos != string::npos ) {
    bool begin = (word_pos == 0);
    if(! begin)
    {
        begin = (begin_indicator.find(lineParagraph[word_pos - 1]) != string::npos); // if there's such a delimiter the word begins
    }
    bool end = ((word_pos + word.size()) >= lineParagraph.size());
    if(! end)
    {
        // the character right after the word is at word_pos + word.size(); adding 1 would skip it
        end = (begin_indicator.find(lineParagraph[word_pos + word.size()]) != string::npos); // if there's such a delimiter the word ends
    }
    if(begin && end)
        frecuAbs++;
}
Thanks coder777, that works fine on a few lines or paragraphs where I know the delimiters " ,.-/\\\t". But it does not solve the problem in a big text, because when I try to find a word, let's say "data", the delimiters are out of control and depend on the encoding of the text (UTF-8, ISO......). If I use Notepad++, for example, to find "data" in the big text (18 MB), it has an option to match whole words only. It is very difficult to find the complete word "data" while ignoring all the delimiters it may have (depending on the encoding).
I know, for example, that the word "data" has a frequency of 93, "system" 158, "time" 134, etc.; I got these counts beforehand, manually, with Notepad++.
You can use this library: http://utfcpp.sourceforge.net/
to convert from UTF-8 to UTF-16LE (if you are on Windows?) and then search for words with the corresponding regex types for wchar_t, which carry a 'w' prefix (e.g. "wregex").
Hi morando, I'm on Linux, Ubuntu 10.4. I think converting from ISO to UTF-8 is no problem; I do this with "iconv". The only problem is counting the frequency of a complete word. Maybe the solution is regex; unfortunately I don't know how.
Thanks my friend..
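As a starting point for the regex route, here is a minimal sketch using std::regex from C++11 with the \b word-boundary assertion. It assumes a standard library with a working <regex> (for GCC that means 4.9 or later; Boost.Regex offers a similar interface on older compilers). countWholeWord is a hypothetical name, and the sketch assumes the searched word contains no regex metacharacters, since it is not escaped. How \b treats non-ASCII bytes depends on the locale, so treat this as an ASCII-oriented illustration only.

```cpp
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts whole-word occurrences of word in text.
// Assumption: word contains no regex metacharacters (it is not escaped).
long countWholeWord(const std::string& text, const std::string& word) {
    // \b matches at a boundary between a word character
    // (letter, digit, underscore) and a non-word character.
    const std::regex pattern("\\b" + word + "\\b");

    // Iterate over all non-overlapping matches and count them.
    const std::sregex_iterator first(text.begin(), text.end(), pattern);
    const std::sregex_iterator last;
    return std::distance(first, last);
}
```

Note that with \b a hyphen is a boundary, so "data" is still counted inside "primary-data", while "datas" and "primary_datas" are skipped because the underscore and letters are word characters.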
Hi coder777, thanks again!!! Yes, I will try that right now. Yesterday I was reading and learning about regex; if all this fails, maybe I will have to redo everything with regex... Well, nobody said it was easy :)
Well, finally the "isalpha" function solved the problem, not perfectly but it works. If I search for "system", for example, it is also part of "systems", because the character "s" at the end is alphabetical. Anyway, I do not care about this case. Probably the perfect solution is regex; with more time I will try to implement it that way. Here's how I did it:
if ( word_pos != string::npos ) {
    // Note: isalpha needs <cctype>; the cast to unsigned char avoids undefined
    // behavior when the byte is negative as a plain char (non-ASCII text).
    bool begin = (word_pos == 0);
    if(! begin) {
        begin = (!isalpha(static_cast<unsigned char>(lineParagraph[word_pos - 1]))); // if the character before the word is not alphabetical
        //cout << "This is in BEGIN->" << lineParagraph[word_pos - 1] << "<-" << endl;
    }
    bool end = ((word_pos + word.size()) >= lineParagraph.size());
    if(! end) {
        end = (!isalpha(static_cast<unsigned char>(lineParagraph[word_pos + word.size()]))); // if the character after the word is not alphabetical
        //cout << "This is in END->" << lineParagraph[word_pos + word.size()] << "<-" << endl;
    }
    if(begin && end) // If the characters before and after are not alphabetical, it is my word
        frecuAbs++;
}
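For anyone landing here later, the fragment above can be dropped back into the original function roughly as follows. This is only a sketch: the name calcAbsoluteFreqWholeWord, the const references and the empty-word guard are my own assumptions, not part of the code posted in the thread. It treats any non-alphabetical byte as a delimiter, so digits, underscores and the bytes of multi-byte UTF-8 characters may all count as word boundaries.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical name: counts occurrences of word that are not directly
// preceded or followed by an alphabetical character.
double calcAbsoluteFreqWholeWord(const std::string& paragraph,
                                 const std::string& word) {
    if (word.empty()) return 0;  // find("") would match forever at the same position

    std::istringstream tempStr(paragraph);
    std::string line;
    double frecuAbs = 0;

    while (std::getline(tempStr, line)) {
        std::string::size_type pos = 0;
        while ((pos = line.find(word, pos)) != std::string::npos) {
            const std::string::size_type after = pos + word.size();

            // Start of line, or the previous character is not a letter.
            const bool begin = (pos == 0) ||
                !std::isalpha(static_cast<unsigned char>(line[pos - 1]));
            // End of line, or the next character is not a letter.
            const bool end = (after >= line.size()) ||
                !std::isalpha(static_cast<unsigned char>(line[after]));

            if (begin && end) ++frecuAbs;
            pos = after;  // continue the search after this match
        }
    }
    return frecuAbs;
}
```

The casts to unsigned char matter for text that is not pure ASCII, because passing a negative char value to isalpha is undefined behavior.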