[Problem] Extracting lines of text from file

Hello Community,

I really need help with the following problem, which, try as I might, I can't solve. My current program is supposed to find and display the number of times a word occurs in a text file, based on user input. So, if the word is 'the', it should display the exact count.

The other part is to extract lines containing this word. If the word is 'and', with the following condition:

if (tmp.find(" " + sWord + " ") != string::npos)
{
	cout << text + "\n";
}


It matches the word only when there is white-space directly before and after it, and only the sentences containing such a match are displayed.

If, however, the word is, say, "Players", and it is at the beginning of a sentence, then with the above condition this sentence is never displayed. If my condition is like this:

if (tmp.find(sWord + " ") != string::npos)
{
	cout << text + "\n";
}


Then, if I search for 'and', it also displays a sentence containing 'understand', which is not what I want. This is what I need help with. Also, if there is a better way to go about it, meaning without opening the file twice (the uppercase-to-lowercase conversion will become one function for both tasks), that gives a correct result for both the word count AND the extraction of sentences containing the word, I'm all ears. :))

Lastly here is my code so far [please scroll down for the test-text]:

#include <algorithm>
#include <string>
#include <fstream>
#include <iostream>
#include <iterator>

using std::string;
using std::fstream;
using std::cin;
using std::flush;
using std::cout;
using std::endl;
using std::ios;

int main()
{
	string sWord = " ";
	int cnt = 0;
	string tmp = " ";
	fstream txt("short.txt", ios::in);

	string delimList = ":,!.?)(\"";

	string text = "";
	string wList = "";

	cout << "enter sW: ";
	cin >> sWord;

	if (!txt.fail())
	{
		while (txt >> tmp)
		{
			wList = tmp + "\n";
			for (unsigned int i = 0; i < delimList.size(); i++)
			{
				wList.erase(remove(wList.begin(), wList.end(), delimList.at(i)), wList.end());
				tmp.erase(remove(tmp.begin(), tmp.end(), delimList.at(i)), tmp.end());
			}

			for (unsigned int i = 0; i < tmp.size(); i++)
			{
				tmp[i] = tolower(tmp[i]);
			}
		
		size_t found = tmp.find(sWord);

			if (found !=string::npos)
			{
				if (tmp == sWord)
				{
					cnt++;
				}			
			}
		}
	}
	txt.close();

	txt.open("short.txt", ios::in);
	if (!txt.fail())
	{
		while (getline(txt, tmp))
		{
			text = tmp;

			for (unsigned int i = 0; i < tmp.size(); i++)
			{
				tmp[i] = tolower(tmp[i]);
			}
	
			// Show the original line if the lowercased copy contains the word followed by a space.
			if (tmp.find(sWord + " ") != string::npos)
			{
				cout << text + "\n";
			}
		}
	}
	txt.close();

	cout << cnt;
	cin.get();
	cin.ignore();
	return 0;
}


test text:

Players of a video game do not look at the underlying code but at dynamically
generated audiovisual and tactile results based on it. They look at the mediated
plane and see the performance of the code. The code itself stays hidden behind
elaborate virtual worlds and interfaces, and the only time one might encounter it is when an error crashes the program and a debug message points to certain lines of broken code. Players do not have to understand the logic of the code but of the mediated game world. "Beyond the fantasy, there are always the rules,"argues Turkle (1984, 83), but from the vantage point of a player's experience, it is the fictional plane where the player makes sense of the game, the space of personal interpretation and assessment. Overwhelmingly, most game players stay on the level of the fantasy world when playing a video game. For example, they would not realize the fundamental logical difference between a version of Pac-Man (Iwatani 1980) running on the original Z80 microprocessor of the arcade board or a Java or C++ version of the game emulated on a Pentium processor under Windows.

(Michael Nitsche - Video Game Spaces Image, Play, and Structure in 3D Game Worlds)
(Copyright: 2008 Massachusetts Institute of Technology)
For the first part of the problem you can try this:
#include <iostream>
#include <fstream>
#include <string>
#include <cctype>

using namespace std;

string stripNonAlpha(string& input) // non-const for converting to uppercase
{
  string output;
  output.reserve(input.size());

  for (char ch: input)
    if (isalpha(ch))
      output.push_back(ch);

  return output;
}

int countWord(const string& filename, const string& needle)
{
  int count = 0;
  ifstream src(filename);
  if (!src)
    return -1; // better to throw an exception but to keep it simple....

  string input;
  while (src >> input)
  {
    string word = stripNonAlpha(input);
    if (word == needle)
      count++;
  }

  return count;
}

int main()
{
  const string filename = "test.txt";
  cout << "Enter search word: ";
  string input;
  cin >> input;
  int numFound = countWord(filename, input);
  if (numFound == -1) // countWord returns -1 when the file cannot be opened
  {
    cout << "Could not open " << filename << "\n\n";
  }
  else
  {
    cout << "The word " << input << " was found " << numFound << " times\n\n";
  }
  return 0;
}

OUTPUT

Enter search word: the
The word the was found 22 times


If you want the search to be case-insensitive, you need to convert both needle and input to lowercase (or both to uppercase) in countWord.
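
A minimal sketch of that case-insensitive variant, assuming the same strip-then-compare approach as the code above. The names toLowerAlpha and countWordIgnoreCase are made up for this illustration, and the function takes any std::istream (an assumption, so that it can be tried on a string as well as on an ifstream):

```cpp
#include <cctype>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying the function on a string
#include <string>

// Returns a lowercase copy of s that keeps only alphabetic characters.
// (Hypothetical helper: stripNonAlpha plus a case conversion in one pass.)
std::string toLowerAlpha(const std::string& s)
{
    std::string out;
    out.reserve(s.size());
    for (unsigned char ch : s)          // unsigned char keeps the <cctype> calls well-defined
        if (std::isalpha(ch))
            out.push_back(static_cast<char>(std::tolower(ch)));
    return out;
}

// Counts the whitespace-separated tokens in src that equal needle,
// ignoring case and any non-alphabetic characters attached to a token.
int countWordIgnoreCase(std::istream& src, const std::string& needle)
{
    const std::string wanted = toLowerAlpha(needle);
    int count = 0;
    std::string token;
    while (src >> token)
        if (toLowerAlpha(token) == wanted)
            ++count;
    return count;
}
```

With a file this could be called as std::ifstream src("test.txt"); countWordIgnoreCase(src, "The");. Like the original, it drops apostrophes, so "player's" is counted as "players".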

Hello Thomas!

Thank you very much for your reply and code example! What I'd still need help with is the sentence extraction part. :)

Maybe it would help if I describe how I envision things to work (and just don't know how to pull it off).

What I need most is a way to tell C++: hey, C++, if a word is at the beginning of a line and it is a match, give me that line. Or: hey, C++, if that word is part of another word, I do not want it, leave it out!

My general idea, since my version doesn't deliver the expected result, is to work with indices or positions to extract the sentences whenever that particular word matches. But how? Any ideas, hints, or pseudo-code that point me in the right direction are more than welcome. :-)
Here's an idea.
Read the file line by line.
Create a 'cleaned up' copy of the line. Something like this:
original:
Players do not have to understand the logic of the code but of the mediated game world.
 "Beyond the fantasy, there are always the rules,"argues Turkle (1984, 83), but from the vantage 

cleaned:
Players do not have to understand the logic of the code but of the mediated game world 
  Beyond the fantasy  there are always the rules  argues Turkle  1984  83   but from the vantage 

(I inserted a line break, that was all one line in original post).

Then use a stringstream to read one word at a time from that modified line. Compare the lowercase version of that word with the lowercase search word, in order to both count occurrences and identify lines of interest.

Use functions to carry out tasks such as cleaning up the line, or converting the word to lowercase.

Possibly you might consider storing the lines where the word was found (the original, unmodified version) in say a vector or some other temporary store. Then the output can be produced at the end from that temporary store.
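
The steps above could be sketched roughly like this. It is only one way to do it: SearchResult, cleanLine and searchWord are illustrative names, and it assumes the search word is a single word made of letters and digits.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Result of one pass over the input: total occurrences plus the original lines.
struct SearchResult
{
    int count = 0;
    std::vector<std::string> lines;
};

// Returns a lowercase copy of line with every non-alphanumeric character
// replaced by a space, so punctuation can no longer stick to a word.
std::string cleanLine(const std::string& line)
{
    std::string cleaned = line;
    for (char& ch : cleaned)
    {
        const unsigned char uc = static_cast<unsigned char>(ch);
        ch = std::isalnum(uc) ? static_cast<char>(std::tolower(uc)) : ' ';
    }
    return cleaned;
}

// Reads src line by line, counts whole-word matches of word (case-insensitive)
// and stores each original line that contains at least one match.
SearchResult searchWord(std::istream& src, const std::string& word)
{
    const std::string wanted = cleanLine(word);
    SearchResult result;
    std::string line;
    while (std::getline(src, line))
    {
        std::istringstream words(cleanLine(line));
        std::string token;
        bool lineMatches = false;
        while (words >> token)
        {
            if (token == wanted)
            {
                ++result.count;
                lineMatches = true;
            }
        }
        if (lineMatches)
            result.lines.push_back(line);   // keep the unmodified line for output
    }
    return result;
}
```

The file is read only once, and both the count and the matching lines come out of the same loop. Because punctuation becomes a space, "player's" is seen as "player" and "s", and "Pac-Man" as "pac" and "man"; whether that is wanted depends on the task.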
Do you have restrictions? Are you allowed to use regex?
Chervil, thank you for the input! And, YAY! Brilliant idea(s)!

Stringstream: I would use it, but I do not know anything about it. My book doesn't teach this topic, at least in the current revision; maybe the new one I ordered does. In the meantime, if I should consider using it, I would need some 'For Dummies', dumbed-down explanation of how to go about using it. :)

Storing the words in a vector, though, is what I have had in mind as well. Either for keeping a word list, which I can't seem to do without as far as counting the individual words is concerned, or really only for keeping sentences if a match is discovered.

Enoizat, hello! Well, there is no restriction as to what I can and cannot use. The only limiting factor is my understanding. As with the stringstream suggested by Chervil, the same applies to regex: I know nothing about it.
Another idea
#include <cctype>
#include <cstdlib>
#include <iostream>
#include <string>

using namespace std;

void toUpper(string& s)
{
  for (char &ch: s)
    ch = toupper(ch);
}

bool findWord(string& haystack, string& needle)
{
  toUpper(haystack);
  toUpper(needle);

  auto pos = haystack.find(needle);
  if (pos == string::npos)
    return false;

  // TODO
  // check if char before and after needle == whitespace or other unwanted char
  return true; // dummy value just to compile
}


int main()
{
  string haystack = "For example, they would not realize the fundamental logical difference between a version of Pac-Man";
  string needle = "Pac";
  bool res = findWord(haystack, needle); // should be false
  if (res)
    cout << "ERROR";

  needle = "they";
  res = findWord(haystack, needle); // should be  true
  if (!res)
    cout << "ERROR";

  system("pause");
  return 0;
}
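
One possible way to fill in the TODO above, as a sketch only. isWordChar, upperCopy and containsWord are illustrative names; the hyphen is treated as part of a word here (an assumption) so that "Pac" does not match inside "Pac-Man", as the test in main() expects. Unlike findWord, it works on copies and leaves its arguments unchanged.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True for characters treated as part of a word: letters, digits and '-'.
// (Counting '-' is an assumption made so that "Pac" is rejected in "Pac-Man".)
bool isWordChar(char ch)
{
    return std::isalnum(static_cast<unsigned char>(ch)) != 0 || ch == '-';
}

// Returns an uppercase copy of s.
std::string upperCopy(std::string s)
{
    for (char& ch : s)
        ch = static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
    return s;
}

// True if needle occurs in haystack as a whole word, ignoring case.
bool containsWord(const std::string& haystack, const std::string& needle)
{
    if (needle.empty())
        return false;

    const std::string hay = upperCopy(haystack);
    const std::string ndl = upperCopy(needle);

    // Check every occurrence: the first hit may sit inside a longer word
    // ("understand") while a later one stands alone ("and").
    for (std::size_t pos = hay.find(ndl); pos != std::string::npos;
         pos = hay.find(ndl, pos + 1))
    {
        const std::size_t end = pos + ndl.size();
        const bool startOk = (pos == 0) || !isWordChar(hay[pos - 1]);
        const bool endOk = (end == hay.size()) || !isWordChar(hay[end]);
        if (startOk && endOk)
            return true;
    }
    return false;
}
```

The start of the line and the end of the line count as boundaries, which covers the "word at the beginning of a sentence" case.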
Stringstream: I would use it, but I do not know anything about it. My book doesn't teach this topic, at least in the current revision; maybe the new one I ordered does. In the meantime, if I should consider using it, I would need some 'For Dummies', dumbed-down explanation of how to go about using it. :)


"stringstream for dummies" code example:
#include <string>
#include <iostream>
#include <sstream>

using std::string;
using std::istringstream;
using std::ostringstream;
using std::cout;

int main()
{
    string line = "Players do not have to understand the logic "
                  "of the code but of the mediated game world "
                  "  Beyond the fantasy  there are always the rules  "
                  "argues Turkle  1984  83   but from the vantage ";
                                    
    cout << line << '\n';
    
    
    // 1. input with a stringstream
    istringstream ss(line);
    
    string word;
    
    while (ss >> word)
    {
        cout << word << '\n';
    }
    
    
    // 2. reuse existing stringstream
    
    ss.clear();              // reset flags
    ss.str("3.14159 76543"); // change contents
    double a = 0;
    int    b = 0;
    ss >> a >> b;
    cout << "a = " << a << "    b = " << b << '\n';
    
    
    // 3. output with a stringstream
    ostringstream out;
    
    out << "Hello world " << 123+456 << "  " << '\n' << "Friday";
    
    // get contents of out and display it.
    cout << out.str();

}

Basically, you can use a stringstream to do input or output as though you were using a file. But instead the contents are stored in a string.

Hints:
#include <fstream>
#include <iostream>
#include <limits>
#include <regex>
#include <string>
#include <utility>
#include <vector>


std::pair<int, std::vector<std::string>>
    fillWithMatches(std::ifstream& source, 
                    std::vector<std::string>& matches,
                    const std::string& searched);
void waitForEnter();


int main()
{
    bool again {false};
    do {
        std::cout << "Please give me the word to be found (no spaces!): ";
        std::string tobefound;
        std::cin >> tobefound;
        std::string filename("short.txt");
        std::ifstream infile(filename);
        std::vector<std::string> matches;
        auto result = fillWithMatches(infile, matches, tobefound);
        std::cout << "\nFound " << result.first << " matches in "
                  << matches.size() << " lines.\nDetails:\n";
        for(const auto& s : result.second) { std::cout << "--> " << s << '\n'; }
        infile.close();
        std::cout << "\nDo you want to perform another check [y, n]? ";
        char answer {'n'};
        std::cin >> answer;
        std::cin.ignore(1);
        if('y' == answer) { again = true; }
        else              { again = false; }
    } while(again);
    waitForEnter();
    return 0;
}


std::pair<int, std::vector<std::string>>
    fillWithMatches(std::ifstream& source, 
                    std::vector<std::string>& matches,
                    const std::string& searched)
{
    std::pair<int, std::vector<std::string>> result;
    std::string line;
    std::regex reg("\\b" + searched + "\\b", 
              std::regex_constants::ECMAScript | std::regex_constants::icase);
    while(std::getline(source, line)) {
        std::smatch sm;
        if(std::regex_search(line, sm, reg)) {
            result.first += sm.size();
            matches.push_back(line);
        }
    }
    result.second = matches;
    return result;
}

void waitForEnter()
{
    std::cout << "\nPress ENTER to continue...\n";
    std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
}

output:
Please give me the word to be found (no spaces!): player

Found 1 matches in 1 lines.
Details:
--> Players do not have to understand the logic of the code but of the mediated game world. "Beyond the fantasy, there are always the rules, "argues Turkle (1984, 83), but from the vantage point of a player's experience, it is the fictional plane where the player makes sense of the game, the space of personal interpretation and assessment.

Do you want to perform another check [y, n]? y
Please give me the word to be found (no spaces!): players

Found 3 matches in 3 lines.
Details:
--> Players of a video game do not look at the underlying code but at dynamically generated audiovisual and tactile results based on it.
--> Players do not have to understand the logic of the code but of the mediated game world. "Beyond the fantasy, there are always the rules, "argues Turkle (1984, 83), but from the vantage point of a player's experience, it is the fictional plane where the player makes sense of the game, the space of personal interpretation and assessment.
--> Overwhelmingly, most game players stay on the level of the fantasy world when playing a video game.

Do you want to perform another check [y, n]? n

Press ENTER to continue...


short.txt:
Players of a video game do not look at the underlying code but at dynamically generated audiovisual and tactile results based on it.
They look at the mediated plane and see the performance of the code.
The code itself stays hidden behind elaborate virtual worlds and interfaces, and the only time one might encounter it is when an error crashes the program and a debug message points to certain lines of broken code.
Players do not have to understand the logic of the code but of the mediated game world. "Beyond the fantasy, there are always the rules, "argues Turkle (1984, 83), but from the vantage point of a player's experience, it is the fictional plane where the player makes sense of the game, the space of personal interpretation and assessment.
Overwhelmingly, most game players stay on the level of the fantasy world when playing a video game.
For example, they would not realize the fundamental logical difference between a version of Pac-Man (Iwatani 1980) running on the original Z80 microprocessor of the arcade board or a Java or C++ version of the game emulated on a Pentium processor under Windows.

(Michael Nitsche - Video Game Spaces Image, Play, and Structure in 3D Game Worlds)
(Copyright: 2008 Massachusetts Institute of Technology)

(I added a newline after every full stop)
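
A note on the count in the output above: sm.size() is the number of sub-matches in a single match (the whole match plus any capture groups), which is 1 for this pattern, so result.first goes up once per matching line. That is why "player" reports 1 match although its line contains both "player's" and "player". If every occurrence should be counted, std::sregex_iterator is one option. A sketch, where countOccurrences is an illustrative name and the word is assumed to contain no regex special characters:

```cpp
#include <iterator>
#include <regex>
#include <string>

// Counts every whole-word, case-insensitive occurrence of word in line.
// Assumes word contains no regex metacharacters (it is inserted verbatim).
int countOccurrences(const std::string& line, const std::string& word)
{
    const std::regex reg("\\b" + word + "\\b",
                         std::regex_constants::ECMAScript | std::regex_constants::icase);
    const auto first = std::sregex_iterator(line.begin(), line.end(), reg);
    const auto last = std::sregex_iterator();   // default-constructed = end of matches
    return static_cast<int>(std::distance(first, last));
}
```

Inside fillWithMatches, the result of countOccurrences(line, searched) could be added to result.first instead of sm.size(), keeping the line whenever that number is greater than zero.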
Thomas, thanks for providing a second example to count the number of words!

-------

Chervil, thank you for the 'for Dummies' example. So, basically, sstream works just like any of the other streams (istream/ostream/fstream), with, I presume, the same member functions, but applied to string objects? I've been trying to dig into this topic, and the Google book search offered results. But many of those were of little use, and without your example I would have thought that it is mostly just used for:

a) Type conversion, e.g. int -> string, without the hassle of type casting and such.
b) Formatting numbers. In one example I found, the text '3.4' was read via sstream into two variables: 3 into an int and .4 into a float.
c) Formatting output.

I will have to learn a bit more about stringstream and see how I can get it to work. It seems to be something I really should know and learn about in general and, if not for this project, certainly for some upcoming projects. :)
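
For reference, the three uses listed in a) to c) might look like this with stringstreams. This is only a sketch; intToString, stringToDouble and formatFixed are illustrative names, not standard functions.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// a) Type conversion: number -> string.
std::string intToString(int value)
{
    std::ostringstream out;
    out << value;
    return out.str();
}

// b) Parsing: string -> number. Returns false if text does not start with a number.
bool stringToDouble(const std::string& text, double& value)
{
    std::istringstream in(text);
    return static_cast<bool>(in >> value);
}

// c) Formatting: fixed notation with a chosen number of decimals.
std::string formatFixed(double value, int decimals)
{
    std::ostringstream out;
    out << std::fixed << std::setprecision(decimals) << value;
    return out.str();
}
```

The same operators and manipulators used with cout and cin apply here; only the destination (or source) is a string instead of the console or a file.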

--------

Enoizat, thank you very much for providing a working example! It seems easy enough, even for me, to understand what is going on. Only, if I may, what does this do?

  std::regex reg("\\b" + searched + "\\b",  
              std::regex_constants::ECMAScript | std::regex_constants::icase);


What is the meaning of "\\b" in there?
And icase? Does it mean the match is case-insensitive, no matter the input (Fundamental/fundamental)? Or does it do some sort of conversion of whatever text is being processed?
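
For what it's worth, a small sketch of what those two pieces do. In C++ source, "\\b" is the regex \b, a zero-width assertion that matches at a word boundary (between a letter, digit or underscore and anything else, or at the start or end of the text). The icase flag only changes how characters are compared; the text being searched is not converted or modified. hasWholeWord is an illustrative name, and the word is assumed to contain no regex special characters.

```cpp
#include <regex>
#include <string>

// True if word occurs in text as a whole word, ignoring case.
// The two "\\b" anchors stop "and" from matching inside "understand";
// icase lets "Fundamental" match "fundamental" without touching text.
bool hasWholeWord(const std::string& text, const std::string& word)
{
    const std::regex reg("\\b" + word + "\\b",
                         std::regex_constants::ECMAScript | std::regex_constants::icase);
    return std::regex_search(text, reg);
}
```

So hasWholeWord("I understand", "and") is false, while hasWholeWord("Players of a video game", "players") is true even though the word starts the line and differs in case.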
Thanks to all your great suggestions and code examples I have been able to come up with a working solution! So thank you very, very much for everything, Thomas, Enoizat, and Chervil! Chervil, you are named last, but your ideas put me in the right direction. :-)

So, here is the code, as yet without comments and without one or two additional features.

#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <fstream>
#include <iostream>

using std::string;
using std::ios;
using std::cin;
using std::cout;
using std::fstream;
using std::stringstream;

struct GetFile
{
	string fileName;		/* The file name	*/
	fstream textFile;		/* fstream object */

	GetFile(string fName = "")
	{
		fileName = fName;
	}

	~GetFile()
	{
	}
};

struct TextData
{
	int    countOcc;			/* Counts the occurrences of words found in the text		*/
	string searchWord;		/* Holds the word to search for in the text					*/
	string tmpText;			/* A temporary variable to hold text while it is read in */
	string getText;			/* Holds the text read in from the file						*/
	string searchResult;		/* Holds the search result											*/

	TextData(int cnt = 0, string sw = " ", string tmp = " ", string gTxt = " ", string sRes = " ")
	{
		countOcc = cnt;
		searchWord = sw;
		tmpText = tmp;
		getText = gTxt;
		searchResult = sRes;
	}

	~TextData()
	{
	}
};

int openFile(GetFile &, TextData &);
void changeCase(TextData &);
void removePunct(TextData &);
void performTextSearch(TextData &);
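void getWordFreq(TextData &);
void pauseSystem();

int main()
{
	GetFile fileData;
	TextData search;

	openFile(fileData, search);

	pauseSystem();
	return 0;
}

/* pauseSystem() was called but not included in the posted code; this is an
   assumed minimal version that waits for ENTER before the program exits. */
void pauseSystem()
{
	cin.ignore();	/* discard the newline left behind by the last cin >> */
	cin.get();
}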

int openFile(GetFile &fileData, TextData &search)
{
	cout << "Please enter a file name: ";
	cin >> fileData.fileName;

	while (fileData.fileName.empty())
	{
		cin >> fileData.fileName;
	}

	fileData.textFile.open(fileData.fileName, ios::in);

	if (!fileData.textFile.fail())
	{
		cout << "Please enter a word to search for: ";
		cin >> search.searchWord;

			while (getline(fileData.textFile, search.tmpText))
			{		
				search.getText = ' ' + search.tmpText + ' ';

				removePunct(search);
				changeCase(search);
				performTextSearch(search);
				getWordFreq(search);
			}
	}
	else
	{
		cout << fileData.fileName << " could not be opened.";
	}
	fileData.textFile.close();

	cout << "\n\tThe word " << search.searchWord << " has been found " << search.countOcc << " times.";
	cout << "\n\n" << search.searchResult << " \n";

	return 0;
}

void changeCase(TextData &search)
{
	for (int i = 0; i < search.getText.size(); i++)
	{
		search.getText[i] = tolower(search.getText[i]);
	}
}

void removePunct(TextData &search)
{
	string delims = "\"',.:()-!?";

	for (int i = 0; i < delims.size(); i++)
	{
		search.getText.erase(remove(search.getText.begin(), search.getText.end(), delims[i]),
			search.getText.end());
	}
}

void performTextSearch(TextData &search)
{
	if (search.getText.find(' ' + search.searchWord + ' ') != string::npos)
	{
		search.searchResult.append(search.tmpText + "\n");
	}
}

void getWordFreq(TextData &search)
{
	stringstream ss(search.getText);

	while (ss >> search.getText)
	{
		if (search.getText == search.searchWord)
		{
			++(search.countOcc);
		}
	}
}


Output:
https://drive.google.com/open?id=0B0zM1FCMMJf8X3dUcERWak51NVE

https://drive.google.com/open?id=0B0zM1FCMMJf8RkxLMmt5SW44VEk

https://drive.google.com/open?id=0B0zM1FCMMJf8dXBOUzlKenlrWG8