Parsing a large text file

I am working on a program that reads in a text file containing all of the written works of Shakespeare and then lets the user search for a word to see how many times it comes up in each work. First I must split the text into the separate books, and then split each book into paragraphs, so that I can search for the word in each paragraph of each work. There is a blank line between paragraphs, and every book starts with "BOOK: " followed by the title of the book and ends with "The End". My instructor told us to use the following class definitions. I am confused as to how to identify each book and create a book vector. If anyone could offer some guidance it would be greatly appreciated!

#include <iostream>
#include <string>
#include <vector>

using namespace std;

// One paragraph of text from a book
class paragraph {
private:
    string text;
public:
    paragraph();
    void setText(string p);
    bool search(string word);
    void display();
};

// One book: a title plus its paragraphs
class book {
private:
    string title;
    vector<paragraph> paragraphs;
public:
    book();
    void setTitle(string title);
    string getTitle();
    int search(string word);
    void add(paragraph p);
    void clear();
    int getParaCount();
    void displayMatches(string search);
};


string readParagraph( istream& is )
{
    string line;
    string paragraph;

    // Skip blank lines until the first line of the next paragraph
    while (getline(is, line) && line.empty()) {
    }

    // Return nothing if the stream ended before a paragraph was found
    if (!is) {
        return "";
    }

    // Collect lines until a blank line or the end of the stream.
    // Testing the result of getline (instead of eof()) keeps the last
    // line of the file even when it has no trailing newline.
    do {
        // Put a newline between lines, but not before the first one
        if (!paragraph.empty()) {
            paragraph += "\n";
        }
        paragraph += line;
    } while (getline(is, line) && !line.empty());

    return paragraph;
}

The program should have a vector of book objects.
Can you post the first and last 2-3 lines of one book?

Is the code above provided by the professor?
The first and biggest task is reading the file and extracting each book as a string. Splitting the book into paragraphs and searching come next.
I assume the file will be rather large, so reading it line by line seems the best option.
I would start by creating a short test file, something like this:

Test.txt
"Book: " Some title
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt
ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo
dolores et ea rebum.
"The End"
"Book: "Some other title
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor 
invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam 
et justo duo dolores et ea rebum.
"The End"
"Book: "Another different title
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy 
eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam 
voluptua. At vero eos et accusam et justo duo dolores et ea rebum.
"The End"


The next step would be creating a function called ReadBook and displaying the books on the screen, just as a test.
In main you create an input stream and call the ReadBook function until the end of the stream is reached.

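#include <fstream>
#include <iostream>
#include <string>

using namespace std;

string ReadBook(ifstream& src);

int main()
{
  string filename = "Test.txt";
  string sep(60, '=');
  ifstream src(filename);

  if (!src)
  {
    cerr << "Could not open " << filename << endl;
    return 1;
  }

  while (src)
  {
    string book = ReadBook(src);
    if (book.empty())
      break; // nothing more to read

    cout << sep << endl;
    cout << book << endl;
    cout << sep << endl;
  }

  return 0;
}

string ReadBook(ifstream& src)
{
  // your code here
  return ""; // placeholder so the skeleton compiles
}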


To read a book you read one line of text. If the line contains "Book:" then you know that you have reached the beginning of a book. Add this line to a string. Then keep reading lines and adding them to the string until a line contains "The End".
Try to implement it.
Thank you so much for responding! I have written a function that reads paragraphs line by line (the readParagraph function in my first post) and am working on identifying whether a line contains "Book: ". Once I have done that, how would you recommend I get the title? How would I take in only the words that come after "Book: " on that line?
To check if a string contains another string you need to use the find function:
http://www.cplusplus.com/reference/string/string/find/
If you find the marker in the line, then you need the substr function to get everything after Book:
http://www.cplusplus.com/reference/string/string/substr/

See if it works out.
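If it helps, the find and substr calls can be combined roughly like this. It is only a sketch: ExtractTitle is a name I made up, the marker is assumed to be spelled "Book:", and the trimming of spaces around the title is an extra step that may or may not be wanted.

```cpp
#include <string>

// Sketch: returns the text that follows "Book:" on a title line, with
// surrounding whitespace removed. Returns an empty string if the marker
// is missing or nothing follows it.
std::string ExtractTitle(const std::string& line)
{
    const std::string marker = "Book:";   // assumed spelling
    const std::string blanks = " \t\r";

    // find returns std::string::npos when the marker is not in the line
    const std::string::size_type pos = line.find(marker);
    if (pos == std::string::npos)
        return "";

    // Everything after the marker, up to the end of the line
    const std::string rest = line.substr(pos + marker.length());

    // Trim leading and trailing blanks
    const std::string::size_type first = rest.find_first_not_of(blanks);
    if (first == std::string::npos)
        return "";
    const std::string::size_type last = rest.find_last_not_of(blanks);
    return rest.substr(first, last - first + 1);
}
```

The result could then be passed to setTitle on a book object before adding the paragraphs.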