I Need guidance for Text Extraction (New to C++)

Oops, sorry. This is my first time posting code in the forum. I will remember to use the tags when posting code.

Do you mean that when I want to 'cout', for example, the words "I am flubber", I can also use the >> operator? From the example above, it means that I am reading 3 keywords, am I right?
For output ( cout = character output ) operations you need the << operator, for input >>.
My examples show how to read 3 words using 1. three variables, 2. a fixed size array, 3. a dynamic array
If you want to display the words you read you can use cout << after reading.
You can also output multiple words from the file using just one string variable
eg:
for ( int i = 0; i < words_to_read; i++ ) // repeat next words_to_read times
{
    string word; // declare a string ( if you want you can move this before the loop )
    file >> word; // read from file
    cout << word << ' '; // display in standard output, separated by a space
}
I think I am getting confused now from trying to digest too much C++.

I wonder how you can be so pro with all this. :O

Nevertheless, I will try the examples first and see if I can make the extraction work.

I will post my findings soon.

Thanks Bazzy!
The only way of getting good at something is to get as much experience as possible
If you try to combine my examples and modify them just for experimentation, you'll learn many new things
Hi Bazzy,

I have tried editing the example to extract more than one keyword, but the output is blank. Can you advise what I did wrong?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
string keyword_director = "Director:";
string keyword_runtime = "Runtime:";
ifstream file ( "test.txt" );
string temp, temp2;

while ( file.good() )
{
    file >> temp, temp2; // read word from file

    if ( temp == keyword_director ) // check the  first keyword word
    if ( temp2 == keyword_runtime ) // check the another keyword
    {
        file >> temp; // read next word
        file >> temp2; // read another keyword
        
        cout << "The Director of the Movie is: " << temp << "The Runtime for the movie is: "<< temp2; // display both output
       
    }
I also found out that I need to implement them into a class. Is this possible?
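Line 8 should be file >> temp >> temp2; ( with a comma, file >> temp, temp2; reads only temp and then evaluates temp2 without reading anything into it )
Notice that your program would reach line 16 ( the cout ) only if it finds this in the text file: "Director: Runtime: xxxx yyyy".
The format "Director: xxxx Runtime: yyyy" would be more logical, so what you need would be: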
(pseudocode)
read temp string;
if temp == "Director"
    read temp char to skip ':'
    read next word
    display "The Director of the Movie is: " last word you read
    read next temp string
    if temp == "Runtime"
        read temp char to skip ':'
        read next word
        display "The Runtime for the Movie is: " last word you read     
To use a class I suppose you need to overload the >> operator and read the data into the class members
eg:
class C
{
    private:
       string something;
    public:
      //...
      friend istream &operator >> ( istream &, C & ); // friend functions can access private members
};
istream &operator >> ( istream &is, C &obj )
{
    is >> obj.something; // use is to get the input instead of a specific ifstream and give the value to the class members
    return is; // return the modified stream
}
Oh my god! Using a class is getting more complicated. I will try to solve my first issue before implementing it in a class. I need to understand the basics thoroughly first. Thanks.
Hi Bazzy,

I've been going through the explanation that you posted regarding Flubber's code. For example, based on Flubber's code and excluding the search for the 2nd keyword, how do I go about making it into a class? I am really lost about classes and I really need help with them.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
string keyword_director = "Director:";
ifstream file ( "test.txt" );
string temp;

while ( file.good() )
{
    file >> temp, temp2; // read word from file

    if ( temp == keyword_director ) // check the  first keyword word
    {
        file >> temp; // read next word
        
        cout << "The Director of the Movie is: " << temp << endl;
       
    } 

One more thing, just to clarify: is the class invoked by the main code, or vice versa?
Thanks in advance!

:-)
The class is used by main or by whatever other function needs it. Line 7 of your code ( file >> temp, temp2; ) doesn't need , temp2 ( it shouldn't even compile if temp2 isn't declared )
You need to build a class using the members you need.
This could be an example:
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

class Movie
{
     private:
          std::string director;
          // whatever info you need
     public:
         friend istream & operator >> ( istream &, Movie & ); // for input
         friend ostream & operator << ( ostream &, const Movie & ); // for output
};
istream & operator >> ( istream &is, Movie &mov )
{
   //move input actions here
   return is;
}
ostream & operator << ( ostream & os, const Movie &mov )
{
    //output actions here eg:
   os << "The Director of the Movie is: " << mov.director;
   return os;
}

int main()
{
    ifstream file("filename");
    Movie myMovie;
    file >> myMovie; // Notice how you are using the operators you overloaded
    cout << myMovie;
    return 0;
}

Tutorials on classes:
http://www.cplusplus.com/doc/tutorial/classes/
http://www.cplusplus.com/doc/tutorial/classes2/
http://www.cplusplus.com/doc/tutorial/inheritance/
Hi all,

I can't believe I'm not the only one trying out this "text scraping".
Meaning to say, I am just like you guys!

But before I do the scraping, I would like to invoke the html2text utility using C++.
I am doing this on Ubuntu. I have tested the html2text utility and it works fine; it gives me the text files that I need.
The utility is executed via Ubuntu's terminal (on Windows, that would be the command prompt).

Is it possible for me to run a terminal command from inside my C++ code, so that it executes the html2text utility?

I really have tried to find clues by searching the internet, but I can't find a good solution.

Your help will be greatly appreciated.
Hi Bazzy,

I have tried editing my code to the following. I need "Director:" as the keyword to extract the name of the director. That is why I never skip the ":".



1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
string keyword_director = "Director:";
string keyword_runtime = "Runtime:";
ifstream file ("text.txt");
string temp, temp2;


    while ( file.good() )
    {
        file >> temp;
        if ( temp == keyword_director ) // check the  first keyword word
        {
            file >> temp; // read next word
            cout  << "The Director of the Movie is: " << temp << endl;
        }

        file >> temp2;
        
        if ( temp2 == keyword_runtime )
        {
            file >> temp2;
            cout << "The Runtime for the Movie is: " << temp2 << endl;
        }



However, the console output only shows "The Runtime For the Movie is: Results"

My Director does not display. Did I enter anything wrong? Is my "read next temp string" step correct?
For the Movie class example that you have shown, on which line can I enter my keywords such as "director:" so that it will find them?
If the file is formatted "Director: ... Runtime: ..." and you want to read 'Runtime' only if 'Director' was successful, you need to extend the if block that starts at line 10 so that it includes all the 'Runtime' code ( up to line 22 )
If that doesn't display anything for 'Director:', it means that it didn't find that exact string in the file.
The comparison is exact, so if the file has 'director:' or 'Director :' this code won't work

@ mrTulang
If you need to launch a command from your program as if from the console you can use system("command"); note that it is better to avoid this if you can ( http://www.cplusplus.com/forum/articles/11153/ )
Do you mean that class cannot use "Director:" as the keyword to find? I am now quite lost.
I mean that you may be looking for the wrong keywords. Can you post some lines of the file you are reading from?
Thanks bazzy!!!

Thanks for leading me the right way.

I have the command working flawlessly now!

Let me just show you the code.

#include <cstdlib> // system, exit

int main (int nNumberofArgs, char* pszArgs[])
{
    int i;

    // system(NULL) returns nonzero only if a command processor is available
    if (!system(NULL))
        exit (1);

    // this is the main one!
    i = system ("html2text -style pretty -nobs -o /home/user/Documents/test.txt /home/user/Documents/tt0373889.html");
}

Again, thanks a lot bazzy!
Hey, sorry to disturb you again bazzy.

But how would I do it if I were to make this particular system("command") call into a class?

And how would I call it from my main program?

Currently I'm using the text extraction code that you gave to flubber.
I put the system("command") call in the same main source file, above the text extraction code.

I do not want my main code to be a mess, so I would like to move the system("command") code into a class that is called from main.




Hi Bazzy,

In my test.txt file, I have a whole chunk of sentences. But I have already identified the unique keywords that can only be found once. For example,


Text File

This is a story of etc etc. I do not anything here. And blah blah.

Director: Alex Proyas

This is also another words and sentences not relevant.

Runtime: 111 Mins

And it goes on and on with unwanted information in the sentences.

Date: 12 July

And full of text.....


So I have identified "Director:" as the first keyword, so I can cout 'Alex'.

Therefore I also want to cout '111' using "Runtime:" as the keyword.

And Date: and so forth.

I am now stuck on showing more than 1 word (I want to show Alex Proyas instead of only Alex) for the director cout, and also on showing all the cout information.

:( Still stuck in front of my laptop for 6 hrs straight figuring it out.

Ok so the general format is this:
some text
Keyword: values
some other text


To get the value use getline instead of >> so you can read everything until the newline character ( '\n' ) is found
When you have to choose the class members, use the same keywords you are looking for