Project Help

"You are expected to write a C++ console application which reads files from the Reuters-21578 document collection, which appeared on the Reuters newswire in 1987, and finds the top 10 most frequent words used in the newswire articles. The Reuters-21578 collection is distributed in 22 files. Each of the first 21 files (reut2-000.sgm through reut2-020.sgm) contains 1000 documents, while the last (reut2-021.sgm) contains 578 documents.
Each article starts with an "open tag" of the form:
<REUTERS TOPICS=?? LEWISSPLIT=?? CGISPLIT=?? OLDID=?? NEWID=??>
where the ?? are filled in an appropriate fashion and ends with a "close tag" of the form:
</REUTERS>
Here is an example of these article entries in the file:
<REUTERS ... >
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> ... </UNKNOWN>
<TEXT> ...
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE> SALVADOR, Feb 26 - </DATELINE>
<BODY>Showers continued throughout
the week in the Bahia cocoa zone, alleviating the drought since
...
...
Brazilian Cocoa Trade Commission after
carnival which ends midday on February 27.
Reuter
&#3;</BODY></TEXT>
</REUTERS>
Your program must be able to read the words in each article between the <BODY> … </BODY> tags and insert each unique word into a suitable data structure.
Stopwords
A list of stopwords is supplied in the stopwords.txt file. You should not count these words.
After reading and processing are finished, your program must print the “top 10” most frequent words used in these articles in descending order.
Additionally, the total time elapsed from the start of your program to the end of printing the top 10 must be calculated and printed at the end of execution.
Here is an example output:
<word1> <word count>
<word2> <word count>
<word3> <word count>
<word4> <word count>
<word5> <word count>
...
<word10> <word count>
Total Elapsed Time: X seconds
The whole application can be implemented with console facilities (you do not need advanced GUI elements). The project consists of two parts.
A. Implementation of a data structure:
This will be a proper C++ class. You must be able to create many instances of this class.
(Please do not use third-party libraries or the C++ STL, Boost, etc.) However, you can use IO and string classes such as iostream, ctime, fstream, and string.
B. The main program itself. In the main function, you must create a list of words."

I know this is not traditional in this forum, but I need an explanation of how to think about this project. I've been struggling with it for over a week. Help is appreciated.
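
For part A, one possible shape for the data structure is a small hash table that maps each word to its count, built only from raw arrays, pointers, and std::string. This is a sketch under assumptions: the class name WordTable, the bucket count, and the djb2-style hash are choices made here for illustration and are not given in the assignment.

```cpp
#include <cstddef>
#include <string>

// Hypothetical class: a fixed-size hash table with separate chaining.
class WordTable
{
    struct Node
    {
        std::string word;
        int count;
        Node* next;
    };

    static const std::size_t kBuckets = 65536;  // Assumed size; tune as needed.
    Node** buckets_;

    // djb2-style string hash (an illustrative choice, not required by the task).
    static std::size_t hash(const std::string& s)
    {
        std::size_t h = 5381;
        for (std::size_t i = 0; i < s.size(); ++i)
            h = h * 33 + static_cast<unsigned char>(s[i]);
        return h % kBuckets;
    }

public:
    WordTable() : buckets_(new Node*[kBuckets]()) {}  // () sets every bucket to null.

    ~WordTable()
    {
        for (std::size_t b = 0; b < kBuckets; ++b)
        {
            for (Node* n = buckets_[b]; n != nullptr; )
            {
                Node* next = n->next;
                delete n;
                n = next;
            }
        }
        delete[] buckets_;
    }

    WordTable(const WordTable&) = delete;             // Copying is left out of this sketch.
    WordTable& operator=(const WordTable&) = delete;

    // Records one more occurrence of word.
    void add(const std::string& word)
    {
        const std::size_t b = hash(word);
        for (Node* n = buckets_[b]; n != nullptr; n = n->next)
        {
            if (n->word == word) { ++n->count; return; }
        }
        buckets_[b] = new Node{ word, 1, buckets_[b] };  // New word: push to chain front.
    }

    // Returns how often word was added (0 if never).
    int count(const std::string& word) const
    {
        for (Node* n = buckets_[hash(word)]; n != nullptr; n = n->next)
        {
            if (n->word == word) return n->count;
        }
        return 0;
    }

    // Fills words[] and counts[] (each of size k) with the k most frequent
    // entries in descending order; returns how many slots were filled.
    std::size_t top(std::size_t k, std::string words[], int counts[]) const
    {
        std::size_t filled = 0;
        for (std::size_t b = 0; b < kBuckets; ++b)
        {
            for (Node* n = buckets_[b]; n != nullptr; n = n->next)
            {
                std::size_t pos = filled;
                while (pos > 0 && counts[pos - 1] < n->count) --pos;
                if (pos >= k) continue;          // Not frequent enough to place.
                if (filled < k) ++filled;
                for (std::size_t i = filled - 1; i > pos; --i)
                {
                    words[i] = words[i - 1];     // Shift lower entries down one slot.
                    counts[i] = counts[i - 1];
                }
                words[pos] = n->word;
                counts[pos] = n->count;
            }
        }
        return filled;
    }
};
```

In main you would call add() for every word that is not a stopword, then call top(10, words, counts) once at the end and print the two arrays. The order of words with equal counts depends on bucket order, so ties may come out in any order.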

 
Start by opening one of the files and reading the entire contents into a string.
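
A minimal sketch of that first step, using only fstream and string, could look like the following. The function name readWholeFile is made up for illustration, and the approach (seek to the end to get the size, then read in one call) is just one of several ways to do it.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: reads the whole file at path into out.
// Returns false if the file cannot be opened or fully read.
bool readWholeFile(const std::string& path, std::string& out)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;

    // Find the file size by seeking to the end, then go back to the start.
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    in.seekg(0, std::ios::beg);
    if (size < 0) return false;

    out.resize(static_cast<std::size_t>(size));
    if (size == 0) return true;  // Empty file: nothing to read.

    in.read(&out[0], size);
    return in.gcount() == size;
}
```

Once the contents are in one string, the <BODY> sections can be located with std::string::find instead of going line by line.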
[Duplicate thread]
This has been asked previously multiple times on this forum - with answers provided!
Hello KareemRj,

I do not know about the [Duplicate thread] or what has been said so far.

But I ask, PLEASE keep the "stopwords.txt" file to yourself. You would not want anyone to know about this file or how to use it in your program.

This is not the solution to your problem and it does not work 100% yet, but it is a start and something to consider.
#include <iostream>
#include <iomanip>
#include <string>
#include <limits>
#include <ctime>

#include <fstream>

int main()
{
    const std::string fileNames[]  // <---Mostly for future use.
    {
        "reut2-000.sgm",
        "reut2-001.sgm",
        "reut2-002.sgm"
        // <--- The rest of the file names here. Follow the above format.
    };
    const std::string inFileName{ fileNames[0] };  // <--- Put File name here.

    std::ifstream inFile(inFileName);

    if (!inFile)
    {
        std::cout << "\n File " << std::quoted(inFileName) << " did not open." << std::endl;
        //std::cout << "\n File \"" << inFileName << "\" did not open." << std::endl;

        return 1;
    }

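    bool found{};  // True while we are inside a <BODY> section.
    std::string line;

    while (std::getline(inFile, line))
    {
        const std::size_t pos = line.find("<BODY>");

        if (pos != std::string::npos)
        {
            // Print whatever follows the opening tag on the same line.
            std::cout << line.substr(pos + 6) << '\n';
            found = true;
        }
        else if (found && line != " Reuter" && line != " REUTER")
        {
            std::cout << line << '\n';
        }
        else
        {
            // Either the " Reuter" sign-off ended the body or we are outside one.
            found = false;
            std::cout << '\n';
        }
    }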

    // <--- Keeps console window open when running in debug mode on Visual Studio. Or a good way to pause the program.
    // The next line may not be needed. If you have to press enter to see the prompt it is not needed.
    //std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');  // <--- Requires header file <limits>.
    std::cout << "\n\n Press Enter to continue: ";
    std::cin.get();

    return 0;  // <--- Not required, but makes a good break point.
}

You can see in the above code that I was using the word " Reuter" to know when the text of the body was finished. Then I found that " REUTER" is also used, and finally I found ""REUTER" at the end of the last line of the body.

It makes reading the file interesting.

Andy
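
Since the sign-off line varies, another option is to ignore it as a marker and cut the text at the closing </BODY> tag instead. Here is a rough sketch, assuming the file contents are already held in one string; the function name nextBody and its parameters are invented for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: finds the next <BODY>...</BODY> pair at or after pos.
// On success, stores the enclosed text in body, moves pos just past the
// closing tag, and returns true. Returns false when no complete pair is left.
bool nextBody(const std::string& text, std::size_t& pos, std::string& body)
{
    const std::string openTag = "<BODY>";
    const std::string closeTag = "</BODY>";

    const std::size_t start = text.find(openTag, pos);
    if (start == std::string::npos) return false;

    const std::size_t begin = start + openTag.size();
    const std::size_t end = text.find(closeTag, begin);
    if (end == std::string::npos) return false;

    body = text.substr(begin, end - begin);
    pos = end + closeTag.size();
    return true;
}
```

Calling it in a loop, as in while (nextBody(text, pos, body)) { ... } with pos starting at 0, visits every article body in the file. The extracted text still ends with the "Reuter" sign-off and the &#3; entity shown in the sample, so those would need to be skipped when splitting the body into words.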
@seeplus my apologies, but none of your answers were well understood and most of you misunderstood my questions.

Thank you Andy
@Andy, no sense listing all the filenames since they can be generated easily:

#include <iostream>
#include <iomanip>
#include <sstream>

int main()
{
    for (int n = 0; n < 22; ++n)
    {
        std::ostringstream filename;
        filename << "reut2-" << std::setfill('0') << std::setw(3) << n << ".sgm";
        std::cout << filename.str() << '\n';
    }
}    

And here's the original post:
http://www.cplusplus.com/forum/windows/275322/
I do not know about the [Duplicate thread] or what has been said so far.


http://www.cplusplus.com/forum/windows/275322/
http://www.cplusplus.com/forum/beginner/275446/
