Object Oriented


Hello, I've been trying to read a certain document from a .sgm file, but it contains tags and characters like "<" and ">" which should not be printed to the console, and I still haven't figured out how to remove them from the text I read in. Any insight?
Basically, the article that the code is supposed to read sits between two strings, "<BODY" and "</BODY>". The code has to read what is between those two strings.

Ganado (6814)

See:
http://www.cplusplus.com/reference/string/string/find/
http://www.cplusplus.com/reference/string/string/substr/

#include <iostream>
#include <string>

int main()
{
    const std::string open_tag = "<BODY>";
    const std::string close_tag = "</BODY>";

    std::string text = "this is junk <BODY>This is an example</BODY>this is also junk";
    
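    // find() returns std::string::npos when the tag is missing
    const auto pos = text.find(open_tag);
    if (pos == std::string::npos) {
        std::cout << "no " << open_tag << " tag found\n";
        return 1;
    }

    const auto start = pos + open_tag.length(); // first character after <BODY>
    const auto endpos = text.find(close_tag, start);
    if (endpos == std::string::npos) {
        std::cout << "no " << close_tag << " tag found\n";
        return 1;
    }

    // substr takes (start, count), so the count is endpos - start
    std::string inner_text = text.substr(start, endpos - start);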
    
    std::cout << inner_text << '\n';
    
    return 0;
}

This is an example

jonnin (11443)

This is a problem that can be huge or trivial.
If you are doing XML or HTML type work, you may want a full answer, not a partial one, that can handle many tags, nested tags, and much complexity. There are libraries to handle these common formats.
If you are dealing with a custom / hand-rolled / unpopular format and can't find a tool, but need to deal with complexity, you can use C++'s regular expressions and your own logic to unravel the tags. Typically it will work like a code indenter or bracket checker: you push each opening tag onto a stack and pop it when its closing tag arrives, and if the stack has leftovers at the end, the tags were mismatched.

If it is a simple file with a couple of tags and nothing too complicated, raw string processing is very fast and easy and can do the job. This is what the example above does, and if your problem description is the real problem in full, it is the right way to do it.

Your title is odd; what about OOP were you thinking? You can use an object to parse, but it isn't necessary (C++ allows code without objects). I can't think of a reason to make an object here, but there is no reason not to, either :)

Ganado (6814)

Right, I should have mentioned that my code only handles the simplest case of <begin delimiter>other text that isn't a delimiter<end delimiter>. It doesn't handle nested duplicate tags, check for parsing errors, or handle more complicated things like CDATA sections.
A true XML/HTML parser requires a stack (or equivalent) to keep track of state. Search "C++ XML parser" for a variety of open-source libraries to choose from.
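
To make the stack idea concrete, here is a minimal sketch of a tag-balance checker. The function name tags_balanced is made up for illustration, and it assumes tags of the plain form <NAME> and </NAME> with no attributes, comments or CDATA, so it is nowhere near a real parser:

```cpp
#include <cstddef>
#include <stack>
#include <string>

// Hypothetical helper: returns true if every <NAME> is closed by a matching
// </NAME> in the right order. Assumes tags carry no attributes.
bool tags_balanced(const std::string& text)
{
    std::stack<std::string> open_tags;
    std::size_t pos = 0;

    while ((pos = text.find('<', pos)) != std::string::npos) {
        const std::size_t end = text.find('>', pos);
        if (end == std::string::npos) return false;       // unterminated tag

        const std::string name = text.substr(pos + 1, end - pos - 1);
        if (!name.empty() && name[0] == '/') {             // closing tag
            if (open_tags.empty() || open_tags.top() != name.substr(1))
                return false;                              // closes the wrong tag
            open_tags.pop();
        } else {
            open_tags.push(name);                          // opening tag
        }
        pos = end + 1;
    }
    return open_tags.empty();                              // leftovers mean a mismatch
}
```

Calling tags_balanced("<TEXT><BODY>hi</BODY></TEXT>") should give true, while swapping the two closing tags should give false.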

KareemRj (34)

<REUTERS ... >
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> ... </UNKNOWN>
<TEXT> ...
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE> SALVADOR, Feb 26 - </DATELINE>
<BODY>Showers continued throughout
the week in the Bahia cocoa zone, alleviating the drought since
...
...
Brazilian Cocoa Trade Commission after
carnival which ends midday on February 27.
Reuter
</BODY></TEXT>
</REUTERS>
This is the file the code is supposed to read from, taking the text between the tags <BODY> and </BODY>. This is a term project and specific IO libraries have been issued, so I just need the simple concept to implement :D. Thanks in advance.

KareemRj (34)

Obviously I am not that experienced with programming since I am a new student, but is there also any way to read from a folder that has multiple files? I know a for loop should be implemented, but I still don't know how to loop over a const string which is the file name. (The file name is to be entered by the user, and the program should loop through numerically sorted files, i.e. file1, file2, file3, etc.)

JLBorges (13770)

> the code is supposed to read from between the tags <BODY> and </BODY>

With the standard C++ library, something like this, perhaps:

#include <iostream>
#include <string>
#include <regex>
#include <iterator>
#include <fstream>

// extract the contents between <BODY> and </BODY> in the multiline string str
std::string extract_body( std::string str )
{
    // with the default ECMAScript grammar, '.' does not match a new line
    // (std::regex::multiline would not help: it only changes how ^ and $ behave)
    // work around: replace new lines with ASCII NAK (we assume that str does not contain NAK characters)
    constexpr char NAK = 21 ;
    for( char& c : str ) if( c == '\n' ) c = NAK ;
    static const std::regex body_re( "\\<BODY\\>(.*)\\</BODY\\>" ) ;

    std::smatch match ;
    if( std::regex_search( str, match, body_re ) )
    {
        std::string body( match[1] ) ;
        // restore the new lines and return the result
        for( char& c : body ) if( c == NAK ) c = '\n' ;
        return body ;
    }
    else return {} ; // no match; return an empty string
}

// get the contents of the text file as a string
std::string get_file_text( const std::string& path )
{
    if( std::ifstream file{path} )
        return { std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>{} } ;
    else return {} ; // failed to open file; return an empty string
}

int main()
{
    {
        // create a test file
        std::ofstream( "test.xml" ) <<
R"(<REUTERS ... >
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> ... </UNKNOWN>
<TEXT> ...
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE> SALVADOR, Feb 26 - </DATELINE>
<BODY>Showers continued throughout
the week in the Bahia cocoa zone, alleviating the drought since
...
...
Brazilian Cocoa Trade Commission after
carnival which ends midday on February 27.
Reuter
&#3;</BODY></TEXT>
</REUTERS>
)" ;
    }

    // extract the body and display it
    const std::string body = extract_body( get_file_text( "test.xml" ) ) ;
    std::cout << body << '\n' ;
}

http://coliru.stacked-crooked.com/a/5242b71ee553daaf

> but also is there anyway to read from a folder that has multiple files?

See: https://en.cppreference.com/w/cpp/filesystem/directory_iterator
There is a small example at the end of the page.

seeplus (6599)

Consider:

#include <iostream>
#include <string>
#include <utility>
#include <fstream>

int main()
{
	const std::string opent {"<BODY>"};
	const std::string closet {"</BODY>"};

	for (const auto& fn : {"reuters.txt"}) {	// Add more file names here as needed
		std::ifstream ifs(fn);

		if (ifs) {
			std::string body;

			for (auto [text, gotbod] {std::pair {std::string{}, false}}; std::getline(ifs, text); )
				for (size_t fnd {}, pos {}; fnd != std::string::npos; )
					if (gotbod)
						if (fnd = text.find(closet, pos); fnd != std::string::npos) {
							gotbod = false;
							body += text.substr(pos, fnd - pos);
							pos = fnd + closet.size();	// resume just past the closing tag
							std::cout << body << '\n';
							body.clear();
						} else
							body += text.substr(pos) + "\n";
					else
						if (fnd = text.find(opent, pos); fnd != std::string::npos) {
							gotbod = true;
							pos = fnd + opent.size();
						}
		} else
			std::cout << "Cannot open file " << fn << '\n';
	}
}

This will work with multiple specified file names.

seeplus (6599)

If you want it to work with file name(s) specified by the user at run-time, then the easiest way is to put them on the command line after the program name:

#include <iostream>
#include <string>
#include <utility>
#include <fstream>

int main(int argc, char* argv[])
{
	const std::string opent {"<BODY>"};
	const std::string closet {"</BODY>"};

	for (int a = 1; a < argc; ++a) {
		std::ifstream ifs(argv[a]);

		if (ifs) {
			std::string body;

			for (auto [text, gotbod] {std::pair {std::string{}, false}}; std::getline(ifs, text); )
				for (size_t fnd {}, pos {}; fnd != std::string::npos; )
					if (gotbod)
						if (fnd = text.find(closet, pos); fnd != std::string::npos) {
							gotbod = false;
							body += text.substr(pos, fnd - pos);
							pos = fnd + closet.size();	// resume just past the closing tag
							std::cout << body << '\n';
							body.clear();
						} else
							body += text.substr(pos) + "\n";
					else
						if (fnd = text.find(opent, pos); fnd != std::string::npos) {
							gotbod = true;
							pos = fnd + opent.size();
						}
		} else
			std::cout << "Cannot open file " << argv[a] << '\n';
	}
}


reuters.exe myfile1.txt myfile2.txt

If you want to iterate over files in a folder, then you'll need to specify which ones: all of them, those ending in .txt or .xml, those starting with "file", etc.?

KareemRj (34)

The thing is, I have to read 21 files of this type in sequence and sort the most repeated words in a chat like format. These 21 files have the format written above and are required to be read from "<BODY>" to "</BODY>". A class is also required, and I don't have an idea what to use the class for. I thought about stacks. Is it a good idea?
Thanks.
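
One place a class might fit is the word counting itself. The following is only a sketch under assumptions: the class name WordCounter is invented, a "word" is taken to be a run of letters compared case-insensitively, and standard containers are assumed to be allowed:

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical class: accumulates word counts over several bodies of text.
class WordCounter
{
public:
    // Split the text on non-letters, lower-case each word, and count it.
    void add_text(const std::string& text)
    {
        std::string word;
        for (unsigned char c : text) {
            if (std::isalpha(c)) word += static_cast<char>(std::tolower(c));
            else flush(word);
        }
        flush(word);
    }

    // The n most frequent words, most frequent first (ties stay alphabetical).
    std::vector<std::pair<std::string, int>> top(std::size_t n) const
    {
        std::vector<std::pair<std::string, int>> v(counts_.begin(), counts_.end());
        std::stable_sort(v.begin(), v.end(),
                         [](const auto& a, const auto& b) { return a.second > b.second; });
        if (v.size() > n) v.resize(n);
        return v;
    }

private:
    // Count the pending word, if any, and reset it.
    void flush(std::string& word)
    {
        if (!word.empty()) { ++counts_[word]; word.clear(); }
    }

    std::map<std::string, int> counts_;
};
```

The idea would be to call add_text once per extracted body and then print the pairs returned by top. A stack is not needed for this part, since counting does not depend on tag nesting.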

JLBorges (13770)

> is there anyway to read from a folder that has multiple files?
> (File name is to be inserted by the user and it should loop through
> numerically sorted files (i.e file1, file2, file3 ... etc..)

Something like this (without the class that the career teacher would like to see).

#include <iostream>
#include <string>
#include <fstream>
#include <regex>
#include <iterator>
#include <vector>
#include <filesystem>
#include <algorithm>

std::string extract_body( std::string str )
{
    // with the default ECMAScript grammar, '.' does not match a new line
    // (std::regex::multiline would not help: it only changes how ^ and $ behave)
    // work around: replace new lines with ASCII NAK (we assume that str does not contain NAK characters)
    constexpr char NAK = 21 ;
    for( char& c : str ) if( c == '\n' ) c = NAK ;
    static const std::regex body_re( "\\<BODY\\>(.*)\\</BODY\\>" ) ;

    std::smatch match ;
    if( std::regex_search( str, match, body_re ) )
    {
        std::string body( match[1] ) ;
        // restore the new lines and return the result
        for( char& c : body ) if( c == NAK ) c = '\n' ;
        return body ;
    }
    else return {} ; // no match; return an empty string
}

// get the contents of the text file as a string
std::string get_file_text( const std::string& path )
{
    if( std::ifstream file{path} )
        return { std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>{} } ;
    else return {} ; // failed to open file; return an empty string
}

// get a list of regular files in the directory with names like file_name1, file_name2, file_name3 etc.
std::vector<std::string> get_matching_file_names( const std::string& directory, const std::string& file_name_base )
{
    std::vector<std::string> result ;

    try
    {
        namespace file_sys = std::filesystem ;
        // file_name_base followed by one or more decimal digits
        // for brevity, this code assumes that file_name_base does not contain special regex characters like [ etc.
        // (not static: the pattern depends on the argument, so it is rebuilt on every call)
        const std::regex file_name_re( file_name_base + "\\d+" ) ;

        for( const auto& de : file_sys::directory_iterator(directory) )
        {
             const std::string file_name = de.path().filename().string() ;
             if( file_sys::is_regular_file( de.path() ) && std::regex_match( file_name, file_name_re ) ) // if the pattern matches
                result.push_back( de.path().string() ) ; // add the full path, so the file can be opened from any working directory
        }
    }
    catch( const std::exception& ) {} // directory iteration failed, return the empty result

    return result ;
}

int main()
{
    // 1. get the file name base and directory name from the user
    std::string file_name_base ;
    std::cout << "file name base: " ;
    std::cin >> file_name_base ;

    std::string dir_name ;
    std::cout << "in directory: " ;
    std::cin >> dir_name ;

    // 2. get the list of files which match the pattern
    auto file_names = get_matching_file_names( dir_name, file_name_base ) ;


    // 3a. helper to retrieve the number at the end of a (valid expected) file name
    const auto number = [] ( const std::string& file_name )
    {
        const auto pos = file_name.find_last_not_of( "0123456789" ) ;
        return std::stoi( file_name.substr( pos == std::string::npos ? 0 : pos+1 ) ) ;
    };

    // 3b. sort the file names on the number at the end of the file name (note: this is not the most efficient)
    std::sort( file_names.begin(), file_names.end(),
               [&number]( const auto& a, const auto& b ) { return number(a) < number(b) ; } ) ;

    // 4. print the content between the tags <BODY> and </BODY> in each of the files
    for( const auto& fname : file_names )
    {
        const auto body = extract_body( get_file_text(fname) ) ;
        if( !body.empty() ) std::cout << "file: " << fname << "\nbody: " << body << "\n\n" ;
    }
}

KareemRj (34)

JLBorges, thanks, but I am using Visual Studio 2019 and it is telling me that filesystem should have a namespace, and that namespace requires an identifier. (I am only in my second year of computer engineering, so I apologize for any stupidity coming out of me.)

Ganado (6814)

Try changing your compiler flag to use the 'latest' version.
/std:c++latest
https://docs.microsoft.com/en-us/cpp/build/reference/std-specify-language-standard-version?view=msvc-160

deleted account xyzzy (5768)

> I am using visual studio 2019

VS 2019 defaults to using C++14 if you don't manually set the language standard. You have to change it to either C++17 or latest to get JLBorges' example to compile.

I use VS 2019 also, with /std:c++17 set, and the example code compiles without any errors or warnings for me.

There is a way to change the default settings so every new C++ solution/project you create defaults to C++17 or later if you want.

http://www.cplusplus.com/forum/lounge/271176/

The info in that post sure helped me, now any new C++ project I start "defaults" to C++17 instead of C++14.
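
If it is unclear which standard a project is really being built with, a small check like the sketch below can help. The helper names are made up; the one MSVC-specific detail is that __cplusplus stays at 199711L unless /Zc:__cplusplus is given, so _MSVC_LANG is read there instead:

```cpp
// Hypothetical helper: the language-standard value the compiler reports.
// MSVC keeps __cplusplus at 199711L by default, so _MSVC_LANG is used when present.
constexpr long language_standard()
{
#if defined(_MSVC_LANG)
    return _MSVC_LANG;
#else
    return __cplusplus;
#endif
}

// True when the build uses C++17 or later, which std::filesystem needs.
constexpr bool has_cpp17() { return language_standard() >= 201703L; }
```

Adding static_assert(has_cpp17(), "build with /std:c++17 or later"); near the top of a source file would turn a wrong setting into one clear compile error.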

KareemRj (34)

The file to read is 1000+ characters long, and I can only use pointers and arrays to read from it; no other libraries, such as vectors and regex, are accepted. I'm actually not that good with pointers. Help is needed :) Thank you
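
A rough sketch of what a pointers-and-arrays version could look like, assuming <cstring> and <fstream> are among the permitted headers (the post does not say). The helper names extract_body and read_file, and the idea of reading the file into one fixed-size char array, are illustrative choices:

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>

// Hypothetical helper: copies the text between <BODY> and </BODY> in the
// null-terminated array `text` into `out` (capacity `out_size`, terminator included).
// Returns false if a tag is missing or `out` is too small.
bool extract_body(const char* text, char* out, std::size_t out_size)
{
    const char* const open_tag = "<BODY>";
    const char* const close_tag = "</BODY>";

    const char* begin = std::strstr(text, open_tag);     // points at the '<' of <BODY>
    if (begin == nullptr) return false;
    begin += std::strlen(open_tag);                       // step past the opening tag

    const char* end = std::strstr(begin, close_tag);      // points at the '<' of </BODY>
    if (end == nullptr) return false;

    const std::size_t length = static_cast<std::size_t>(end - begin);
    if (length + 1 > out_size) return false;              // no room for text plus '\0'

    std::memcpy(out, begin, length);
    out[length] = '\0';
    return true;
}

// Hypothetical helper: reads at most size - 1 characters of the file into
// `buffer` and null-terminates it. Returns the number of characters read.
std::size_t read_file(const char* path, char* buffer, std::size_t size)
{
    if (size == 0) return 0;
    buffer[0] = '\0';
    std::ifstream file(path);
    if (!file) return 0;                                   // could not open the file
    file.read(buffer, static_cast<std::streamsize>(size - 1));
    const std::size_t count = static_cast<std::size_t>(file.gcount());
    buffer[count] = '\0';
    return count;
}
```

With two arrays such as char text[4096] and char body[4096], the flow would be read_file(name, text, sizeof text) followed by extract_body(text, body, sizeof body). The array size has to be at least as large as the biggest file plus one character.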

lastchance (6980)

If your files are called (say)
file1 file2 file3 file4 ...
then running
progname.exe file*
from the Windows command line with the data files in that directory will have the operating system automatically expand this as
progname.exe file1 file2 file3 file4 ...
and you can use the

int main( int argc, char **argv )

form to pick off your filenames in argv[] without having to do anything with the filesystem in c++.

For example, if test.cpp is

#include <iostream>
using namespace std;

int main( int argc, char **argv )
{
   for ( int i = 1; i < argc; i++ ) cout << argv[i] << '\n';
}

then compiling as test.exe and running from the command line as
test *.cpp
will list all .cpp files in my current folder. (Pretty well the same as dir *.cpp would do, I know.)

What you do with the processing of each file depends on what "I only can use pointers and arrays" means.

thmm (703)

@lastchance,
it doesn't work on Windows 7.

argc == 2 and argv[1] = file*.txt

Ganado (6814)

I don't think it's the operating system, it depends on which runtime is used. But I don't know what the default is for VS2019, or if it can be changed.

Anyway, if our code doesn't help you, then post your own attempt and then we can help you with that. Otherwise, this thread is a waste of time.

seeplus (6599)

Yep - the Windows command parser doesn't know you're specifying a partial file name. Linux shell expands wildcard chars but Windows doesn't (neither 7 nor 10).

lastchance (6980)

seeplus wrote:
Yep - the Windows command parser doesn't know you're specifying a partial file name. Linux shell expands wildcard chars but Windows doesn't (neither 7 nor 10).

EDIT. It works if I compile it with g++, not if I compile it with cl.exe. No idea what the difference is when the command is issued.

Open a standard command prompt (no, nothing to do with Visual Studio). Navigate to the relevant folder and with the code above:
test.exe *.cpp

It works perfectly well on all three machines I've tried it on - one Windows 7, two Windows 10. Lists all .cpp files.

Output from my current junk folder (looks like most of these aren't going to end up in my cpp archive!):

ana.cpp
particleSwarmOptimisation.cpp
test.cpp
test2.cpp
test3.cpp
try.cpp
