Hello, I've been trying to read a certain document from a .sgm file. It has tags and characters like "<" and ">" which should not be printed to the console, and I still haven't figured out how to remove them from the text I am reading from the file. Any insight?
Basically, the article the code is supposed to read sits between two strings, "<BODY>" and "</BODY>". The code has to read what is between those two strings.
This is a problem that can be huge or trivial.
If you are doing XML or HTML type work, you may want a full answer, not a partial one, that can handle many tags, nested tags, and much complexity -- there are libraries to handle these common formats.
If you are dealing with a custom / hand-rolled / unpopular format, and can't find a tool, but need to deal with complexity, you can use C++'s regular expressions and your own logic to unravel the tags. Typically it will work much like a code indenter or bracket checker -- you use a stack to track your tags, and if the stack has left-overs they were mismatched.
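A rough sketch of that stack idea, in case it helps. The function name tags_balanced and the tag pattern are my own assumptions: it only understands tags of the form <NAME ...> and </NAME>, and knows nothing about comments, CDATA, or self-closing tags.

```cpp
#include <regex>
#include <stack>
#include <string>

// returns true if every <NAME> has a matching </NAME> in the right order
// assumption: tags look like <NAME ...> or </NAME>; no comments, CDATA or self-closing tags
bool tags_balanced( const std::string& text )
{
    static const std::regex tag_re( "<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>" ) ;
    std::stack<std::string> open_tags ;

    for( std::sregex_iterator it( text.begin(), text.end(), tag_re ), end ; it != end ; ++it )
    {
        const bool closing = (*it)[1].length() > 0 ;
        const std::string name = (*it)[2].str() ;

        if( !closing ) open_tags.push(name) ; // remember the opening tag
        else
        {
            // a closing tag must match the most recently opened tag
            if( open_tags.empty() || open_tags.top() != name ) return false ;
            open_tags.pop() ;
        }
    }

    return open_tags.empty() ; // left-overs on the stack were never closed
}
```

Calling tags_balanced on a whole file's text tells you whether the tags nest properly; a real parser would also have to report where the mismatch is.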
If it is a simple file with a couple of tags and nothing too complicated, raw string processing is very fast and easy and can do the job. This is what the example above does, and if your problem description is the real problem in full, it is the right way to do it.
Your title is odd, what about OOP were you thinking? You can use an object to parse, but it isn't necessary (C++ allows code without objects). I can't think of a reason to make an object here, but there is no reason not to, either :)
Right, I should have mentioned that my code only handles the simplest case of <begin delimiter>other text that isn't a delimiter<end delimiter>. It doesn't handle nested or duplicate tags, check for parsing errors, or handle more complicated things like CDATA sections.
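For reference, that simplest case can be sketched with plain std::string::find and substr. The name text_between is hypothetical, and this is only a sketch of the idea, not the code referred to above: it takes the first opening delimiter and the next closing delimiter and ignores everything else.

```cpp
#include <string>

// returns the text between the first open_tag and the next close_tag,
// or an empty string if either delimiter is missing
std::string text_between( const std::string& str, const std::string& open_tag, const std::string& close_tag )
{
    const auto start = str.find(open_tag) ;
    if( start == std::string::npos ) return {} ; // no opening delimiter

    const auto from = start + open_tag.size() ;  // first character after the opening delimiter
    const auto to = str.find( close_tag, from ) ;
    if( to == std::string::npos ) return {} ;    // no closing delimiter

    return str.substr( from, to - from ) ;
}
```

With the whole file read into a string, text_between( text, "<BODY>", "</BODY>" ) would give the article body.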
A true XML/HTML parser requires a stack (or equivalent) to keep track of a state. Search "C++ XML parser" for a variety of open-source libraries to choose from.
<REUTERS ... >
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> ... </UNKNOWN>
<TEXT> ...
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE> SALVADOR, Feb 26 - </DATELINE>
<BODY>Showers continued throughout
the week in the Bahia cocoa zone, alleviating the drought since
...
...
Brazilian Cocoa Trade Commission after
carnival which ends midday on February 27.
Reuter
</BODY></TEXT>
</REUTERS>
This is the file the code is supposed to read from, between the tags <BODY> and </BODY>. This is a term project and specific IO libraries have been issued, so I just need the simple concept to implement :D. Thanks in advance.
Obviously I am not that experienced with programming since I am a new student, but is there also any way to read from a folder that has multiple files? I know a for loop should be used, but I still don't know how to loop over a const string which is the file name. (The file name is to be entered by the user and the program should loop through numerically sorted files, i.e. file1, file2, file3 ... etc.)
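On the loop over numbered file names: one simple way to think about it is to keep the base name fixed and append the loop counter with std::to_string. This is only a sketch; numbered_file_names is a made-up helper, and it assumes the files really are called base1, base2, ... with no extension.

```cpp
#include <string>
#include <vector>

// builds base1, base2, ..., baseN (for example "file1", "file2", "file3")
std::vector<std::string> numbered_file_names( const std::string& base, int count )
{
    std::vector<std::string> names ;
    for( int i = 1 ; i <= count ; ++i ) names.push_back( base + std::to_string(i) ) ;
    return names ;
}
```

Each generated name can then be passed to std::ifstream in turn, and the files come out in numeric order because the counter does.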
#include <iostream>
#include <string>
#include <regex>
#include <iterator>
#include <fstream>
// extract the contents between <BODY> and </BODY> in the multiline string str
std::string extract_body( std::string str )
{
    // this regex library may not support std::regex::multiline
    // work around: replace new lines with ASCII NAK (we assume that str does not contain NAK characters)
    constexpr char NAK = 21 ;
    for( char& c : str ) if( c == '\n' ) c = NAK ;

    static const std::regex body_re( "\\<BODY\\>(.*)\\</BODY\\>" ) ;
    std::smatch match ;
    if( std::regex_search( str, match, body_re ) )
    {
        std::string body( match[1] ) ;
        // restore the new lines and return the result
        for( char& c : body ) if( c == NAK ) c = '\n' ;
        return body ;
    }
    else return {} ; // no match; return an empty string
}

// get the contents of the text file as a string
std::string get_file_text( const std::string& path )
{
    if( std::ifstream file{path} )
        return { std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>{} } ;
    else return {} ; // failed to open file; return an empty string
}

int main()
{
    {
        // create a test file
        std::ofstream( "test.xml" ) <<
R"(<REUTERS ... >
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> ... </UNKNOWN>
<TEXT> ...
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE> SALVADOR, Feb 26 - </DATELINE>
<BODY>Showers continued throughout
the week in the Bahia cocoa zone, alleviating the drought since
...
...
Brazilian Cocoa Trade Commission after
carnival which ends midday on February 27.
Reuter
</BODY></TEXT>
</REUTERS>
)" ;
    }

    // extract the body and display it
    const std::string body = extract_body( get_file_text( "test.xml" ) ) ;
    std::cout << body << '\n' ;
}
If you want it to work with file name(s) specified by the user at run-time, then the easiest is to just put them on the command line after the command, then:
#include <iostream>
#include <string>
#include <utility>
#include <fstream>
int main(int argc, char* argv[])
{
    const std::string opent {"<BODY>"};
    const std::string closet {"</BODY>"};

    for (int a = 1; a < argc; ++a) {
        std::ifstream ifs(argv[a]);

        if (ifs) {
            std::string body;

            // read line by line; gotbod is true while we are inside a <BODY> ... </BODY> section
            for (auto [text, gotbod] {std::pair {std::string{}, false}}; std::getline(ifs, text); )
                for (size_t fnd {}, pos {}; fnd != std::string::npos; )
                    if (gotbod)
                        if (fnd = text.find(closet, pos); fnd != std::string::npos) {
                            gotbod = false;
                            body += text.substr(pos, fnd - pos);
                            pos = fnd + closet.size();   // continue scanning just after the closing tag
                            std::cout << body << '\n';
                            body.clear();
                        } else
                            body += text.substr(pos) + "\n";
                    else if (fnd = text.find(opent, pos); fnd != std::string::npos) {
                        gotbod = true;
                        pos = fnd + opent.size();
                    }
        } else
            std::cout << "Cannot open file " << argv[a] << '\n';
    }
}
eg
reuters.exe myfile1.txt myfile2.txt
If you want to iterate over files in a folder, then you'll need to specify which ones - all, those ending in .txt? .xml ? those starting with file etc etc ??
The thing is, I have to read 21 files of this type in sequence and sort the most repeated words in a chat like format. These 21 files have the format shown above and have to be read from "<BODY>" to "</BODY>". A class is required to be implemented, and I don't have an idea what to use the class for. I thought about stacks. Is that a good idea?
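One possible use for a class here, purely as a sketch: let it own the word counts, so the rest of the program only feeds it body text and asks for the most frequent words. WordCounter and its member names are invented for illustration, and it assumes std::map and std::vector are allowed and that a "word" is simply a run of letters.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

class WordCounter
{
public:
    // split text on non-letters, lower-case each word and count it
    void add_text( const std::string& text )
    {
        std::string word ;
        for( unsigned char c : text )
        {
            if( std::isalpha(c) ) word += static_cast<char>( std::tolower(c) ) ;
            else if( !word.empty() ) { ++counts_[word] ; word.clear() ; }
        }
        if( !word.empty() ) ++counts_[word] ; // last word, if the text ends with a letter
    }

    // the n most frequent words, most frequent first (ties stay in alphabetical order)
    std::vector< std::pair<std::string,int> > most_frequent( std::size_t n ) const
    {
        std::vector< std::pair<std::string,int> > result( counts_.begin(), counts_.end() ) ;
        std::stable_sort( result.begin(), result.end(),
                          []( const auto& a, const auto& b ) { return a.second > b.second ; } ) ;
        if( result.size() > n ) result.resize(n) ;
        return result ;
    }

private:
    std::map<std::string,int> counts_ ; // word -> number of occurrences
};
```

The idea would be to call add_text once per extracted body and most_frequent at the end. A stack does not really fit this job, since nothing here needs last-in, first-out order.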
Thanks.
> is there anyway to read from a folder that has multiple files?
> (File name is to be inserted by the user and it should loop through
> numerically sorted files (i.e file1, file2, file3 ... etc..)
Something like this (without the class that the career teacher would like to see).
#include <iostream>
#include <string>
#include <fstream>
#include <regex>
#include <iterator>
#include <vector>
#include <filesystem>
#include <algorithm>
std::string extract_body( std::string str )
{
    // this regex library may not support std::regex::multiline
    // work around: replace new lines with ASCII NAK (we assume that str does not contain NAK characters)
    constexpr char NAK = 21 ;
    for( char& c : str ) if( c == '\n' ) c = NAK ;

    static const std::regex body_re( "\\<BODY\\>(.*)\\</BODY\\>" ) ;
    std::smatch match ;
    if( std::regex_search( str, match, body_re ) )
    {
        std::string body( match[1] ) ;
        // restore the new lines and return the result
        for( char& c : body ) if( c == NAK ) c = '\n' ;
        return body ;
    }
    else return {} ; // no match; return an empty string
}

// get the contents of the text file as a string
std::string get_file_text( const std::string& path )
{
    if( std::ifstream file{path} )
        return { std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>{} } ;
    else return {} ; // failed to open file; return an empty string
}
// get a list of regular files in the directory with names like file_name1, file_name2, file_name3 etc.
std::vector<std::string> get_matching_file_names( const std::string& directory, const std::string& file_name_base )
{
    std::vector<std::string> result ;

    try
    {
        namespace file_sys = std::filesystem ;

        // file_name_base followed by one or more decimal digits
        // for brevity, this code assumes that file_name_base does not contain special regex characters like [ etc.
        // (not static: the pattern depends on the argument, so it must be rebuilt on every call)
        const std::regex file_name_re( file_name_base + "\\d+" ) ;

        for( const auto& de : file_sys::directory_iterator(directory) )
        {
            const std::string file_name = de.path().filename().string() ;
            if( file_sys::is_regular_file( de.path() ) && std::regex_match( file_name, file_name_re ) ) // if the pattern matches
                result.push_back( de.path().string() ) ; // add the full path, so the file can be opened from any working directory
        }
    }
    catch( const std::exception& ) {} // directory iteration failed, return the empty result

    return result ;
}
int main()
{
    // 1. get the file name base and directory name from the user
    std::string file_name_base ;
    std::cout << "file name base: " ;
    std::cin >> file_name_base ;

    std::string dir_name ;
    std::cout << "in directory: " ;
    std::cin >> dir_name ;

    // 2. get the list of files which match the pattern
    auto file_names = get_matching_file_names( dir_name, file_name_base ) ;

    // 3a. helper to retrieve the number at the end of a (valid expected) file name
    const auto number = [] ( const std::string& file_name )
    {
        const auto pos = file_name.find_last_not_of( "0123456789" ) ;
        return std::stoi( file_name.substr( pos == std::string::npos ? 0 : pos+1 ) ) ;
    };

    // 3b. sort the file names on the number at the end of the file name (note: this is not the most efficient)
    std::sort( file_names.begin(), file_names.end(),
               [&number]( const auto& a, const auto& b ) { return number(a) < number(b) ; } ) ;

    // 4. print the content between the tags <BODY> and </BODY> in each of the files
    for( const auto& fname : file_names )
    {
        const auto body = extract_body( get_file_text(fname) ) ;
        if( !body.empty() ) std::cout << "file: " << fname << "\nbody: " << body << "\n\n" ;
    }
}
JLBorges, thanks, but I am using Visual Studio 2019 and it is giving me errors saying that filesystem should have a namespace, and that namespace requires an identifier. (I am only in my second year of computer engineering, so I apologize for any stupidity coming out of me.)
VS 2019 defaults to using C++14 if you don't manually set the language standard. You have to change it to either C++17 or latest to get JLBorges' example to compile.
I use VS 2019 also, with /std:c++17 set, and the example code compiles without any errors or warnings for me.
There is a way to change the default settings so every new C++ solution/project you create defaults to C++17 or later if you want.
The file to read is 1000+ characters long, and I can only use pointers and arrays to read from it; other libraries such as vector and regex are not accepted. I am actually not that good with pointers. Help is needed :) Thank you.
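If "pointers and arrays only" still allows the C string functions from <cstring> (that is an assumption, check with your instructor), the extraction could be sketched like this. copy_body is a made-up name; it expects the whole file to already be in a NUL-terminated char array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// copies the text between <BODY> and </BODY> from the NUL-terminated array text
// into out (capacity out_size, including the terminating NUL)
// returns the number of characters copied; 0 if the tags are missing; truncates if out is too small
std::size_t copy_body( const char* text, char* out, std::size_t out_size )
{
    assert( text != nullptr && out != nullptr ) ;
    if( out_size == 0 ) return 0 ;
    out[0] = '\0' ;

    const char* begin = std::strstr( text, "<BODY>" ) ;   // pointer to the opening tag
    if( begin == nullptr ) return 0 ;
    begin += std::strlen( "<BODY>" ) ;                    // step over the tag itself

    const char* end = std::strstr( begin, "</BODY>" ) ;   // pointer to the closing tag
    if( end == nullptr ) return 0 ;

    std::size_t n = static_cast<std::size_t>( end - begin ) ; // pointer difference = body length
    if( n > out_size - 1 ) n = out_size - 1 ;             // leave room for the NUL
    std::memcpy( out, begin, n ) ;
    out[n] = '\0' ;
    return n ;
}
```

The file itself can be read into a sufficiently large char array first (for example with std::ifstream::read), followed by a '\0' after the last character read, before calling copy_body.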
If your files are called (say)
file1 file2 file3 file4 ...
then running progname.exe file*
from the Windows command line with the data files in that directory will have the operating system automatically expand this as progname.exe file1 file2 file3 file4 ...
and you can use the
int main( int argc, char **argv )
form to pick off your filenames in argv[] without having to do anything with the filesystem in c++.
For example, if test.cpp is
#include <iostream>
using namespace std;

int main( int argc, char **argv )
{
    for ( int i = 1; i < argc; i++ ) cout << argv[i] << '\n';
}
then compiling as test.exe and running from the command line as test *.cpp
will list all .cpp files in my current folder. (Pretty well the same as dir *.cpp would do, I know.)
What you do with the processing of each file depends on what "I only can use pointers and arrays" means.
I don't think it's the operating system, it depends on which runtime is used. But I don't know what the default is for VS2019, or if it can be changed.
Anyway, if our code doesn't help you, then post your own attempt and then we can help you with that. Otherwise, this thread is a waste of time.
Yep - the Windows command parser doesn't know you're specifying a partial file name. Linux shell expands wildcard chars but Windows doesn't (neither 7 nor 10).
EDIT. It works if I compile it with g++, not if I compile it with cl.exe. No idea what the difference is when the command is issued.
Open a standard command prompt (no, nothing to do with Visual Studio). Navigate to the relevant folder and with the code above: test.exe *.cpp
It works perfectly well on all three machines I've tried it on - one Windows 7, two Windows 10. Lists all .cpp files.
Output from my current junk folder (looks like most of these aren't going to end up in my cpp archive!):