How you parse text depends entirely on what you want to get out of it. Different data formats require different parsing; there is no 'one size fits all'.
As for your example, you could use an istream_iterator to check each character, setting values based on what you find (note that istream_iterator<char> skips whitespace unless std::noskipws is set; std::istreambuf_iterator<char> does not). For example, when you find a '<' sign, you iterate to the '>' and store the key in, say, a std::map (not ideal, but sufficient for your example). You then parse all the data you come across until you reach the closing tag for your key ('<' '/' "key" '>'). Hope this helps you get started!
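A rough sketch of that idea, in case it helps. To keep it short it uses std::string::find instead of an istream_iterator, and the function name parse_pairs is made up for illustration. It assumes flat, well-formed input with no nesting, attributes, or stray whitespace inside the tags.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper (not from the thread): scans `text` for
// <key>value</key> pairs and stores them in a std::map.
// Assumes flat input: no nesting, no attributes, no spaces inside tags.
std::map<std::string, std::string> parse_pairs( const std::string& text )
{
    std::map<std::string, std::string> result ;
    std::size_t pos = 0 ;

    // find the next '<', which starts an opening tag
    while( ( pos = text.find( '<', pos ) ) != std::string::npos )
    {
        // the key is everything between '<' and the next '>'
        const std::size_t key_end = text.find( '>', pos ) ;
        if( key_end == std::string::npos ) break ; // unterminated tag
        const std::string key = text.substr( pos + 1, key_end - pos - 1 ) ;

        // the value runs up to the closing tag for this key
        const std::string closing = "</" + key + ">" ;
        const std::size_t value_end = text.find( closing, key_end + 1 ) ;
        if( value_end == std::string::npos ) break ; // no matching closing tag

        result[key] = text.substr( key_end + 1, value_end - key_end - 1 ) ;
        pos = value_end + closing.size() ; // continue after the closing tag
    }
    return result ;
}
```

Calling parse_pairs("<baseQueueName>Hello</baseQueueName>") should give a map with one entry, baseQueueName mapped to Hello. Anything with mismatched tags simply stops the scan, so real error handling would need more work.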
#include <iostream>
#include <string>
#include <regex>
#include <utility>
// extract tag and value from <tag>value</tag> and return a pair of strings {tag,value}
std::pair<std::string,std::string> extract( const std::string& text )
{
    // a very simple regex
    // "\S+" => non-space character, repeated one or more times
    // "\s*" => space repeated zero or more times
    // "/\1" => a slash, followed by the first subgroup that was captured
    //          ie. if the first subgroup "(\S+)" was "Name", then "/Name"
    // match "<tag>text</tag>": "<(\S+)>(\S+)</\1>"
    // with zero or more spaces (\s*) allowed between components
    std::regex re( R"#(\s*<\s*(\S+)\s*>\s*(\S+)\s*<\s*/\1\s*>\s*)#" ) ; // raw string literal
    std::smatch match_results ;

    // the first subgroup captured is the tag, the second one is the value
    if( std::regex_match( text, match_results, re ) && match_results.size() == 3 )
        return { match_results[1].str(), match_results[2].str() } ;

    return {} ; // failed
}
int main()
{
    const std::string text[] = { "<baseQueueName>Hello</baseQueueName>",
                                 "< baseQueueName > Hello < /baseQueueName > ",
                                 "<amount>123.45</amount>", " < amount > 123.45 </amount > ",
                                 "<baseQueueName>Hello</QueueName>", "<amount>123.45</Name>" } ;

    for( const std::string& txt : text )
    {
        const auto pair = extract(txt) ;
        static const char quote = '"' ;
        std::cout << "text: " << quote << txt << quote ;

        if( !pair.first.empty() )
        {
            std::cout << " tag: " << quote << pair.first << quote
                      << " value: " << quote << pair.second << quote << "\n\n" ;
        }
        else std::cerr << " **** badly formed string ****\n\n" ;
    }
}
stringstream ss(parse); This part of my code loads the string parse into a stringstream that I declare as ss. If you're not familiar with stringstreams, you can think of ss as behaving like cin, except that it reads from the string instead of the keyboard. I use a stringstream mostly so I can use getline to extract the data, which seemed the easiest way.
getline(ss, temp, '>'); getline extracts all characters from ss, up to and including the '>'. This string of characters is stored in temp, except for the '>', which is discarded. So temp at this point is equal to <baseQueueName. We don't want this, so we do nothing with it.
getline(ss, temp, '<'); Here I use getline a second time to extract all characters up to and including the '<'. These characters are stored in temp, overwriting the previous string. The '<' is discarded, so temp now equals Hello.
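Putting those three lines together, here is a minimal sketch. The function name extract_value is made up for illustration, and it assumes the input looks like <tag>value</tag> with nothing in front of the first '<'.

```cpp
#include <sstream>
#include <string>

// Hypothetical wrapper around the two getline calls described above.
// Assumes input of the form <tag>value</tag> with no leading text.
std::string extract_value( const std::string& parse )
{
    std::stringstream ss(parse) ; // load the string into a stream
    std::string temp ;

    // reads up to the first '>' and discards the '>' itself;
    // for the sample input, temp is now "<baseQueueName" (not needed)
    std::getline( ss, temp, '>' ) ;

    // continues from where the previous call stopped, reads up to the
    // next '<' and discards it; for the sample input, temp is now "Hello"
    std::getline( ss, temp, '<' ) ;

    return temp ;
}
```

With the string from this thread, extract_value("<baseQueueName>Hello</baseQueueName>") returns Hello. It does no checking at all, so it is only a starting point.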
To parse the text correctly, we need to do a bit more than the posted code.
Something like this (the code would be a lot shorter, and won't contain the ugly deeply-nested ifs, if the running commentary explaining the code is not needed):
#include <iostream>
#include <string>
#include <sstream>
// return trimmed (leading and trailing spaces removed) string
std::string trim( const std::string& str )
{
    static const std::string ws = " \t\n" ;
    auto first = str.find_first_not_of(ws) ;
    auto last = str.find_last_not_of(ws) ;
    return first == std::string::npos ? "" : str.substr( first, last-first+1 ) ;
}
int main()
{
    const std::string text = " < baseQueueName > Hello < /baseQueueName > " ;
    std::istringstream stm(text) ;
    std::string left_tag ;
    std::string value ;
    std::string right_tag ;

    // step 1. read the first non whitespace character
    // if text is well formed, the character should be '<'
    char ch ;
    if( stm >> ch && ch == '<' )
    {
        std::cout << "step 1 ok. got: <\n" ;

        // step 2. read everything from there on till a '>' is found
        // as the tag on the left; discard the '>'
        // http://www.cplusplus.com/reference/string/string/getline/
        if( std::getline( stm, left_tag, '>' ) )
        {
            std::cout << "step 2 ok. got left_tag: " << left_tag << '\n' ;

            // step 3. read the value up to the next '<', discard the '<'
            if( std::getline( stm, value, '<' ) )
            {
                std::cout << "step 3 ok. got value: " << value << '\n' ;

                // step 4. read and discard the next '/' skipping over white space
                if( stm >> ch && ch == '/')
                {
                    std::cout << "step 4 ok. got: /\n" ;

                    // step 5. read everything from there on till a '>' is found
                    // as the tag on the right
                    if( std::getline( stm, right_tag, '>' ) && !stm.eof() )
                    {
                        std::cout << "step 5 ok. got right_tag: " << right_tag << '\n' ;

                        // step 6. we have got the raw data now; sanitize it
                        left_tag = trim(left_tag) ;
                        value = trim(value) ;
                        right_tag = trim(right_tag) ;
                        std::cout << "step 6 trim ok.\n\t left tag: " << left_tag
                                  << "\n\tvalue: " << value
                                  << "\n\tright tag: " << right_tag << '\n' ;

                        // step 7. verify that the tags match
                        if( left_tag == right_tag )
                        {
                            std::cout << "step 7 ok. the left and right tags match\n" ;
                            std::cout << "--------------------------\n *** success ***\n"
                                      << "tag: '" << left_tag << "'\n"
                                      << "value: '" << value << "'\n"
                                      << "------------------------\n" ;
                        }
                    }
                }
            }
        }
    }
}
Also, if I remove getline(ss, temp, '>'); then it outputs Temp:
You have to look at how the string is structured: <baseQueueName>Hello</baseQueueName>
The first character in the string is '<', so if you use getline like this: getline(ss, temp, '<'); getline will see that the first character is a '<'. This is the character getline is looking for, so it extracts the '<' and discards it. No characters are stored in temp because there were no characters before the '<'.
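If you want to see this for yourself, here is a small sketch. The function name read_twice is made up for illustration. It calls getline twice with '<' as the delimiter and returns both results, so you can see that the first call yields an empty string and the second call carries on from where the first one stopped.

```cpp
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper: calls getline twice with '<' as the delimiter and
// returns both results, to show that the stream position advances.
std::pair<std::string, std::string> read_twice( const std::string& parse )
{
    std::stringstream ss(parse) ;
    std::string first, second ;

    // if the very first character is '<', nothing precedes the delimiter,
    // so `first` ends up empty; the '<' is extracted and discarded
    std::getline( ss, first, '<' ) ;

    // resumes just after that '<' and reads up to the next '<'
    std::getline( ss, second, '<' ) ;

    return { first, second } ;
}
```

For <baseQueueName>Hello</baseQueueName> the first string is empty and the second is baseQueueName>Hello, because the stream remembers how far it has already read.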
Oh yeah, ha, sorry, that was stupid.
Anyway, I think it was the character position updating that caught me out. I've never come across that before, but I'm pretty sure I understand it now :)