Parsing text

Jan 30, 2014 at 11:09am

How do you parse text efficiently?

I looked up how to do it, and all of the ways I found seemed quite long-winded and specific to certain text.

Say I wanted to get 'Hello' out of this line:

<baseQueueName>Hello</baseQueueName>

Jan 30, 2014 at 11:38am

TwilightSpectre (1392)

How you parse text depends entirely on what you want to get out of it. Different data formats require different parsing; there is no 'one size fits all'.

As for your example there, you could use an istream_iterator to check each character (or an istreambuf_iterator, which does not skip whitespace), setting values based on what you find. For example, when you find a '<' sign, you iterate to the '>' and store the key in, say, a std::map (not ideal, but sufficient for your example). You then read all the data you come across as the value until you reach the closing tag for your key ('<' '/' "key" '>'). Hope this helps you get started!

Jan 30, 2014 at 1:30pm

jidder (139)

Would you be able to post the code to do this as an example, please?

Thanks

Jan 30, 2014 at 4:07pm

Yanson (885)

As NT3 said, there are many ways to do this.

#include <iostream>
#include <string>
#include <sstream>
using namespace std;

int main()
{
    string parse = "<baseQueueName>Hello</baseQueueName>";
    stringstream ss(parse);
    string temp = "";

    getline(ss, temp, '>');
    getline(ss, temp, '<');

    cout << "temp: " << temp << endl;

    cin.ignore();
    return 0;
}


Last edited on Jan 30, 2014 at 4:10pm

Jan 30, 2014 at 4:56pm

JLBorges (13770)

> Would you be able to post the code to do this as an example please?

Here is an example using the regular expression library.
(Learn regex, you will be richly rewarded for your effort. Many times over).

#include <iostream>
#include <string>
#include <regex>
#include <utility>

// extract tag and value from <tag>value</tag> and return a pair of strings{tag,value}
std::pair<std::string,std::string> extract( const std::string& text )
{
    // a very simple regex

    // "\S+" => a non-whitespace character, repeated one or more times
    // "\s*" => a whitespace character, repeated zero or more times
    // "/\1" => a slash, followed by the first subgroup that was captured
    //    i.e. if the first subgroup "(\S+)" was "Name", then "/Name"

    // match "<tag>value</tag>", in essence "<(\S+)>(\S+)</\1>",
    // with zero or more whitespace characters (\s*) allowed between components
    std::regex re( R"#(\s*<\s*(\S+)\s*>\s*(\S+)\s*<\s*/\1\s*>\s*)#"  ) ; // raw string literal
    std::smatch match_results ;
    
    if( std::regex_match( text, match_results, re ) && match_results.size() == 3 )
        return { match_results[1].str(), match_results[2].str() } ;
        // the first subgroup captured is the tag, the second one is the value
    
    return {} ; // failed
}


int main()
{
    const std::string text[] = { "<baseQueueName>Hello</baseQueueName>", 
                                 "< baseQueueName > Hello < /baseQueueName > ",
                                 "<amount>123.45</amount>", "  <  amount  >  123.45  </amount  >  ",
                                 "<baseQueueName>Hello</QueueName>", "<amount>123.45</Name>"};   
                                 
    for( const std::string& txt : text )
    {
         const auto pair = extract(txt) ;
         static const char quote = '"' ;
         std::cout << "text: " << quote << txt << quote ;
         if( !pair.first.empty() )
         {
            std::cout << "    tag: " << quote << pair.first << quote  
                      << "    value: " << quote << pair.second << quote << "\n\n" ;
         }
         else std::cerr << "    **** badly formed string ****\n\n" ;
    }
}


http://coliru.stacked-crooked.com/a/619a66c9259a64dd

Jan 31, 2014 at 9:12am

jidder (139)

Thanks for the code, guys.

I'll definitely look into learning regex as well :)

Jan 31, 2014 at 1:45pm

jidder (139)

Yanson

Could you explain how the code you posted works please ?

Thanks

Jan 31, 2014 at 2:30pm

Yanson (885)

http://www.cplusplus.com/reference/sstream/stringstream/?kw=stringstream
http://www.cplusplus.com/reference/string/string/getline/?kw=getline

stringstream ss(parse); This part of my code loads the string parse into a stringstream that I declare as ss. If you're not familiar with stringstreams, you can think of ss as behaving like cin, except that it reads from the string instead of the keyboard. I use a stringstream mostly so I can use getline to extract the data, which seemed the easiest way.

getline(ss, temp, '>'); getline extracts all characters from ss, up to and including the '>'. This string of characters is stored in temp, except for the '>' which is discarded. So temp at this point is equal to <baseQueueName. We don't want this so we do nothing with it.

getline(ss, temp, '<'); Here I use getline a second time to extract all characters up to and including the next '<'. These characters are stored in temp, overwriting the previous string, and the '<' is discarded. temp now equals Hello.

Last edited on Jan 31, 2014 at 2:41pm

Jan 31, 2014 at 2:52pm

JLBorges (13770)

To parse the text correctly, we need to do a bit more than the posted code.

Something like this (the code would be a lot shorter, and won't contain the ugly deeply-nested ifs, if the running commentary explaining the code is not needed):

#include <iostream>
#include <string>
#include <sstream>

// return trimmed (leading and trailing spaces removed) string
std::string trim( const std::string& str )
{
    static const std::string ws = " \t\n" ;
    auto first = str.find_first_not_of(ws) ;
    auto last = str.find_last_not_of(ws) ;
    return first == std::string::npos ? "" : str.substr( first, last-first+1 ) ;
}

int main()
{
    const std::string text = "    < baseQueueName >  Hello  < /baseQueueName >  " ;
    std::istringstream stm(text) ;

    std::string left_tag ;
    std::string value ;
    std::string right_tag ;

    // step 1. read the first non whitespace character
    // if text is well formed, the character should be '<'
    char ch ;
    if( stm >> ch && ch == '<' )
    {
        std::cout << "step 1 ok. got: <\n" ;
        // step 2. read everything from there on till a '>' is found
        //         as the tag on the left;  discard the '>'
        // http://www.cplusplus.com/reference/string/string/getline/
        if( std::getline( stm, left_tag, '>' ) )
        {
            std::cout << "step 2 ok. got left_tag: " << left_tag << '\n' ;
            // step 3. read the value up to the next '<', discard the '<'
            if( std::getline( stm, value, '<' ) )
            {
                std::cout << "step 3 ok. got value: " << value << '\n' ;
                // step 4. read and discard the next '/' skipping over white space
                if( stm >> ch && ch == '/')
                {
                    std::cout << "step 4 ok. got: /\n" ;
                    // step 5. read everything from there on till a '>' is found
                    //         as the tag on the right
                    if( std::getline( stm, right_tag, '>' ) && !stm.eof() )
                    {
                        std::cout << "step 5 ok. got right_tag: " << right_tag << '\n' ;

                        // step 6. we have got the raw data now; sanitize it
                        left_tag = trim(left_tag) ;
                        value = trim(value) ;
                        right_tag = trim(right_tag) ;
                        std::cout << "step 6 trim ok.\n\t left tag: " << left_tag
                                  << "\n\tvalue: " << value
                                  << "\n\tright tag: " << right_tag << '\n' ;
                        // step 7. verify that the tags match
                        if( left_tag == right_tag )
                        {
                            std::cout << "step 7 ok. the left and right tags match\n" ;
                            std::cout << "--------------------------\n *** success ***\n"
                                      << "tag: '" << left_tag << "'\n"
                                      << "value: '" << value << "'\n"
                                      << "------------------------\n" ;
                        }
                    }
                }
            }
        }
    }
}


http://coliru.stacked-crooked.com/a/0985cb779a12a7d1

Jan 31, 2014 at 2:54pm

jidder (139)

I already know (well, I think I do) how stringstreams and getline work.

Is the string removed from ss, then?

Because after the first step, if I check the string in ss using

cout << ss.str();

it still shows the whole

<baseQueueName>Hello</baseQueueName>

Also, if I remove

getline(ss, temp, '>');

then it outputs

temp:

so it appears as though temp is still empty, even though

getline(ss, temp, '<'); is still there.

If you could shed some light on these, I would appreciate it.

Thanks for the help so far, anyway.

Jan 31, 2014 at 3:38pm

Yanson (885)

> Is the string removed from ss then ?

No, the string is not removed; only the current read position is updated.
http://www.cplusplus.com/reference/istream/istream/tellg/

#include <iostream>
#include <string>
#include <sstream>
using namespace std;

int main()
{
    string parse = "<baseQueueName>Hello</baseQueueName>";
    stringstream ss(parse);
    string temp = "";

    cout << "current position " << ss.tellg() << endl;
    getline(ss, temp, '>');
    cout << "current position " << ss.tellg() << endl;
    getline(ss, temp, '<');

    cout << "temp: " << temp << endl;

    cin.ignore();
    return 0;
}


> also if i remove getline(ss, temp, '>'); then it outputs temp:

You have to look at how the string is structured: <baseQueueName>Hello</baseQueueName>
The first character in the string is '<'. So if you call getline(ss, temp, '<'); first, getline finds the character it is looking for straight away, extracts the '<' and discards it. No characters are stored in temp because there were no characters before the '<'.

Last edited on Jan 31, 2014 at 3:39pm

Jan 31, 2014 at 3:40pm

jidder (139)

> getline will see that the first character is a '<' this is the character getline is looking for so it extracts the '<' and discards it. No characters are stored in temp because there were no characters before the '<'

Oh yeah, ha, sorry, that was stupid.

Anyway, I think it was the character position updating that caught me out. I had never come across that before, but I'm pretty sure I understand it now :)

Thanks for all the help
