Parsing text

Jan 30, 2014 at 11:09am
How do you parse text efficiently?

I looked up how to do it, and all of the ways I found seemed quite long-winded and specific to certain text.

Say I wanted to get 'Hello' out of this line:

<baseQueueName>Hello</baseQueueName>
Jan 30, 2014 at 11:38am
How you parse text depends entirely on what you want to get out of it. Different data formats require different parsing; there is no 'one size fits all'.

As for your example, you could read the input one character at a time with an iterator and set values based on what you find (note that an istream_iterator<char> skips whitespace by default, so an istreambuf_iterator<char> is usually the better fit for this). For example, when you find a '<', you iterate to the '>' and store the key in, say, a std::map (not ideal, but sufficient for your example). You then collect all the data you come across until you reach the closing tag for your key ('<' '/' "key" '>'). Hope this helps you get started!
Jan 30, 2014 at 1:30pm
Would you be able to post the code to do this as an example, please?

Thanks
Jan 30, 2014 at 4:07pm
As NT3 said, there are many ways to do this.

#include <iostream>
#include <string>
#include <sstream>
using namespace std;

int main()
{
    string parse = "<baseQueueName>Hello</baseQueueName>";
    stringstream ss(parse);
    string temp = "";

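    // read up to the first '>' and discard it; temp is now "<baseQueueName"
    getline(ss, temp, '>');
    // read up to the next '<' and discard it; temp is now "Hello"
    getline(ss, temp, '<');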

    cout << "temp: " << temp << endl;

    cin.ignore();
    return 0;
}
Last edited on Jan 30, 2014 at 4:10pm
Jan 30, 2014 at 4:56pm
> Would you be able to post the code to do this as an example please?

Here is an example using the regular expression library.
(Learn regex; you will be richly rewarded for your effort, many times over.)

#include <iostream>
#include <string>
#include <regex>
#include <utility>

// extract tag and value from <tag>value</tag> and return a pair of strings{tag,value}
std::pair<std::string,std::string> extract( const std::string& text )
{
    // a very simple regex

    // "\S+" => a non-whitespace character, repeated one or more times
    // "\s*" => a whitespace character, repeated zero or more times
    // "/\1" => a slash, followed by the first subgroup that was captured
    //    i.e. if the first subgroup "(\S+)" was "Name", then "/Name"

    // match "<tag>text</tag>" with "<(\S+)>(\S+)</\1>",
    // with zero or more spaces (\s*) allowed between components
    std::regex re( R"#(\s*<\s*(\S+)\s*>\s*(\S+)\s*<\s*/\1\s*>\s*)#"  ) ; // raw string literal
    std::smatch match_results ;
    
    if( std::regex_match( text, match_results, re ) && match_results.size() == 3 )
        return { match_results[1].str(), match_results[2].str() } ;
        // the first subgroup captured is the tag, the second one is the value
    
    return {} ; // failed
}


int main()
{
    const std::string text[] = { "<baseQueueName>Hello</baseQueueName>", 
                                 "< baseQueueName > Hello < /baseQueueName > ",
                                 "<amount>123.45</amount>", "  <  amount  >  123.45  </amount  >  ",
                                 "<baseQueueName>Hello</QueueName>", "<amount>123.45</Name>"};   
                                 
    for( const std::string& txt : text )
    {
         const auto pair = extract(txt) ;
         static const char quote = '"' ;
         std::cout << "text: " << quote << txt << quote ;
         if( !pair.first.empty() )
         {
            std::cout << "    tag: " << quote << pair.first << quote  
                      << "    value: " << quote << pair.second << quote << "\n\n" ;
         }
         else std::cerr << "    **** badly formed string ****\n\n" ;
    }
}

http://coliru.stacked-crooked.com/a/619a66c9259a64dd
Jan 31, 2014 at 9:12am
Thanks for the code, guys.

I'll definitely look into learning regex as well :)
Jan 31, 2014 at 1:45pm
Yanson,

Could you explain how the code you posted works, please?

Thanks
Jan 31, 2014 at 2:30pm
http://www.cplusplus.com/reference/sstream/stringstream/?kw=stringstream
http://www.cplusplus.com/reference/string/string/getline/?kw=getline

stringstream ss(parse); This part of my code loads the string parse into a stringstream that I declare as ss. If you're not familiar with stringstreams, you can think of ss as working like cin, except that it reads from the string instead of the keyboard. I use a stringstream mostly so that I can use getline to extract the data, which seemed the easiest way.

getline(ss, temp, '>'); getline extracts all characters from ss up to and including the '>'. These characters are stored in temp, except for the '>', which is discarded. So temp at this point is equal to <baseQueueName. We don't want this, so we do nothing with it.

getline(ss, temp, '<'); Here I use getline a second time to extract all characters up to and including the '<'. These characters are stored in temp, overwriting the previous string. The '<' is discarded, so temp now equals Hello.
Last edited on Jan 31, 2014 at 2:41pm
Jan 31, 2014 at 2:52pm
To parse the text correctly, we need to do a bit more than the posted code.

Something like this (the code would be a lot shorter, and would not contain the ugly deeply nested ifs, if the running commentary explaining each step were not needed):

#include <iostream>
#include <string>
#include <sstream>

// return trimmed (leading and trailing spaces removed) string
std::string trim( const std::string& str )
{
    static const std::string ws = " \t\n" ;
    auto first = str.find_first_not_of(ws) ;
    auto last = str.find_last_not_of(ws) ;
    return first == std::string::npos ? "" : str.substr( first, last-first+1 ) ;
}

int main()
{
    const std::string text = "    < baseQueueName >  Hello  < /baseQueueName >  " ;
    std::istringstream stm(text) ;

    std::string left_tag ;
    std::string value ;
    std::string right_tag ;

    // step 1. read the first non whitespace character
    // if text is well formed, the character should be '<'
    char ch ;
    if( stm >> ch && ch == '<' )
    {
        std::cout << "step 1 ok. got: <\n" ;
        // step 2. read everything from there on till a '>' is found
        //         as the tag on the left;  discard the '>'
        // http://www.cplusplus.com/reference/string/string/getline/
        if( std::getline( stm, left_tag, '>' ) )
        {
            std::cout << "step 2 ok. got left_tag: " << left_tag << '\n' ;
            // step 3. read the value up to the next '<', discard the '<'
            if( std::getline( stm, value, '<' ) )
            {
                std::cout << "step 3 ok. got value: " << value << '\n' ;
                // step 4. read and discard the next '/' skipping over white space
                if( stm >> ch && ch == '/')
                {
                    std::cout << "step 4 ok. got: /\n" ;
                    // step 5. read everything from there on till a '>' is found
                    //         as the tag on the right
                    if( std::getline( stm, right_tag, '>' ) && !stm.eof() )
                    {
                        std::cout << "step 5 ok. got right_tag: " << right_tag << '\n' ;

                        // step 6. we have got the raw data now; sanitize it
                        left_tag = trim(left_tag) ;
                        value = trim(value) ;
                        right_tag = trim(right_tag) ;
                        std::cout << "step 6 trim ok.\n\tleft tag: " << left_tag
                                  << "\n\tvalue: " << value
                                  << "\n\tright tag: " << right_tag << '\n' ;
                        // step 7. verify that the tags match
                        if( left_tag == right_tag )
                        {
                            std::cout << "step 7 ok. the left and right tags match\n" ;
                            std::cout << "--------------------------\n *** success ***\n"
                                      << "tag: '" << left_tag << "'\n"
                                      << "value: '" << value << "'\n"
                                      << "------------------------\n" ;
                        }
                    }
                }
            }
        }
    }
}

http://coliru.stacked-crooked.com/a/0985cb779a12a7d1
Jan 31, 2014 at 2:54pm
I already know (well, I think I do) how stringstreams and getline work.

Is the string removed from ss, then?

Because after the first step, if I check the string in ss using

cout << ss.str();

it still shows the whole
<baseQueueName>Hello</baseQueueName>


Also, if I remove

getline(ss, temp, '>');

then it outputs

temp:


so it appears as though temp is still empty even though

getline(ss, temp, '<'); is still there.

If you could shed some light on these, I would appreciate it.

Thanks for the help so far, though.
Jan 31, 2014 at 3:38pm
> Is the string removed from ss then?

No, the string stays in ss; only the current read position is updated.
http://www.cplusplus.com/reference/istream/istream/tellg/
#include <iostream>
#include <string>
#include <sstream>
using namespace std;

int main()
{
    string parse = "<baseQueueName>Hello</baseQueueName>";
    stringstream ss(parse);
    string temp = "";

    cout << "current position " << ss.tellg() << endl;
    getline(ss, temp, '>');
    cout << "current position " << ss.tellg() << endl;
    getline(ss, temp, '<');

    cout << "temp: " << temp << endl;

    cin.ignore();
    return 0;
}


> also if i remove getline(ss, temp, '>'); then it outputs Temp:

You have to look at how the string is structured: <baseQueueName>Hello</baseQueueName>
The first character in the string is '<'. So if you call getline(ss, temp, '<'); first, getline sees that the first character is the delimiter it is looking for, extracts it, and discards it. No characters are stored in temp because there were no characters before the '<'.
Last edited on Jan 31, 2014 at 3:39pm
Jan 31, 2014 at 3:40pm
> getline will see that the first character is a '<' this is the character getline is looking for so it extracts the '<' and discards it. No characters are stored in temp because there were no characters before the '<'

Oh yeah, ha, sorry, that was stupid.

Anyway, I think it was the character position updating that caught me out. I had never come across that before, but I'm pretty sure I understand it now :)

Thanks for all the help.