A question about regex search?

Hi, I am new here. I have a problem imitating a behaviour of flex, namely returning one token after each call of yylex. I use the following solution:

  ...
  regex reg("pattern1|pattern2|pattern3");
  smatch m;
  string data("...");
  regex_search(data,m,reg);
  ...

Now the question is: how can I easily tell which pattern matched the first token?
And is there a simple way to implement such behaviour?
Thanks.

Duthomhas (13214)

I'm not sure I understand your question.

You are using a pattern that matches any of three distinct tokens, so the regex will return any one of them.

If you want to know which token matched, you'll have to look at the result of the match.

(I presume you are using Boost.Regex? Just look at m[0].)

JLBorges (13770)

> regex reg("pattern1|pattern2|pattern3") ;

Use marked sub-expressions to capture the match
regex reg( "(pattern1)|(pattern2)|(pattern3)" ) ;
and then see which submatch was returned.

For example:

#include <regex>
#include <string>
#include <iostream>

int main()
{
    const std::regex re( "(a+)|(b+)|(c+)" ) ;

    const std::string& text = "a  bb  aaa  cccc  bbbbb  aaaaaa  ccccccc" ;
    std::cout << text << "\n0123456789012345678901234567890123456789\n\n" ;

    std::smatch match ;
    auto begin = text.begin() ;
    while( std::regex_search( begin, text.end(), match, re ) )
    {
        std::cout << "found \"" << match[0] << "\" starting at " 
                  << begin - text.begin() + match.position() ;

        static const std::string subex[] = { "\"a+\"", "\"b+\"", "\"c+\"" } ;
        for( std::size_t i = 1 ; i < match.size() ; ++i )
        {
            if( match[i].length() > 0 ) 
                std::cout << " - matched subexpression " << subex[i-1] << '\n' ;
        }

        begin = match[0].second ;
    }
}

Output:

a  bb  aaa  cccc  bbbbb  aaaaaa  ccccccc                    
0123456789012345678901234567890123456789                    
                                                            
found "a" starting at 0 - matched subexpression "a+"        
found "bb" starting at 3 - matched subexpression "b+"       
found "aaa" starting at 7 - matched subexpression "a+"      
found "cccc" starting at 12 - matched subexpression "c+"    
found "bbbbb" starting at 18 - matched subexpression "b+"   
found "aaaaaa" starting at 25 - matched subexpression "a+"  
found "ccccccc" starting at 33 - matched subexpression "c+"
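
One caveat about the length() > 0 test in the loop above: a sub-expression that is allowed to match an empty string would go unreported. A minimal sketch of the difference, assuming two helper names (first_matched_group and first_nonempty_group) that are made up for illustration and are not part of <regex>:

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: 1-based index of the first capture group that
// participated in the match, or 0 if none did.
std::size_t first_matched_group( const std::smatch& match )
{
    for( std::size_t i = 1 ; i < match.size() ; ++i )
        if( match[i].matched ) return i ;
    return 0 ;
}

// Hypothetical helper: same loop, but using the length test.
// A group that matched the empty string is skipped here.
std::size_t first_nonempty_group( const std::smatch& match )
{
    for( std::size_t i = 1 ; i < match.size() ; ++i )
        if( match[i].length() > 0 ) return i ;
    return 0 ;
}
```

With a pattern such as "(a*)|(b+)" searched in "xyz", the first group matches an empty string at position 0, so the two helpers disagree. For patterns like "(a+)|(b+)|(c+)", where every alternative consumes at least one character, both tests give the same answer.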

immm008 (7)

Thanks a lot, but don't you think the complexity is increased by a factor of n, assuming n is the number of patterns? (Are you sure it would compile, since you use iterators instead of const_iterators?)
I just wonder why the class match_results doesn't supply a member function
telling which pattern is matched when using the form "(pattern1|pattern2|...)".
I think it would be trivial: when the DFA implementation reaches an accepting state, it is
easy to find the pattern to which that state belongs.
By the way, since m.str(i) or m[i].str() returns the lexeme of the i-th capture group, why is there no trivial way to return the index of the matched capture group?

JLBorges (13770)

> don't you think the complexity is increased by a factor of n, assuming n is the number of patterns?

std::regex_search() returns one match.
To find all matches, we need to call std::regex_search() repeatedly until no more matches are found.
The task can be simplified by using a regex iterator; it iterates over the sequence of all matches.

> Are you sure it would compile since you use iterators instead of const_iterators

text is a const std::string, so text.begin() returns a const_iterator.
With auto begin = text.begin() ; the type of begin is std::string::const_iterator.
But yes, the code would be clearer if it were written as auto begin = text.cbegin() ;
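
The deduction described above can be checked at compile time. A small sketch, not taken from the original code; it only restates the iterator types that the standard library specifies for std::string and std::smatch:

```cpp
#include <regex>
#include <string>
#include <type_traits>
#include <utility>

// On a const std::string, begin() resolves to the const overload.
static_assert( std::is_same< decltype( std::declval<const std::string&>().begin() ),
                             std::string::const_iterator >::value,
               "begin() on a const string yields a const_iterator" ) ;

// On a non-const std::string, begin() yields a mutable iterator ...
static_assert( std::is_same< decltype( std::declval<std::string&>().begin() ),
                             std::string::iterator >::value,
               "begin() on a non-const string yields an iterator" ) ;

// ... while cbegin() yields a const_iterator either way.
static_assert( std::is_same< decltype( std::declval<std::string&>().cbegin() ),
                             std::string::const_iterator >::value,
               "cbegin() always yields a const_iterator" ) ;

// std::smatch is defined in terms of std::string::const_iterator.
static_assert( std::is_same< std::smatch,
                             std::match_results<std::string::const_iterator> >::value,
               "smatch works with const_iterators" ) ;
```

This is why cbegin() is the safer spelling: if text were ever made non-const, begin() would produce a std::string::iterator, which no longer lines up with the iterator type that std::smatch expects.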

Incidentally, just noticed that there is a typo in:
const std::string& text = "a  bb  aaa  cccc  bbbbb  aaaaaa  ccccccc" ;
Should have been:
const std::string text = "a  bb  aaa  cccc  bbbbb  aaaaaa  ccccccc" ;

> I just wonder why the class match_results doesn't supply a member function telling which pattern is matched
> why is there no trivial way to return the index of the matched capture group?

It is trivial to write such a function ourselves:

std::size_t index_of_matched_subexpression( const std::smatch& match ) 
{
    if( match.ready() ) 
        for( std::size_t i = 1 ; i < match.size() ; ++i ) 
            if( match[i].matched ) return i ;

    return std::string::npos ;
}

For instance, using std::regex_search()

#include <regex>
#include <string>
#include <iostream>

std::string::size_type start_position_of_match( const std::string& text, 
                                                std::string::const_iterator search_begin,
                                                const std::smatch& match )
{ 
    if( match.ready() ) return search_begin - text.begin() + match.position() ; 
    else return std::string::npos ;
}

std::size_t index_of_matched_subexpression( const std::smatch& match ) 
{
    if( match.ready() ) 
        for( std::size_t i = 1 ; i < match.size() ; ++i ) 
            if( match[i].matched ) return i ;

    return std::string::npos ;
}

int main()
{
    const std::regex re( "(a+b)|(b+c)|(c+a)" ) ;

    const std::string text = "..ab..bbc..ccca..bbbbc..aaaaab..cccccca" ;
    std::cout << text << "\n0123456789012345678901234567890123456789\n\n" ;

    std::smatch match ;
    auto begin = text.cbegin() ;
    while( std::regex_search( begin, text.end(), match, re ) )
    {
        auto pos_start = start_position_of_match( text, begin, match) ;
        auto subex_index = index_of_matched_subexpression(match) ; 
        std::cout << "found \"" << match[0] << "\" starting at " 
                  <<  pos_start << " matched $" << subex_index << '\n' ;

        begin = match[0].second ;
    }
}

Output:

..ab..bbc..ccca..bbbbc..aaaaab..cccccca    
0123456789012345678901234567890123456789   
                                           
found "ab" starting at 2 matched $1        
found "bbc" starting at 6 matched $2       
found "ccca" starting at 11 matched $3     
found "bbbbc" starting at 17 matched $2    
found "aaaaab" starting at 24 matched $1   
found "cccccca" starting at 32 matched $3

With std::sregex_iterator, the code would be shorter and sweeter:

#include <regex>
#include <string>
#include <iostream>

std::size_t index_of_matched_subexpression( std::sregex_iterator iter ) 
{
    for( std::size_t i = 1 ; i < iter->size() ; ++i ) 
            if( (*iter)[i].matched ) return i ;

    return std::string::npos ;
}

int main()
{
    const std::regex re( "(a+b)|(b+c)|(c+a)" ) ;

    const std::string text = "..ab..bbc..ccca..bbbbc..aaaaab..cccccca" ;
    std::cout << text << "\n0123456789012345678901234567890123456789\n\n" ;

    std::sregex_iterator iter( text.begin(), text.end(), re ) ;
    std::sregex_iterator end ;
    for( ; iter != end ; ++iter )
        std::cout << "found \"" << iter->str() << "\" starting at " << iter->position() 
                  << " matched $" << index_of_matched_subexpression(iter) << '\n' ;
}

Output:

..ab..bbc..ccca..bbbbc..aaaaab..cccccca    
0123456789012345678901234567890123456789   
                                           
found "ab" starting at 2 matched $1        
found "bbc" starting at 6 matched $2       
found "ccca" starting at 11 matched $3     
found "bbbbc" starting at 17 matched $2    
found "aaaaab" starting at 24 matched $1   
found "cccccca" starting at 32 matched $3
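
To get closer to the flex behaviour asked about in the opening post (one token per call), the search loop can be wrapped in a function that remembers its position between calls. A rough sketch under stated assumptions: the token struct and the next_token function are hypothetical names introduced here, and every alternative in the regex is assumed to be wrapped in its own capture group.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical token record (not part of <regex>).
struct token
{
    std::size_t kind ;    // 1-based index of the sub-expression that matched, 0 = no token
    std::string lexeme ;  // the matched text
} ;

// Hypothetical yylex-like helper: searches for the next token starting at 'pos',
// advances 'pos' past it and returns it. 'pos' must be an iterator into 'text'.
token next_token( const std::string& text, std::string::const_iterator& pos,
                  const std::regex& re )
{
    std::smatch match ;
    // An empty match would not advance 'pos', so it is treated as "no token".
    if( !std::regex_search( pos, text.cend(), match, re ) || match.length() == 0 )
        return token{ 0, "" } ;

    token result{ 0, match.str() } ;
    for( std::size_t i = 1 ; i < match.size() ; ++i )
        if( match[i].matched ) { result.kind = i ; break ; }

    pos = match[0].second ;
    return result ;
}
```

A caller would initialise pos with text.cbegin() and call next_token repeatedly until kind is 0. Like the examples above, this skips over text that matches none of the patterns; passing std::regex_constants::match_continuous as the flags argument of std::regex_search would instead require each token to start exactly at pos, which is closer to how a generated lexer behaves.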

andywestken (4094)

@immm008

As you mention flex in your opening post, I assume you're trying to use regular expressions to create a lexer. If so, this article might be of interest.

Regular expressions in lexing and parsing
http://commandcenter.blogspot.co.uk/2011/08/regular-expressions-in-lexing-and.html

Andy

immm008 (7)

@JLBorges
I appreciate your help, it's so nice of you!