regular expression to catch names

closed account (Dy7SLyTq)

I'm using the RE [_|a-z|A-Z]+[_|a-z|A-Z|0-9]+ to find names for my lexer, such as _Name, system, or varName34, i.e. it should allow the same naming conventions as C/C++. However, using it makes the lexer find each name twice. Say a variable name was Counter: it would find that Counter, and then the same Counter again. Could someone please tell me what is wrong with my RE?

Duthomhas (13292)

I don't know how it could find a single word twice. Are you sure it isn't finding two adjacent occurrences of the same word?

BTW, remember that an RE is greedy and will match as much as it can, so you need to give it an opportunity to stop. Also, I'm not sure what the pipes are there for.

[_a-zA-Z][_a-zA-Z0-9]* should work.
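To see what the pipes actually do, here is a minimal sketch using the same POSIX regex.h calls that appear later in the thread. Inside a bracket expression, | is just another literal character, so the original pattern accepts a|b as a "name" while the corrected one does not. FullMatch is a hypothetical helper written for this illustration, not something from the original posts.

```cpp
#include <cassert>
#include <regex.h>
#include <string>

// Hypothetical helper: true if the whole of Text matches Pattern
// (POSIX extended syntax), false otherwise.
bool FullMatch(const std::string& Pattern, const std::string& Text)
{
    regex_t Regex;
    // Wrap the pattern in ^( ... )$ so that a partial match does not count.
    const std::string Anchored = "^(" + Pattern + ")$";
    int rc = regcomp(&Regex, Anchored.c_str(), REG_EXTENDED | REG_NOSUB);
    assert(rc == 0);  // assumption: callers pass patterns that compile
    (void)rc;

    // REG_NOSUB: only the yes/no answer is needed, so no match array is passed.
    const bool Matched = regexec(&Regex, Text.c_str(), 0, nullptr, 0) == 0;
    regfree(&Regex);
    return Matched;
}
```

Calling FullMatch("[_|a-z|A-Z]+[_|a-z|A-Z|0-9]+", "a|b") should report a match, while FullMatch("[_a-zA-Z][_a-zA-Z0-9]*", "a|b") should not.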

If you can, use character classes/bracket expressions:

[_[:alpha:]][_[:alnum:]]*

This will work with Unicode text and non-English language identifiers. (Assuming your RE engine can do it.)

Hope this helps.

andywestken (4094)

Your regex seems to behave more or less OK to me; no duplicates. (The more or less is because it found the Ring part of 1Ring.)

Given expression:

"[_|A-Z|a-z]+[_|A-Z|a-z|0-9]+"

And input:

"123 Hello Counter World 456 Example 1Ring _TheEnd"

The output is:

Hello
Counter
World
Example
Ring
_TheEnd

using

#include <iostream>
#include <string>
#include <regex.h>

using std::cout;
using std::endl;
using std::string;

void Test(const string &Pattern, const string& Line);

int main()
{
    Test( "[_|A-Z|a-z]+[_|A-Z|a-z|0-9]+",
          "123 Hello Counter World 456 Example 1Ring _TheEnd" );

    return 0;
}

void Test(const string &Pattern, const string &Line)
{
    regex_t Regex = {0};
    regcomp(&Regex, Pattern.c_str(), REG_EXTENDED);

    regmatch_t Match = {0};
    const char* pch = Line.c_str();
    while(regexec(&Regex, pch, 1, &Match, 0) == 0) {
#ifdef DEBUG
        cout << "Match.rm_so = " << Match.rm_so << endl;
        cout << "Match.rm_eo = " << Match.rm_eo << endl;
#endif
        string Substr(pch + Match.rm_so, Match.rm_eo - Match.rm_so);
        cout << Substr << endl;
        pch += Match.rm_eo;
    }

    regfree(&Regex);
}
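The Ring result appears because regexec() searches forward for the first place the pattern matches, so it skips the leading 1. A lexer usually wants to know whether a name starts exactly at the current position. One possible way to get that, sketched under the assumption that the caller passes a pointer to the current position, is to anchor the pattern with ^. NameLengthAt is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <regex.h>

// Hypothetical helper: length of the identifier that starts exactly at
// Text, or 0 if Text does not begin with one.
std::size_t NameLengthAt(const char* Text)
{
    regex_t Regex;
    // The leading ^ pins the match to the start of the string passed in.
    int rc = regcomp(&Regex, "^[_a-zA-Z][_a-zA-Z0-9]*", REG_EXTENDED);
    assert(rc == 0);  // fixed pattern, expected to compile
    (void)rc;

    regmatch_t Match;
    std::size_t Length = 0;
    if (regexec(&Regex, Text, 1, &Match, 0) == 0)
        Length = static_cast<std::size_t>(Match.rm_eo - Match.rm_so);
    regfree(&Regex);
    return Length;
}
```

With this, NameLengthAt("1Ring") yields 0, so the caller can decide what to do with the digit (treat it as the start of a number, report an error, and so on) instead of silently producing Ring. Compiling the pattern on every call is wasteful; a real lexer would probably compile it once and reuse it.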

Andy

ne555 (10692)

By the way, you ask for strings of at least 2 elements.

closed account (Dy7SLyTq)

@duoas: I was using the pipes for "or". I'm still relatively new to REs and didn't know that I could do without them. And yes, it really is duplicating the name, because it happens with words that I know appear only once.

@everyone: Is it my code then? I switched the RE to the one suggested and it's still doing it. Could it be something in my code?

     regcomp(&Regex, "[_a-zA-Z][_a-zA-Z0-9]*", REG_EXTENDED);
     if(regexec(&Regex, Line.c_str(), 1, &Match, 0) == 0)
     {
          TokenList.push_back(Token("NAME", Line.substr(Match.rm_so, Match.rm_eo - Match.rm_so), LineNo, Match.rm_so));

          if(Match.rm_eo != Line.size() - 1)
          {
               Line = Line.substr(Match.rm_eo, Line.size() - Match.rm_eo);
               Lex(Line, TokenList, false);
          }
     }
     regfree(&Regex);

I didn't think it would be, because I copied the code from the versions I use for keywords and strings, which work fine.

Duthomhas (13292)

It doesn't make any sense to use a pipe in a character class.

Have you read the documentation for regexec()? You know that you are misinterpreting the result, right?

The first result is for the entire expression.
Any further results are for the pieces (the parenthesized subexpressions).

In your case, you have only asked for one thing, so you get that one thing and the entire result of that one thing.
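As a rough illustration of the whole-match-then-pieces layout, the sketch below asks regexec() for every entry it can fill in: index 0 is the whole match and the following indices are the parenthesized subexpressions. MatchPieces is a hypothetical helper. Note that when the array length passed to regexec() is 1, as in the snippet posted above, only the whole-match entry is filled in.

```cpp
#include <cassert>
#include <regex.h>
#include <string>
#include <vector>

// Hypothetical helper: returns the whole match followed by the text of each
// parenthesized subexpression, or an empty vector if nothing matched.
std::vector<std::string> MatchPieces(const std::string& Pattern,
                                     const std::string& Text)
{
    regex_t Regex;
    int rc = regcomp(&Regex, Pattern.c_str(), REG_EXTENDED);
    assert(rc == 0);  // assumption: callers pass patterns that compile
    (void)rc;

    std::vector<std::string> Pieces;
    // re_nsub is the number of parenthesized groups; +1 for the whole match.
    std::vector<regmatch_t> Matches(Regex.re_nsub + 1);
    if (regexec(&Regex, Text.c_str(), Matches.size(), Matches.data(), 0) == 0) {
        for (const regmatch_t& M : Matches) {
            if (M.rm_so == -1)
                Pieces.push_back("");  // this group took no part in the match
            else
                Pieces.push_back(Text.substr(M.rm_so, M.rm_eo - M.rm_so));
        }
    }
    regfree(&Regex);
    return Pieces;
}
```

For example, MatchPieces("([_a-zA-Z])([_a-zA-Z0-9]*)", "123 Counter") should give three entries: Counter (the whole match), then C and ounter (the two groups).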

Hope this helps.

closed account (Dy7SLyTq)

OK, that makes sense. So then I have a few more questions:
a) How can I stop that from happening?
b) Why does this happen if I pass the rest of the string past the first match?
c) Why doesn't this happen with the exact same code, except that the RE is \"[^\"]+\" (for strings) and import|function|var|println|end (for keywords)?

Duthomhas (13292)

a) you can't
b) what?
c) idk.

Believe it or not, the reason it happens is so that people can get at the pieces of their advanced regexes. You are actually using some pretty simple ones.

Just ignore everything but the first element.

closed account (Dy7SLyTq)

how do i ignore all but the first element? would i have to do a unique copy thing?

cire (8284)

how do i ignore all but the first element? would i have to do a unique copy thing?

You would ignore all but the first element by not using anything but the first element.

closed account (Dy7SLyTq)

I can't tell if it's the first element, though. It's just pushing them back. And what if it's something like this:
_Name _Name, in which case I would want it to be two?

Duthomhas (13292)

Why would you want to treat "_Name _Name" specially? Is it supposed to be a single lexeme?

Also, you know full well how to just get one thing from a list. The only difference is that your list comes as a series of function calls instead of an indexable array. So how many times do you need to call the function to get the first element?
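One way to keep two genuine occurrences such as "_Name _Name" apart is to record where each match starts. The sketch below is only an illustration, not the poster's Lex() function: it loops over the line instead of recursing, and stores each name together with its column. NameToken and FindNames are hypothetical names. Since rm_eo is the offset one past the end of the match, the loop stops when the offset reaches Line.size().

```cpp
#include <cassert>
#include <cstddef>
#include <regex.h>
#include <string>
#include <vector>

// Hypothetical token type: the matched text and the column where it starts.
struct NameToken {
    std::string Text;
    std::size_t Column;
};

// Collects every identifier in Line, one token per occurrence.
std::vector<NameToken> FindNames(const std::string& Line)
{
    regex_t Regex;
    int rc = regcomp(&Regex, "[_a-zA-Z][_a-zA-Z0-9]*", REG_EXTENDED);
    assert(rc == 0);  // fixed pattern, expected to compile
    (void)rc;

    std::vector<NameToken> Tokens;
    std::size_t Offset = 0;  // where the next search begins
    regmatch_t Match;
    while (Offset < Line.size() &&
           regexec(&Regex, Line.c_str() + Offset, 1, &Match, 0) == 0) {
        // rm_so and rm_eo are relative to the pointer passed to regexec().
        const std::size_t Start = Offset + Match.rm_so;
        const std::size_t Length = Match.rm_eo - Match.rm_so;
        Tokens.push_back({Line.substr(Start, Length), Start});
        Offset += Match.rm_eo;  // continue just past this match
    }
    regfree(&Regex);
    return Tokens;
}
```

For the input "_Name _Name" this should produce two tokens with the same text but different columns (0 and 6), so a repeated name in the source is distinguishable from one name being pushed twice.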

andywestken (4094)

Out of interest, what's your input when you get repeated "Counter" values?

Andy
