regular expression to catch names

Sep 22, 2013 at 12:56am
closed account (Dy7SLyTq)
So I'm using the RE [_|a-z|A-Z]+[_|a-z|A-Z|0-9]+ to find names for my lexer, such as _Name, system, or varName34, i.e. it allows the same naming conventions as C/C++. However, using this makes it find each name twice: if a variable name is Counter, it finds Counter and then the same Counter again. Could someone please tell me what is wrong with my RE?
Sep 22, 2013 at 1:20am
I don't know how it could find a single word twice. Are you sure it isn't finding two adjacent occurrences of the same word?

BTW, remember that an RE is greedy and will match as much as it can, so you need to give it an opportunity to stop. Also, I'm not sure what the pipes are there for.

[_a-zA-Z][_a-zA-Z0-9]* should work.

If you can, use POSIX character classes inside the bracket expressions:

[_[:alpha:]][_[:alnum:]]*

With a suitable locale, this can also work with Unicode text and non-English language identifiers. (Assuming your RE engine can do it.)

Hope this helps.
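
To illustrate the point about the pipes, here is a minimal sketch using the same POSIX regex.h API as the rest of this thread. FullMatch is a hypothetical helper, not something from the original posts; it anchors a pattern and reports whether the whole input matches. It suggests that inside a bracket expression the pipe is an ordinary character, so the original pattern accepts text such as a|b as a "name".

```cpp
#include <regex.h>
#include <string>

// Hypothetical helper (not from the thread): true if the whole of `text`
// matches the POSIX extended regular expression `pattern`.
bool FullMatch(const std::string& pattern, const std::string& text)
{
    regex_t re;
    // Anchor the pattern so that a partial match does not count.
    const std::string anchored = "^(" + pattern + ")$";
    if (regcomp(&re, anchored.c_str(), REG_EXTENDED | REG_NOSUB) != 0)
        return false;  // the pattern failed to compile
    const bool ok = regexec(&re, text.c_str(), 0, nullptr, 0) == 0;
    regfree(&re);
    return ok;
}
```

With this helper, FullMatch("[_|a-z|A-Z]+[_|a-z|A-Z|0-9]+", "a|b") should be true, because '|' is a member of both bracket expressions, while FullMatch("[_a-zA-Z][_a-zA-Z0-9]*", "a|b") should be false.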
Sep 22, 2013 at 1:53am
Your regex seems to behave more or less OK to me; no duplicates. (The more or less is because it found the Ring part of 1Ring.)

Given expression:

"[_|A-Z|a-z]+[_|A-Z|a-z|0-9]+"


And input:

"123 Hello Counter World 456 Example 1Ring _TheEnd"


The output is:

Hello
Counter
World
Example
Ring
_TheEnd


using

#include <iostream>
#include <string>
#include <regex.h>

using std::cout;
using std::endl;
using std::string;

void Test(const string &Pattern, const string& Line);

int main()
{
    Test( "[_|A-Z|a-z]+[_|A-Z|a-z|0-9]+",
          "123 Hello Counter World 456 Example 1Ring _TheEnd" );

    return 0;
}

void Test(const string &Pattern, const string &Line)
{
    regex_t Regex = {0};
    regcomp(&Regex, Pattern.c_str(), REG_EXTENDED);

    regmatch_t Match = {0};
    const char* pch = Line.c_str();
    while(regexec(&Regex, pch, 1, &Match, 0) == 0) {
#ifdef DEBUG
        cout << "Match.rm_so = " << Match.rm_so << endl;
        cout << "Match.rm_eo = " << Match.rm_eo << endl;
#endif
        string Substr(pch + Match.rm_so, Match.rm_eo - Match.rm_so);
        cout << Substr << endl;
        pch += Match.rm_eo;
    }

    regfree(&Regex);
}


Andy
Last edited on Sep 22, 2013 at 1:55am
Sep 22, 2013 at 2:17am
By the way, your pattern asks for strings of at least 2 characters (one from each + group), so a one-letter name such as i will not match.
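
A small sketch of the minimum-length issue, assuming the same POSIX regex.h API used earlier in the thread. FirstMatch is a hypothetical helper introduced only for this illustration: it returns the first substring a pattern matches, or an empty string if there is none.

```cpp
#include <regex.h>
#include <string>

// Hypothetical helper (not from the thread): returns the first substring of
// `text` matched by the POSIX extended regex `pattern`, or "" if none.
std::string FirstMatch(const std::string& pattern, const std::string& text)
{
    regex_t re;
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED) != 0)
        return "";  // the pattern failed to compile

    regmatch_t m;
    std::string result;
    if (regexec(&re, text.c_str(), 1, &m, 0) == 0)
        result = text.substr(m.rm_so, m.rm_eo - m.rm_so);

    regfree(&re);
    return result;
}
```

For the input "i = 1", the two-group form [_a-zA-Z]+[_a-zA-Z0-9]+ should find nothing, because each + needs at least one character, whereas [_a-zA-Z][_a-zA-Z0-9]* should find i.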
Sep 22, 2013 at 3:21am
closed account (Dy7SLyTq)
@Duoas: I was using the pipes for "or". I am still relatively new to REs and didn't know that I could do without them. And yeah, it really is duplicating, because it's doing it for words that I know there is only one of.

@everyone: Is it my code then? I switched the RE to the one suggested and it's still doing it. Could it be something in my code?
     regcomp(&Regex, "[_a-zA-Z][_a-zA-Z0-9]*", REG_EXTENDED);
     if(regexec(&Regex, Line.c_str(), 1, &Match, 0) == 0)
     {
          TokenList.push_back(Token("NAME", Line.substr(Match.rm_so, Match.rm_eo - Match.rm_so), LineNo, Match.rm_so));

          if(Match.rm_eo != Line.size() - 1)
          {
               Line = Line.substr(Match.rm_eo, Line.size() - Match.rm_eo);
               Lex(Line, TokenList, false);
          }
     }
     regfree(&Regex);
I didn't think it would be, because I copied the code from the versions I was using for keywords and strings, which work fine.
Sep 22, 2013 at 3:33am
It doesn't make any sense to use a pipe in a bracket expression: inside [...] a pipe is just a literal '|' character, not "or".

Have you read the documentation for regexec()? You know that you are misinterpreting the result, right?

The first element of the match array is for the entire expression.
The following element(s) is(are) for the parenthesized pieces (subexpressions).

In your case, you have only asked for one thing, so you get that one thing and the entire result of that one thing.

Hope this helps.
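
The layout of the match array described above can be sketched as follows. MatchSlots is a hypothetical helper written for this illustration; it calls regexec() once with room for nmatch entries and returns the text of every entry that was filled in. Note that regexec() fills at most nmatch entries, so with nmatch set to 1 (as in the code posted earlier) only the whole-expression entry is reported.

```cpp
#include <regex.h>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): runs regexec() once and returns
// the text of each filled match slot, slot 0 (the whole match) first.
std::vector<std::string> MatchSlots(const std::string& pattern,
                                    const std::string& text,
                                    std::size_t nmatch)
{
    std::vector<std::string> slots;
    regex_t re;
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED) != 0)
        return slots;  // the pattern failed to compile

    std::vector<regmatch_t> pm(nmatch);
    if (nmatch > 0 && regexec(&re, text.c_str(), nmatch, pm.data(), 0) == 0) {
        for (const regmatch_t& m : pm) {
            if (m.rm_so == -1)
                continue;  // slot not used by this match
            slots.push_back(text.substr(m.rm_so, m.rm_eo - m.rm_so));
        }
    }
    regfree(&re);
    return slots;
}
```

For example, MatchSlots("([_a-zA-Z])([_a-zA-Z0-9]*)", "Counter = 1", 3) should give Counter, C, ounter (the whole match, then the two groups), while the group-free pattern with nmatch of 1 should give only Counter.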
Last edited on Sep 22, 2013 at 3:34am
Sep 22, 2013 at 4:01am
closed account (Dy7SLyTq)
OK, that makes sense. So then I have a few more questions:
a) How can I stop that from happening?
b) Why does this happen if I pass only the rest of the string past the first match?
c) Why doesn't this happen with the exact same code, except the RE is \"[^\"]+\" (for strings) and import|function|var|println|end (for keywords)?
Sep 22, 2013 at 4:11am
a) you can't
b) what?
c) idk.

Believe it or not, the reason it happens is so that people can get at the pieces of their advanced regexes. You are actually using some pretty simple ones.

Just ignore everything but the first element.
Sep 22, 2013 at 4:18am
closed account (Dy7SLyTq)
How do I ignore all but the first element? Would I have to do a unique copy thing?
Sep 22, 2013 at 4:31am
How do I ignore all but the first element? Would I have to do a unique copy thing?


You would ignore all but the first element by not using anything but the first element.
Sep 22, 2013 at 4:43am
closed account (Dy7SLyTq)
I can't tell if it's the first element, though. It's just pushing them back. And what if it's something like this:
_Name _Name
in which case I would want it to be two tokens?
Sep 22, 2013 at 3:16pm
Why would you want to treat "_Name _Name" specially? Is it supposed to be a single lexeme?

Also, you know full well how to just get one thing from a list. The only difference is that your list comes as a series of function calls instead of an indexable array. So how many times do you need to call the function to get the first element?
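
One way to read "a series of function calls" is a loop that calls regexec() once per token and advances past each match. The sketch below is only an illustration under stated assumptions: NameToken and FindNames are hypothetical names (the thread's real Token class and Lex function are not shown), and the pattern is the one suggested earlier in the thread. It keeps a running offset so that each column is relative to the original line rather than to the remaining substring.

```cpp
#include <regex.h>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type; the thread's real Token class is not shown.
struct NameToken {
    std::string text;
    std::size_t column;  // offset from the start of the original line
};

// Hypothetical scanner: collects every name in `line`, one regexec() call
// per token, without recursion and without copying the rest of the line.
std::vector<NameToken> FindNames(const std::string& line)
{
    std::vector<NameToken> tokens;
    regex_t re;
    if (regcomp(&re, "[_a-zA-Z][_a-zA-Z0-9]*", REG_EXTENDED) != 0)
        return tokens;  // the pattern failed to compile

    std::size_t offset = 0;
    regmatch_t m;
    while (offset < line.size() &&
           regexec(&re, line.c_str() + offset, 1, &m, 0) == 0) {
        // rm_so and rm_eo are relative to the pointer passed to regexec().
        const std::size_t start = offset + static_cast<std::size_t>(m.rm_so);
        const std::size_t length = static_cast<std::size_t>(m.rm_eo - m.rm_so);
        tokens.push_back({line.substr(start, length), start});
        // The pattern cannot match an empty string, so this always advances.
        offset += static_cast<std::size_t>(m.rm_eo);
    }
    regfree(&re);
    return tokens;
}
```

For the input "_Name _Name" this should yield two tokens, at columns 0 and 6. Like the test program earlier in the thread, it would still pick Ring out of 1Ring, so a real lexer would need to handle numbers before names.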
Sep 22, 2013 at 4:03pm
Out of interest, what's your input when you get repeated "Counter" values?

Andy