isalpha() skips a non-alphabetic character if there is no space in front of it.

Hey guys.

I'm writing a simple SQLite compiler in C++, and all is going well except for my use of the isalpha() function.

The keyword tokenizer works like this: it checks whether the current character is alphabetic; if it is, it pushes the character into a char vector and keeps looping until the next character is no longer alphabetic.

It works perfectly fine, but for some reason, if there isn't a space before the non-alphabetic character, that character just gets skipped completely.

If someone could help me that would be great.


isalpha() function:
    else if(isalpha(this->c)) {
        this->temp.clear();

        while(isalpha(lexer_peek())) {
            this->temp.push_back(this->src[this->p]);
            lexer_advance_char();
        }

        this->value = std::string(begin(this->temp), end(this->temp));
        this->keyword = lexer_check_if_keyword(this->value);

        for(int i = 0; i < TokenType.size(); i++) {
            if(TokenType[i] == keyword) {
                token = Token(this->value.c_str(), TokenType[i].c_str());
            }
        }

    }


Here's what the output looks like when there is a space before a non-alphabetic character:


TEST.sql:
DIRECTORY = "TEST.sql"

CREATE TABLE students (
    name    TEXT NOT NULL ,
    lname   TEXT NOT NULL 
);



output:
DIRECTORY :: DIRECTORY
EQUALS :: =
DIRECTORY_URL :: TEST.sql
CREATE :: CREATE
TABLE :: TABLE
IDENTIFIER :: students
LPAREN :: (
IDENTIFIER :: name
TEXT :: TEXT
NOT :: NOT
NULL :: NULL
COMMA :: ,
IDENTIFIER :: lname
TEXT :: TEXT
NOT :: NOT
NULL :: NULL
RPAREN :: )
SEMICOL :: ;
NEWLINE ::


Here's what the output looks like when there is no space before a non-alphabetic character:


TEST.sql:
DIRECTORY = "TEST.sql"

CREATE TABLE students(
    name    TEXT NOT NULL,
    lname   TEXT NOT NULL
);



output:
DIRECTORY :: DIRECTORY
EQUALS :: =
DIRECTORY_URL :: TEST.sql
CREATE :: CREATE
TABLE :: TABLE
IDENTIFIER :: students
IDENTIFIER :: name
TEXT :: TEXT
NOT :: NOT
NULL :: NULL
IDENTIFIER :: lname
TEXT :: TEXT
NOT :: NOT
NULL :: NULL
SEMICOL :: ;
NEWLINE ::


Notice that the left parenthesis, right parenthesis, and the comma don't get registered.

I just don't understand why it does that.
If you guys need anything else, just let me know, thanks!
Sounds like a job for a debugger! Try putting a breakpoint in your parsing/lexing function and stepping through it line by line once you reach the input that breaks, to make sure it is doing what you want. It could be a problem elsewhere in your code that isn't apparent without debugging.
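If stepping through with a debugger feels like too much setup, even a temporary trace at the top of lexer_get_token() will show which character each call starts on. This is only a sketch: it reuses the c and p members from your snippet and needs <iostream>:

    // temporary debug trace -- remove once the problem is found
    std::cerr << "lexer_get_token: c='" << this->c << "' (p=" << this->p << ")\n";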
Looking at your isalpha() branch, you peek at the next character to see if it is alpha. Fine.
If it's not alpha, i.e. a left paren, you skip the loop body and fall through to where you build the value string. What happens to the left paren? You haven't really shown enough code to verify what happens to it.

Presumably the left paren is still the next character in your lexer buffer, but I would look outside the code you've shown to verify what happens to the left paren.

BTW, Nice lexer trace.
@AbstractionAnon

The left and right parentheses just get pushed into the temp vector and turned into a token; pretty simple, but I don't think that's where the issue is.

    else if(this->c == '(') {
        this->temp.clear();

        this->temp.push_back(this->src[this->p]);
        this->value = std::string(begin(this->temp), end(this->temp));

        token = Token(this->value.c_str(), TokenType[Type_LPAREN].c_str());
    }

    else if(this->c == ')') {
        this->temp.clear();

        this->temp.push_back(this->src[this->p]);
        this->value = std::string(begin(this->temp), end(this->temp));

        token = Token(this->value.c_str(), TokenType[Type_RPAREN].c_str());
    }


Also, thanks for the compliment. I'm still really new to building source-to-source compilers and to C++ in general.
I assume you're writing the lexer for experience with doing so.
You should be aware that there are lexer generators that will build a lexer for you. The classic one is Lex:
https://en.wikipedia.org/wiki/Lex_(software)
My favorite is Flex:
https://en.wikipedia.org/wiki/Flex_(lexical_analyser_generator)
Once you've written a lexer with one of these tools, you will never go back to writing your own.

Yes, these tools generate a C lexer, not C++, but the objective is to get a stream of tokens that are ready to parse.
I notice that the ( and ) cases are not calling lexer_advance_char(), while the identifier case is. Does your loop look like this?
while (/*there are more characters*/){
    //handle various cases...
    else if (isalpha(this->c)){
        //...
        while (/*...*/){
            //...
            lexer_advance_char(); // <---
        }
        //...
    }
    //handle more cases...

    lexer_advance_char();         // <---
}
If so, then yes, that's wrong. You need to decide whether the loop or the individual cases will take care of advancing the cursor. I usually find it more practical to let the cases take care of it, even if that means remembering to advance it in each one.
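If it helps, here is a rough, self-contained sketch of the "each case advances the cursor itself" layout. The input string and the token printing are made up purely for illustration:

#include <cctype>
#include <cstddef>
#include <iostream>
#include <string>

int main() {
    std::string src = "students(name,";
    std::size_t p = 0;

    while (p < src.size()) {
        if (std::isalpha(static_cast<unsigned char>(src[p]))) {
            std::size_t start = p;
            while (p < src.size() && std::isalpha(static_cast<unsigned char>(src[p])))
                ++p;                              // the identifier case consumes its own run
            std::cout << "IDENT  " << src.substr(start, p - start) << '\n';
        } else {
            std::cout << "SYMBOL " << src[p] << '\n';
            ++p;                                  // every other case consumes exactly one character
        }
        // no extra advance at the bottom of the loop: each branch already moved p
    }
}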
What does lexer_advance_char() do?
isalpha() works on individual characters; there's no such thing as a space "in front of" the character it is looking at.
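A tiny demonstration of that point: the result depends only on the one character passed in, never on its neighbours.

#include <cctype>
#include <iostream>

int main() {
    // isalpha() sees exactly one character per call; surrounding characters play no part
    for (unsigned char ch : {'s', '(', ' ', ','})
        std::cout << "isalpha('" << ch << "') -> " << (std::isalpha(ch) ? "true" : "false") << '\n';
}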

Regarding your code, you only drop into the loop if the first character matches isalpha(). If you want to handle spaces specifically, you should probably add that to the check:

else if(isalpha(this->c) || isspace(this->c)) {
@helios

I should have just shown the whole tokenizer from the start.
Here is the whole function that takes the characters and puts them into tokens.
My apologies for not doing it at the start.

Token Lexer::lexer_get_token() {

    lexer_skip_whitespace();
    Token token;

    switch(this->c) {
        case '\n':
            this->temp.clear();

            this->temp.push_back(this->src[this->p]);
            this->value = std::string(begin(this->temp), end(this->temp));

            token = Token(this->value.c_str(), TokenType[Type_NEWLINE].c_str());
            break;

        case '\'':
            this->temp.clear();
            lexer_advance_char();

            while(this->c != '\'') {
                this->temp.push_back(this->src[this->p]);
                lexer_advance_char();
            }

            this->value = std::string(begin(this->temp), end(this->temp));
            token = Token(this->value.c_str(), TokenType[Type_STRING].c_str());
            break;

        case '\"':
            this->temp.clear();
            lexer_advance_char();

            while(this->c != '\"') {
                this->temp.push_back(this->src[this->p]);
                lexer_advance_char();
            }

            this->value = std::string(begin(this->temp), end(this->temp));
            token = Token(this->value.c_str(), TokenType[Type_DIRECTORY_URL].c_str());
            break;

        case '*':
            this->temp.clear();

            this->temp.push_back(this->src[this->p]);
            this->value = std::string(begin(this->temp), end(this->temp));

            token = Token(this->value.c_str(), TokenType[Type_ASTERISK].c_str());
            break;

        case '=':
            this->temp.clear();

            this->temp.push_back(this->src[this->p]);
            this->value = std::string(begin(this->temp), end(this->temp));

            token = Token(this->value.c_str(), TokenType[Type_EQUALS].c_str());
            break;

        case '(':
            this->temp.clear();

            this->temp.push_back(this->src[this->p]);
            this->value = std::string(begin(this->temp), end(this->temp));

            token = Token(this->value.c_str(), TokenType[Type_LPAREN].c_str());
            break;

        case ')':
            this->temp.clear();

            this->temp.push_back(this->src[this->p]);
            this->value = std::string(begin(this->temp), end(this->temp));

            token = Token(this->value.c_str(), TokenType[Type_RPAREN].c_str());
            break;

        case ',':
            this->temp.clear();

            this->temp.push_back(this->src[this->p]);
            this->value = std::string(begin(this->temp), end(this->temp));

            token = Token(this->value.c_str(), TokenType[Type_COMMA].c_str());
            break;

        case ';':
            this->temp.clear();

            this->temp.push_back(this->src[this->p]);
            this->value = std::string(begin(this->temp), end(this->temp));

            token = Token(this->value.c_str(), TokenType[Type_SEMICOL].c_str());
            break;

        case '\0':
            this->temp.clear();

            this->temp.push_back(this->src[this->p]);
            this->value = std::string(begin(this->temp), end(this->temp));

            token = Token(this->value.c_str(), TokenType[Type_EOF].c_str());
            break;

    }

    if(isalpha(this->c)) {
        this->temp.clear();

        while(isalpha(lexer_peek())) {
            this->temp.push_back(this->src[this->p]);
            lexer_advance_char();
        }

        this->value = std::string(begin(this->temp), end(this->temp));
        this->keyword = lexer_check_if_keyword(this->value);

        for(int i = 0; i < TokenType.size(); i++) {
            if(TokenType[i] == keyword) {
                token = Token(this->value.c_str(), TokenType[i].c_str());
            }
        }
    }

    lexer_advance_char();
    return token;

}


The main.cpp loop is this:
    Lexer lexer(input);
    Token token = lexer.lexer_get_token();

    while(token.type != lexer.TokenType[lexer.Type_EOF]) {
        std::cout << token.type << " :: " << token.value << std::endl;
        token = lexer.lexer_get_token();
    }
gabriel11 wrote:
I should have just shown the whole tokenizer from the start.

C/C++ stdlibs already have string tokenization functionality provided by the <string.h>/<cstring> library.

https://en.cppreference.com/w/c/string/byte/strtok
https://en.cppreference.com/w/cpp/string/byte/strtok

Notice there are some differences between what the C stdlib and the C++ stdlib provides.

strtok is destructive to the string, so best to use a copy for tokenizing.

Boost has a non-destructive tokenizer library.

https://www.boost.org/doc/libs/1_80_0/libs/tokenizer/doc/introduc.htm

The C stdlib function requires the delimiters to be specified; the Boost class has default punctuation and space delimiters that can be overridden.
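For illustration, a rough sketch of tokenizing a copy with std::strtok. The delimiter set here is arbitrary, and note that the delimiters themselves are discarded, so you would not get LPAREN/COMMA tokens this way:

#include <cstdio>
#include <cstring>
#include <string>

int main() {
    const std::string line = "CREATE TABLE students(name TEXT NOT NULL);";
    std::string copy = line;                    // strtok writes '\0' into its buffer, so work on a copy

    for (char* tok = std::strtok(&copy[0], " \t\n(),;");
         tok != nullptr;
         tok = std::strtok(nullptr, " \t\n(),;"))
        std::printf("%s\n", tok);
}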
C/C++ stdlibs already have string tokenization functionality provided by the <string.h>/<cstring> library.


I know there are a ton of resources out there that make developing lexers and tokenizers much easier, just like @AbstractionAnon said.

But I'm more interested in physically doing it myself, for the experience and for the knowledge.

But thanks for showing me that the stdlib has its own tokenizer. I'll definitely use it in my future projects!
Personally I would use Boost's Tokenizer, for a couple of reasons.

1. Your code is C++

2. It has a default delimiter setup that can be modified as needed.

3. It is non-destructive to the string being tokenized.

4. It looks like it can work with both C strings (char arrays) and C++ strings. As long as the container can provide begin/end iterators we're good to go.

With C++ code I'd use a C++ string. You can get a C string from that as warranted.
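A minimal usage sketch, assuming Boost is installed; the delimiter choices below are just for illustration:

#include <boost/tokenizer.hpp>
#include <iostream>
#include <string>

int main() {
    const std::string sql = "CREATE TABLE students(name TEXT NOT NULL, lname TEXT NOT NULL);";
    // drop whitespace, but keep the punctuation characters as their own tokens
    boost::char_separator<char> sep(" \t\n", "(),;");
    boost::tokenizer<boost::char_separator<char>> tok(sql, sep);

    for (const std::string& t : tok)
        std::cout << t << '\n';
}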

I've not had a lot of experience using either tokenizer; it has been a while since I had the need.

I do fully understand about the do-something-to-learn "shiny thing." I do it a lot. Best way to broaden one's understanding of what can be done.

Mucking around with custom code makes me appreciate what the C/C++ stdlibs provide. Or select 3rd party libraries that are well documented and peer-reviewed.

Too often, though, a custom-crufted "solution" is used when the stdlibs or a peer-reviewed 3rd party lib should be, and it lacks sufficient "bullet-proofing" for when -- not if -- things go wrong.
Just to emphasise a point about strtok(): it is destructive. It changes the data upon which it operates. Also, you can't use strtok() on more than one set of data at once, as it maintains internal static data.

To quote the documentation:
"This function is destructive: it writes the '\0' characters in the elements of the string str. In particular, a string literal cannot be used as the first argument of std::strtok.

Each call to this function modifies a static variable: is not thread safe."
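A small sketch of that static-state pitfall: interleaving two scans makes the second one hijack the first.

#include <cstdio>
#include <cstring>

int main() {
    char a[] = "one two three";
    char b[] = "alpha beta";

    char* ta = std::strtok(a, " ");             // "one"
    char* tb = std::strtok(b, " ");             // "alpha" -- resets strtok's hidden position to b
    ta = std::strtok(nullptr, " ");             // continues inside b: yields "beta", not "two"

    std::printf("ta = %s, tb = %s\n", ta ? ta : "(null)", tb ? tb : "(null)");
}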
Shouldn't the isalpha() block that follows the switch be part of a switch default clause?

If you find, say, '(', then you deal with it, set the token and so on, exit the switch, and then check isalpha(). This is OK if there is a space, because isalpha() fails, the function exits, and the next time it is entered lexer_skip_whitespace() removes the space(s). But if there is no space between, say, the '(' and an alpha character, then isalpha() is true, the alpha token is extracted and overwrites the '(' token. Hence you are missing the symbol token!
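Something along these lines, reusing the member names from your own post (untested, just to show the shape -- the alpha handling can then never run after one of the symbol cases):

        default:
            if(isalpha(this->c)) {
                this->temp.clear();

                // consume the current run of alphabetic characters
                while(isalpha(lexer_peek())) {
                    this->temp.push_back(this->src[this->p]);
                    lexer_advance_char();
                }

                this->value = std::string(begin(this->temp), end(this->temp));
                this->keyword = lexer_check_if_keyword(this->value);

                for(int i = 0; i < TokenType.size(); i++) {
                    if(TokenType[i] == keyword)
                        token = Token(this->value.c_str(), TokenType[i].c_str());
                }
            }
            break;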

This is (more or less) the pattern I use for my own lexers:

#include <cctype>
#include <string_view>
#include <iostream>

// Collapse each character into a group: alphabetic characters all map to
// "alpha", whitespace to "space", and anything else is its own group.
enum class character_group: unsigned int { space = -2u, alpha = -1u, };
[[nodiscard]] static character_group classify(unsigned char c)
{
  return std::isalpha(c)? character_group::alpha:
         std::isspace(c)? character_group::space: character_group(c);
}

class lexer
{
  char const *beg, *end, *buf_end; 

public:
  explicit lexer(std::string_view sv)
    : beg{ sv.data() }, end{ sv.data() }, buf_end{ sv.data() + sv.size() } {}

  // Returns the next maximal run of same-group characters, so identifiers and
  // whitespace runs come out whole while symbols come out one at a time.
  [[nodiscard]] std::string_view next_lexeme()
  {
    if (beg = end; end != buf_end)
      switch (character_group g = classify(*beg))
      {
        case character_group::space: 
        case character_group::alpha:
          do ++end; while (end != buf_end && g == classify(*end)); break; 
        default: ++end; break;
      }
    return std::string_view(beg, end - beg); 
  }
};

int main()
{
  for (std::string line; std::getline(std::cin, line); )
  {
    lexer my_lexer { line };

    while (true) 
      if (std::string_view lexeme = my_lexer.next_lexeme(); lexeme.size()) 
        std::cout << '\"' << lexeme << "\"\n"; 
      else break;
  }
}


    else if(isalpha(this->c)) {
        this->temp.clear();

        while(isalpha(lexer_peek())) {
            this->temp.push_back(this->src[this->p]);
            lexer_advance_char();
        }


It looks like you're expecting this->c, lexer_peek() and this->src[this->p] to all return the same value. That's some very fragile design. You should decide how you're going to access "the next character" and use it consistently.
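For instance, one option (the names here are made up, not your API) is to funnel every access through a tiny cursor type and never touch src[p] directly in the token-building code:

#include <cstddef>
#include <string>

// Hypothetical cursor wrapper: if every part of the lexer goes through these
// three functions, "the current character" and "the next character" always
// mean the same thing everywhere.
struct Cursor {
    std::string src;
    std::size_t p = 0;

    char current() const { return p < src.size() ? src[p] : '\0'; }
    char peek()    const { return p + 1 < src.size() ? src[p + 1] : '\0'; }
    char advance()       { if (p < src.size()) ++p; return current(); }
};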