Lexeme Tokens

Hello, I'm having trouble returning the correct number of Tokens in a file.
For example, the input of the file is as follows:
"a regular string"
"skipping\nto a new line"
"this is a \\ backslash"
"\d\i\d\ \t\h\i\s\ \w\o\r\k\? ... \"maybe\""


The output should be:
Lines: 4                                                                                                                                                 
Tokens: 4 


However, my output for Tokens is 6. My output for Lines is correct.
So I printed the tokens the program reads, and this is what it shows:
SCONST(a regular string)                                                                                                                               
SCONST(skipping\nto a new line)                                                                                                                        
SCONST(this is a \\ backslash)                                                                                                                         
SCONST(\d\i\d\ \t\h\i\s\ \w\o\r\k\? ... )                                                                                                              
SCONST(maybe)                                                                                                                                          
SCONST()


I think the problem is that the backslash and the double quote after it are being skipped, when everything inside the outer quotes is supposed to be one whole string (SCONST). So when my program sees the escaped double quote, it ends the token there and starts a new one, and the pieces get counted separately.

This is the part of the lexical analyzer that handles this case.
lexeme += ch;

if (ch == '\n') {
    return Tok(ERR, lexeme, linenum);
}

if (ch == '\\') {
    in.get(ch);
    in.putback(ch);
}

if (ch == '"') {
    lexeme = lexeme.substr(1, lexeme.length() - 2);
    return Tok(SCONST, lexeme, linenum);
}
How are you supposed to get an output of 4 tokens from that input? Perhaps you could explain.
That's because each string (whatever is inside a pair of " ") counts as one whole token.
When you read the backslash, you then read the next character. This corresponds to escaping the special meaning of that character. However, you then put that character back in the stream. Why?

Instead, when you read the escape character ("\"), you want to read the next character and consider it as the read character. That is the character you want to place in lexeme.

You need to figure out the order of evaluation so you know that an escaped '"' is treated as a normal character and not the end of a token.

Edit, fixed escape character.
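
To make the ordering concrete, here is a minimal sketch of one way to scan the body of a string. It is not the original poster's function: readStringBody is a hypothetical helper, it assumes the opening '"' has already been consumed, and it assumes you want to keep the backslash in the lexeme (drop the marked line if you only want the escaped character). It also lets a backslash swallow a following newline, which may or may not be what your grammar wants.

```cpp
#include <istream>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper (an assumption, not code from the thread).
// Reads a string body after the opening '"' was consumed.
// Returns the contents, or std::nullopt if a newline or the end of
// input is reached before an unescaped closing quote.
std::optional<std::string> readStringBody(std::istream& in) {
    std::string body;
    char ch;
    while (in.get(ch)) {
        if (ch == '\\') {
            // Escape: consume the next character here, so an escaped '"'
            // never reaches the closing-quote test below.
            char next;
            if (!in.get(next)) {
                return std::nullopt;  // input ended right after the backslash
            }
            body += ch;    // assumption: keep the backslash; drop this line otherwise
            body += next;
        } else if (ch == '"') {
            return body;              // unescaped quote ends the token
        } else if (ch == '\n') {
            return std::nullopt;      // unterminated string on this line
        } else {
            body += ch;
        }
    }
    return std::nullopt;              // end of input before the closing quote
}
```

The key point is the else-if chain: the backslash test comes first and eats the following character, so the '"' test only ever sees unescaped quotes.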
Yes, @doug4, that's what I think too: a '"' that follows a backslash should not be treated as the end of a token.

I had something in mind like this,
if (ch == '\\' && in.peek() == '"') {
    return Tok(SCONST, lexeme, linenum);
}

The problem is that if I do that, the test case above skips this if statement and goes on to the if (ch == '\n') check, giving me an error.

So pull out a pencil and paper and figure out what characters you want placed in lexeme from various inputs. You need to figure out which special conditions you need to test first (probably '" first) and when to add the character to lexeme. Take your time and write it out. Don't expect one of us to do your analysis for you. When you figure out your strategy (also called 'design'), coding will be pretty straightforward.

Also remember to use 'else' clauses to bypass code when logically necessary.

Something else to consider: what happens when '\' is the last character of the line? You should probably skip the NL (and CR if on Windows) and move on to the next line, appending the next line to lexeme.
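
As a rough sketch of that last point, the check could look something like the following. The helper name skipLineContinuation and the idea of passing the line counter by reference are assumptions made for illustration; it is meant to be called right after a '\' has been read inside a string, and whether your assignment allows strings to continue across lines is something you need to check for yourself.

```cpp
#include <istream>
#include <sstream>

// Hypothetical helper (an assumption, not code from the thread).
// Call it right after reading a '\\' inside a string. If the backslash is
// followed by a line break ("\n" or "\r\n"), the break is consumed, linenum
// is incremented, and true is returned so the caller can keep appending the
// next line to the same lexeme. Otherwise the stream is left untouched.
bool skipLineContinuation(std::istream& in, int& linenum) {
    if (in.peek() == '\r') {
        in.get();
        if (in.peek() != '\n') {
            in.unget();   // lone CR: not a line break handled here, restore it
            return false;
        }
    }
    if (in.peek() == '\n') {
        in.get();         // consume the newline
        ++linenum;        // the token now spans one more line
        return true;
    }
    return false;
}
```

If it returns false, the character after the backslash is an ordinary escaped character and can be handled by whatever escape logic you already have.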

Parsing is not easy and takes quite a bit of thought and analysis.