Parsing a string into tokens

May 6, 2012 at 5:04pm
I need some way of parsing a string into tokens. I'm reading the example from a file.

EX:
BEGIN
a := 1 + (a) - 10;
END

where a, :, =, +, -, (, ), and ; are all tokens.
May 6, 2012 at 7:41pm
Tokenizing in C is quite simple:

1. Create a copy of the string (because it needs to be writeable memory).
2. Use strtok_s() or wcstok_s() (latter if you are using wide characters, and you should be) to tokenize.

I haven't tokenized in C++, but I think I once saw someone using the extraction operator from a string stream. I think you can tell the string stream which delimiters you want to use and then simply extract strings, which will be the tokens. Look it up.
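For reference, the closest standard facility I am sure of is std::getline with a delimiter argument, used on a std::istringstream. The sketch below assumes a single delimiter character, and the helper name split is made up for illustration:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split `text` at every occurrence of `delimiter`.
// The delimiter itself is discarded, and two adjacent delimiters
// produce an empty piece between them.
std::vector<std::string> split(const std::string& text, char delimiter)
{
    std::vector<std::string> pieces;
    std::istringstream stream(text);
    std::string piece;
    while (std::getline(stream, piece, delimiter))
        pieces.push_back(piece);
    return pieces;
}
```

Because the delimiter is thrown away, this is probably not enough on its own for input where characters such as + and ; must be kept as tokens.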
May 6, 2012 at 7:50pm
May 7, 2012 at 1:48am
If you know that cin already tokenizes on whitespace, what would change to make it tokenize on any character?
May 7, 2012 at 4:29am
In
BEGIN
abc := 9999 + (abc) - 1234 ;
END


Would abc be one token or three separate tokens a, b, c?

Would 9999 and 1234 be one token each or four tokens each?

And is := one token or two tokens : and =

In short, would this be fine?
BEGIN
ab:c = 99(99 + abc - )12;34 
END
May 7, 2012 at 5:18am
abc is one token, but : and = are separate tokens. 9999 is also one token.
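Those rules can be sketched as a character-by-character scan. This is only one possible reading: the function name tokenize is made up, and allowing digits inside a name after its first letter is an assumption.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: a run of letters/digits starting with a letter is one
// token, a run of digits is one token, and every other non-space character
// is a token by itself (so ":=" comes out as ":" and "=").
std::vector<std::string> tokenize(const std::string& text)
{
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < text.size())
    {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::isspace(c)) { ++i; continue; }   // whitespace only separates

        std::size_t end = i + 1;
        if (std::isalpha(c))        // assumption: names may contain digits
            while (end < text.size() &&
                   std::isalnum(static_cast<unsigned char>(text[end]))) ++end;
        else if (std::isdigit(c))
            while (end < text.size() &&
                   std::isdigit(static_cast<unsigned char>(text[end]))) ++end;

        tokens.push_back(text.substr(i, end - i));
        i = end;
    }
    return tokens;
}
```

With this sketch, "a := 1 + (a) - 10;" gives the eleven tokens a, :, =, 1, +, (, a, ), -, 10 and ;.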
May 7, 2012 at 5:39am
So you need to be able to recognize the tokens in the string first before you can start thinking about how to split the string into tokens.

Start with something like:

enum token_type { IDENTIFIER /* eg. 'abcd' */, CONSTANT /* eg. '12345' */, 
                  OPERATOR /* eg '+' */, TERMINATOR /* eg. ';' */ 
                  /* ... etc ... */, INVALID = -1 }; 

// what is the type of the token at the start of string str,
// and how long (how many characters) is it?
// precondition: str does not contain leading white space
// note: the match is 'greedy' - match as many characters as possible
// eg: str contains 'abc12:= 78 ;'
// result: token_type is IDENTIFIER, token_length is 5
// 
// eg: str contains '486)+'
// result: token_type is CONSTANT, token_length is 3
token_type recognize( const std::string& str, /* out */ std::size_t& token_length ) ;


Make sure it is working correctly, and then we can move on to extracting the token.

Hint: #include <regex>
http://en.cppreference.com/w/cpp/regex
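
One way the declaration above could be filled in with <regex> is sketched below. The character sets are assumptions: + - : = ( ) are treated as OPERATOR and ; as TERMINATOR, and anything else is INVALID with a length of 0.

```cpp
#include <cstddef>
#include <regex>
#include <string>

enum token_type { IDENTIFIER, CONSTANT, OPERATOR, TERMINATOR, INVALID = -1 };

// Sketch only: classify the token at the start of str (no leading white space)
// and report its length. regex_search is greedy, and match_continuous forces
// the match to begin at the first character.
token_type recognize(const std::string& str, std::size_t& token_length)
{
    static const std::regex identifier("[A-Za-z_][A-Za-z_0-9]*");
    static const std::regex constant("[0-9]+");
    const std::regex_constants::match_flag_type at_start =
        std::regex_constants::match_continuous;

    std::smatch match;
    if (std::regex_search(str, match, identifier, at_start)) {
        token_length = static_cast<std::size_t>(match.length(0));
        return IDENTIFIER;
    }
    if (std::regex_search(str, match, constant, at_start)) {
        token_length = static_cast<std::size_t>(match.length(0));
        return CONSTANT;
    }

    token_length = 0;
    if (str.empty()) return INVALID;
    if (std::string("+-:=()").find(str[0]) != std::string::npos) {
        token_length = 1;            // assumed operator set
        return OPERATOR;
    }
    if (str[0] == ';') {
        token_length = 1;            // assumed terminator
        return TERMINATOR;
    }
    return INVALID;
}
```

For the two examples in the comments, this returns IDENTIFIER with length 5 for 'abc12:= 78 ;' and CONSTANT with length 3 for '486)+'.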
Last edited on May 7, 2012 at 5:49am
May 8, 2012 at 1:27am
Is there a way to scan over an enum type to see if it matches a string?
May 8, 2012 at 12:27pm
> Is there a way to scan over an enum type to see if it matches a string?

A lookup table could be used:

enum colour_t { BLACK, RED, GREEN, BLUE } ;

const std::string colour_names[] =  { "BLACK", "RED", "GREEN", "BLUE" } ;
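
A scan over that table could look roughly like this. The names colour_count and to_colour are made up for illustration, and the sketch relies on the table being in the same order as the enumerators:

```cpp
#include <cstddef>
#include <string>

enum colour_t { BLACK, RED, GREEN, BLUE };

const std::string colour_names[] = { "BLACK", "RED", "GREEN", "BLUE" };
const std::size_t colour_count = sizeof(colour_names) / sizeof(colour_names[0]);

// Hypothetical helper: look the name up in the table. On success store the
// matching enumerator in result and return true; otherwise leave result
// untouched and return false.
bool to_colour(const std::string& name, colour_t& result)
{
    for (std::size_t i = 0; i < colour_count; ++i)
    {
        if (colour_names[i] == name)
        {
            result = static_cast<colour_t>(i);   // index i matches enumerator i
            return true;
        }
    }
    return false;
}
```

The same idea would work for a table of keywords or operators: loop over the names and compare each one with the string.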



If the question meant: is there a way to see if a string is an identifier?

Something like this would check if a string is a valid C++ identifier:
#include <string>
#include <cctype>

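// the <cctype> functions require a value representable as unsigned char
inline bool is_valid_first_char( char c )
{ return std::isalpha( static_cast<unsigned char>(c) ) || ( c == '_' ) ; }

inline bool is_valid_char( char c )
{ return is_valid_first_char(c) || std::isdigit( static_cast<unsigned char>(c) ) ; }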

bool is_valid_identifier( const std::string& str )
{
    if( str.empty() || !is_valid_first_char( str[0] ) ) return false ;
    for( std::size_t i = 1 ; i<str.size() ; ++i )
        if( !is_valid_char( str[i] ) ) return false ;
    return true ;
}

Last edited on May 8, 2012 at 12:38pm
May 8, 2012 at 5:43pm
Here's what I've got so far. It splits on every character as a token...

#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

string token_type[] = {"IDENTIFIER", "CONSTANT", "OPERATOR", "KEYWORD", "TERMINATOR"};
string special[] = {"(", ")", ":=", ";", ","};
string myoperator[] = {"+", "-"};
string mykeyword[] = {"BEGIN", "END", "READ", "WRITE"};
vector<string> mystring; // tokens found so far

bool compare(const string& text);

int main()
{
    string filename;
    cout << "Enter name of file: ";
    cin >> filename;
    ifstream indata(filename.c_str()); // open the file
    cout << endl;

    string line, temp, str1;
    // Loop on the extraction itself; testing eof() first can process the last word twice.
    while (indata >> line)
    {
        if (compare(line))
        {
            // the whole word is an operator, keyword or special symbol
            mystring.push_back(line);
            continue;
        }
        for (size_t r = 0; r < line.size(); r++)
        {
            temp = line.substr(r, 1);
            if (!compare(temp))
            {
                str1 += temp; // still inside a name or number
            }
            else
            {
                // a symbol ends the pending text; skip it if nothing is pending
                if (!str1.empty())
                    mystring.push_back(str1);
                str1 = "";
                mystring.push_back(temp);
            }
        }
        if (!str1.empty())
            mystring.push_back(str1);
        str1 = "";
    }
    for (size_t s = 0; s < mystring.size(); s++)
    {
        cout << mystring[s] << endl;
    }
    return 0;
}

// true if text is exactly one of the known operators, keywords or special symbols
bool compare(const string& text)
{
    for (int i = 0; i < 2; i++)
    {
        if (text == myoperator[i])
            return true;
    }
    for (int j = 0; j < 4; j++)
    {
        if (text == mykeyword[j])
            return true;
    }
    for (int k = 0; k < 5; k++)
    {
        if (text == special[k])
            return true;
    }
    return false;
}