Establishing command keywords in an interpreter in C++

Hello, all. I'm kinda new here, working on a new little thing. Having mostly coded in Python, I was curious about the nature of command prompts and the like, so I decided to try to make an interpreter in C++. It's been kind of... screwy to work with, as it were, so I'd like to fix the issues one at a time. First up, I'm trying to make a kind of user interface that accepts user-entered commands such as "help()", "read(<filename>)", and so on. However, I'm having a hard time getting it working with my current code, specifically when it comes to linking commands to keywords followed by parentheses with parameters inside. Here's the code that begins the execution of some basic commands received from user input. How can I clean this up to be more modular? Thanks.

getline(cin, userInput);
if (!userInput.empty())
{
	userLog.push_back(userInput);
	lexAnalysis.lineScan(userInput);
	if (userInput == "quit" || userInput == "quit()")
	{
		quit();
	}
	else if (userInput == "help" || userInput == "help()")
	{
		help();
	}
	else if (userInput == "clear" || userInput == "clear()")
	{
		clear();
	}
	else if (userInput == "show" || userInput == "show()")
	{
		show();
	}
	else if (userInput.substr(0, 5) == "read(")
	{
		auto firstPos = userInput.find('(');
		auto lastPos = userInput.find(')');
		if (lastPos != string::npos && lastPos > firstPos + 2)
		{
			// skip the opening (" and stop before the closing ")
			string fileName = userInput.substr(firstPos + 2, lastPos - firstPos - 3);
			read(fileName);
		}
	}
}
Your best bet is to parse the input into tokens.
So if they type
read( filename )
you can use ( as a delimiter and ignore spaces, or use spaces as delimiters, or whatnot.
I'd also advise allowing "" to indicate that everything inside is literal, so that, for example, file names with spaces in them are allowed.
Regardless, you would parse it into strings like read, filename
and then
if (token1 == "read")
    filedata = read(token2); // your function read, accepting the one parameter, returning the file contents...
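The splitting step above can be sketched in a few lines. This is a minimal illustration, not the poster's actual code; the function name `splitCommand` is my own, and it treats `(`, `)`, and spaces as delimiters as described.

```cpp
#include <string>
#include <vector>

// Split "read(filename)" into {"read", "filename"}, treating '(' , ')'
// and spaces as delimiters that separate tokens and then disappear.
std::vector<std::string> splitCommand(const std::string& input)
{
    std::vector<std::string> tokens;
    std::string current;
    for (char c : input)
    {
        if (c == '(' || c == ')' || c == ' ')
        {
            if (!current.empty())       // a token just ended; save it
            {
                tokens.push_back(current);
                current.clear();
            }
        }
        else
            current += c;
    }
    if (!current.empty())               // flush a trailing token, if any
        tokens.push_back(current);
    return tokens;
}
```

With the tokens in hand, the dispatch becomes `if (tokens[0] == "read") read(tokens[1]);` and so on.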

you can get rid of all this else if chaining, but it would likely require learning some additional tricks off the beaten path. Get it working like this first, but be aware you can condense that down if you want to.
@jonnin

Thanks for the insight, I'll try rewriting into that format. Should clean things up nicely.
If you parse into tokens as part of a lexical phase, then a statement such as:

 
read(filename)


becomes 4 tokens:

read
(
filename
)


But rather than having a token containing a name, it can contain an enum value. So say read is enum 1, write is 2, ( is 3, ) is 4, <name> 5, <num> 6 etc. Then you get

1 3 5 4 for tokens

If you have an optional value associated with a token, then for say <name> this can contain an 'index' into a name table, <num> contains its value etc etc etc.

Once you have the tokens, you can do syntax analysis based upon the token value - which is easier than doing it on strings etc. If you design things right, you only need to know at most the following token to successfully analyse the line (one token look-ahead).
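To make the enum idea concrete, here's one possible shape for it. The enum names and the `classify` helper are my own illustration; the numbering (read = 1, write = 2, ( = 3, ) = 4, name = 5, num = 6) follows the post above.

```cpp
#include <string>

// Token kinds, numbered as in the post: read=1, write=2, (=3, )=4, name=5, num=6.
enum class TokenKind { Read = 1, Write, LParen, RParen, Name, Num };

struct Token
{
    TokenKind kind;
    std::string text;   // the lexeme, useful for Name tokens
    int value = 0;      // optional payload, e.g. a Num's numeric value
};

// Map a lexeme to its kind; anything unrecognised is treated as a Name.
TokenKind classify(const std::string& lexeme)
{
    if (lexeme == "read")  return TokenKind::Read;
    if (lexeme == "write") return TokenKind::Write;
    if (lexeme == "(")     return TokenKind::LParen;
    if (lexeme == ")")     return TokenKind::RParen;
    return TokenKind::Name;
}
```

So `read(filename)` lexes to the kinds 1 3 5 4, exactly the sequence described above, and the parser compares small integers instead of strings.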
To avoid confusion about "becomes 4 tokens": I was saying () were delimiters, which can be considered a type of token. I think of them distinctly because delimiters and whitespace often vanish (e.g. getline with '(' as the delimiter) before the parser sees them.

----------
Yes, converting the input strings to integers is a big help. Then you can switch() on them to choose what to do, or you can map them to a function pointer to call the right function. If all the functions accept the same input, say your remaining token list, this reduces the if/else block to a single line of code. Passing the token list lets each function take what it needs, e.g. file name, binary or text mode, other options; read() could access the next N tokens if needed.
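The map-to-a-function idea might look like the sketch below. Everything here is hypothetical (the command set, handler names like `doRead`, and the "handlers return a string" convention are my assumptions), but it shows how a table lookup replaces the else-if chain.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Every handler gets the tokens after the keyword and takes what it needs.
using TokenList = std::vector<std::string>;
using Handler = std::function<std::string(const TokenList&)>;

std::string doRead(const TokenList& args)
{
    return args.empty() ? "error: missing file name" : "reading " + args[0];
}

std::string doHelp(const TokenList&) { return "help text"; }

std::string dispatch(const TokenList& tokens)
{
    // One table replaces the whole if/else chain.
    static const std::map<std::string, Handler> commands = {
        { "read", doRead },
        { "help", doHelp },
    };
    if (tokens.empty()) return "error: empty input";
    auto it = commands.find(tokens[0]);
    if (it == commands.end()) return "error: unknown command";
    // Forward everything after the keyword to the chosen handler.
    return it->second(TokenList(tokens.begin() + 1, tokens.end()));
}
```

Adding a new command is then just one more line in the table, with no change to the dispatch logic.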

I'm a bit fried at the moment, and the token tutorials don't necessarily help in my sleep-deprived state. Could you possibly walk me through the token-splitting in code?

For example, if I established a previous string variable as shown down below, how would I split it into tokens?

string exampleString = "read(\"filename\")";
There's absolutely loads of info on the Internet re lexical analysis.

For a practical book, have a look at 'Crafting Interpreters' by Robert Nystrom
https://www.amazon.co.uk/gp/product/0990582930/

The approach taken can vary depending upon the language to be parsed. However, I would suggest that first you produce a syntax 'diagram' for the 'language'. As an example, look at these diagrams
https://www.google.com/search?q=syntax+diagram&client=firefox-b-d&tbm=isch&source=iu&ictx=1&vet=1&fir=x5dlHlWwSk-8dM%252CHKH4NISE0LexpM%252C_%253BDCz_R-y9zYbTUM%252Cxbur5A6t-CKZ0M%252C_%253BpBiL_8g4l3hxpM%252C5NjRPXrKg7LNnM%252C_%253BfK9q37oZdMDZqM%252CR3IRUd8h_Ct0KM%252C_&usg=AI4_-kQB46QklgefIkPbyLglu7FRkxZpng&sa=X&ved=2ahUKEwjt1oXHnbn1AhURTsAKHcEzA0IQ_h16BAgDEAE&biw=1473&bih=713&dpr=1.25#imgrc=pBiL_8g4l3hxpM
You can rig stringstream or substr to do it, or you can just DIY.
The DIY at least helps you understand what needs doing.
Here, we pick delimiters up front, then process each letter of input: if it is a delimiter, take the string we have found so far and save it, then continue looking for more. See how the delimiters go away in this approach?
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main()
{
	string ex = "read(\"filename\")";
	vector<string> toks;
	bool delims[256]{false};
	delims['\"'] = delims['('] = delims[')'] = true;
	string tmp;
	for (char c : ex)
	{
		if (delims[(unsigned char)c])   // index with unsigned char to be safe
		{
			if (tmp.length())
				toks.push_back(tmp);
			tmp = "";
		}
		else tmp.push_back(c);
	}
	if (tmp.length())                   // flush a trailing token, if any
		toks.push_back(tmp);
	for (auto& a : toks)
		cout << a << endl;
}
Yeah - but read)"filename"( would also be accepted...
Correct, it does not validate. The smarter you make it, and the more complicated your language, the more work you will need to validate and interpret it. Allowing () adds complexity.

My answer would be "do not allow ()" or other double-symbol tokens. If you insist, you need a counter: when you see ( do paren++, and when you see ) do paren--. If it's not zero at the end of the line, it's bad, and if it ever goes negative, it's bad (unless you allow spanning lines, like {} in C++, in which case you'll probably need something much more substantial than a one-pass, mindlessly simple letter crawler).

#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main()
{
	string ex = "read(\"filename\")";
	vector<string> toks;
	bool delims[256]{false};
	delims['\"'] = delims['('] = delims[')'] = true;
	int parenctr{};
	string tmp;
	for (char c : ex)
	{
		if (delims[(unsigned char)c])
		{
			parenctr += (c == '(');
			parenctr -= (c == ')');
			if (parenctr < 0) { cout << "error"; return 0; } // a ) arrived before its (
			if (tmp.length())
				toks.push_back(tmp);
			tmp = "";
		}
		else tmp.push_back(c);
	}
	// check balance once the whole line is consumed; it should be
	// zero here, if each line is self contained
	if (parenctr != 0) { cout << "error"; return 0; }
	for (auto& a : toks)
		cout << a << endl;
}


To which the observant coder might say "well, now it accepts read("fubar)"...
... so you need a quote counter too, if you care. Windows accepts one quote in many commands; it just is a token that tells cmd that a multi-word string is coming.
It also accepts derpy () pairs, like (read()()(filename())), which may be gibberish in your new language, or valid, but this overly simple parser will certainly take it.

If you want to get really fancy, this may be a place for regex.
If you want to just get it done, maybe you should use a compiler-compiler or similar tool.
If you want to do it yourself using string manipulation from the ground up, AND you want it smart and rich, you need to read up on better techniques. I use a lot of really dumb, really simple parsers like the one above because I am used to dealing with machine spew (e.g. a device that spits out JSON or XML or NMEA etc.) where the data is very unlikely to contain 'coding errors', so a simple run-through parser works.
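The quote counter mentioned above can be folded into the same single pass. Here's a minimal sketch of balance-only validation (my own illustration, separate from the code above): quotes toggle a flag, parens inside quoted text are ignored, and a `)` may never outrun its `(`. As noted, this still accepts derpy-but-balanced () pairs; rejecting those is a grammar question, not a balancing one.

```cpp
#include <string>

// Validate balance only: parens must nest, quotes must come in pairs.
bool balanced(const std::string& line)
{
    int parens = 0;
    bool inQuote = false;
    for (char c : line)
    {
        if (c == '"')
            inQuote = !inQuote;                 // toggle on each quote
        else if (!inQuote)                      // ignore parens inside strings
        {
            if (c == '(') ++parens;
            if (c == ')' && --parens < 0)
                return false;                   // ) closed before its (
        }
    }
    return parens == 0 && !inQuote;             // everything opened was closed
}
```

This catches both earlier counter-examples: `read)"filename"(` fails on the early `)`, and `read("fubar)` fails because the quote never closes (the `)` is swallowed by the open string).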
Are you going to do lexical analysis to generate tokens and then syntax analysis to check the syntax of what was entered - or are you going to 'combine' these and make lexing part of the syntax pass? The first generates, say, a vector of tokens as a first pass, and then a second pass analyses the tokens. The second uses a function like getNext(), which calls the lexical part to create and return the next token from the input string. This method assumes you can syntax-check with just, say, 1 token of look-ahead.
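An on-demand getNext()-style lexer with one token of look-ahead might be shaped like this. The class and method names are illustrative assumptions; it returns an empty string at end of input, treats `(`, `)`, and `"` as one-character tokens, and skips spaces.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Pull tokens from the line one at a time; peek() gives one-token look-ahead.
class Lexer
{
public:
    explicit Lexer(std::string src) : src_(std::move(src)) {}

    std::string peek()                  // look ahead without consuming
    {
        if (!lookahead_) lookahead_ = scan();
        return *lookahead_;
    }

    std::string next()                  // consume and return one token
    {
        std::string t = peek();
        lookahead_.reset();
        return t;
    }

private:
    std::string scan()
    {
        while (pos_ < src_.size() && src_[pos_] == ' ') ++pos_;  // skip blanks
        if (pos_ >= src_.size()) return "";                      // end of input
        char c = src_[pos_];
        if (c == '(' || c == ')' || c == '"')                    // one-char tokens
            return std::string(1, src_[pos_++]);
        std::string word;                                        // a name/keyword
        while (pos_ < src_.size() && src_[pos_] != ' '
               && src_[pos_] != '(' && src_[pos_] != ')' && src_[pos_] != '"')
            word += src_[pos_++];
        return word;
    }

    std::string src_;
    std::size_t pos_ = 0;
    std::optional<std::string> lookahead_;
};
```

A parser built on this just alternates peek() to decide what rule applies and next() to consume, which is exactly the one-token look-ahead scheme described above.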

IMO the first thing to do is to create the language design and diagram.

Topic archived. No new replies allowed.