I kind of get how this parser is supposed to work... you have tokens and rules, and a statement is only valid if it "equals" a rule. Each rule is, in turn, subdivided into tokens. Is that right? Also, taking this concept into account, would it be possible to recursively define a rule as a token?
There are two kinds of objects: terminals (also known as tokens), and non-terminals. A terminal is the smallest lexical unit: an identifier, a literal constant, a plus sign, etc. A non-terminal is a complex object made up of one or more objects. Examples of non-terminals are expressions, statements, code blocks, complete programs, etc. Parsing a stream of tokens involves, explicitly or implicitly, building a tree (an abstract syntax tree) whose internal nodes are non-terminals and whose leaves are terminals. For example, the input "1+2*3" can be translated to the following AST:
expr(expr(INTEGER),PLUS,expr(expr(INTEGER),MUL,expr(INTEGER)))
The process of building a non-terminal from one or more objects is called reducing. This is where the rules come in. A rule tells the parser exactly what sequence of objects can be reduced into what non-terminal, and under what conditions. For example, the sequence expr PLUS expr shouldn't be reduced to a single expr if the following token is MUL (because then 1+2*3 would parse as (1+2)*3).
My friend and I have gotten into some more detail and we're finally making some progress. :P I'm going to try to write a simpler LR parser first.
So.. reducing is basically converting more terminals into non-terminals, based on the defined rules?
EDIT:
You also said that a literal constant is one of the "smallest lexical units", but isn't a normal integer capable of being built out of more than one number (12 consists of 2 numbers, for example)?
So.. reducing is basically converting more terminals into non-terminals, based on the defined rules?
Not necessarily terminals. You can reduce a string of non-terminals into a single non-terminal.
isn't a normal integer capable of being built out of more than one number (12 consists of 2 numbers, for example)?
Don't confuse terms.
Number: a value that represents a distance from a point to the point of origin, generally on the Real line.
Numeral: a symbol used to represent a number. E.g. the character '5', or a 5 V potential on a conductor.
Digit: in a positional system, a position in [the string of numerals that represents a number]. E.g. in the decimal string "1024", '4' isn't a digit; '4' is the value that digit 0 (the ones position) holds.
A token is a unit of meaning. The characters '1' and '2' have no meaning in themselves, but when they appear adjacent to each other they create the string "12", which can have meaning. It can represent how many eggs are in a fridge, or a distance in meters. A single character can't do any of these unless it has been assigned a meaning on its own. E.g. in C, '/' means "division" when it appears by itself, but it loses that meaning when there's a '*' immediately after it.
It's possible to write a parser that treats numerals as tokens and then strings them together by reducing expressions, but in practice this is never done because it's much more efficient to let the lexer do it. And besides, treating literals as non-terminals is conceptually obtuse.
Man, I'm amazed how much knowledge you guys have. I never knew compiler theory existed until helios mentioned it.
This has been interesting for me. Can anyone tell me how and where to learn more about parsing and bytecode? Is a .swf file (Flash) considered bytecode?
Ok, the vast majority of people on the Internet (who aren't Chinese) are American. I thought I saw somewhere that Americans made up the majority of Internet users; perhaps it was just English-speaking Internet users and I didn't notice.
I'm going to try to get my hands on the Dragon Book in a minute. Helios, I re-checked your parser code, and what it does is becoming clearer by the second; just a few things:
You define this as your constructor for the Rule struct:
I understand that you reduce tokens to "types" (Token::INT or Token::NULL), to be able to handle them as these respective types if necessary. Now, what does the lookahead token do, what are the constraints, and what do the triple periods indicate?
Thanks in advance.
EDIT:
Figured out that the constraints indicate the number of arguments after that point, and that the lookahead indicates the precedence the rule has. Am I right on this? (I still want to know how to use the triple periods properly, though.)
That's handled by the lexer. You could define your operator as "is equal to" without changing a single line in the parsing algorithm (other than the lexer, obviously).
PS: Next time you need to say something else, either make a new post or delete your previous one. I only saw your edit by coincidence.
I just decided to pop down to this thread to wish Kyon good luck with his language. I'd also like to second the Dragon Book recommendation, as well as most of what helios said in his posts.
So the only thing the parser really does is reduce terminals to terminals/non-terminals, and throw an error now and then?
What is the difference between this->lookahead = la; and lookahead = la;?