Parsing and Generating

closed account (S6k9GNh0)
Now, I've spent the past week or so thinking about parsing and generating. I've been recommended everything from char* manipulation to Bison/Flex to Spirit to something about tables I honestly don't understand. Would anyone care to either give a few suggestions or a good read? I'm currently reading the flex/bison manual for some of the concepts held throughout but it's limited.

I'm not directly interested in language parsing at the very moment although I would like to eventually address that as well.
Awesome. Me too. I guess I won't have to make a thread about it now.

One thing I don't understand is: why tokenize? Lexical and syntactic analysis are the same thing with slightly different output. Is tokenizing an optimization or a simplification? I've never seen anything refer to it as an optional step.

And speaking of optimizations, I don't think the traditional pipeline " text -> list of tokens -> parse tree -> ... " is faster than going " text -> ... " directly (where "..." implies that you're not going to compile/interpret your parse tree).
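For context on what the "list of tokens" step actually looks like, here is a minimal sketch of a hand-written tokenizer for a tiny arithmetic language. All names (`Kind`, `Token`, `tokenize`) are made up for illustration; the point is that the characters are scanned exactly once, in one linear pass, before any grammar rules run.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Token kinds for a tiny arithmetic language (illustrative names only).
enum class Kind { Number, Plus, Star, LParen, RParen };

struct Token {
    Kind kind;
    std::string text; // the lexeme, kept around for numbers
};

// One linear pass over the characters; the parser never re-scans them.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> out;
    for (std::size_t i = 0; i < src.size(); ) {
        char c = src[i];
        if (std::isspace(static_cast<unsigned char>(c))) { ++i; continue; }
        if (std::isdigit(static_cast<unsigned char>(c))) {
            std::size_t j = i;
            while (j < src.size() &&
                   std::isdigit(static_cast<unsigned char>(src[j]))) ++j;
            out.push_back({Kind::Number, src.substr(i, j - i)});
            i = j;
        } else {
            Kind k = c == '+' ? Kind::Plus
                   : c == '*' ? Kind::Star
                   : c == '(' ? Kind::LParen : Kind::RParen;
            out.push_back({k, std::string(1, c)});
            ++i;
        }
    }
    return out;
}
```

So `tokenize("(1+23)*4")` yields seven tokens, and the multi-character number "23" has already been grouped before the parser ever sees it.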

Though writing a general method to do it all in one step isn't going well for me. My idea is to have a stack of lists. A simple pattern, when matched, appends a character to the top list. A parser pattern pushes a new list, calls several other patterns, then calls a function that converts the top list into a single value, pops the stack, and appends that value to the new top list. I hope my explanation made some sense...
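One way that stack-of-lists idea could be sketched in code. Everything here (`Builder`, `match`, `reduce`) is a hypothetical name, not from any library, and the values are plain strings to dodge the type-safety question discussed below:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch of the stack-of-lists idea: matched characters accumulate in the
// top list; a sub-parser pushes a fresh list, fills it, reduces it to a
// single value, pops, and appends the result to the list below.
struct Builder {
    std::vector<std::vector<std::string>> stack{ {} };

    // A simple pattern appends a character to the top list.
    void match(char c) { stack.back().push_back(std::string(1, c)); }

    // A parser pattern starts by pushing a new list...
    void push() { stack.emplace_back(); }

    // ...and finishes by converting the top list to one value and popping.
    template <class Fn>
    void reduce(Fn fn) {
        std::string value = fn(stack.back());
        stack.pop_back();
        stack.back().push_back(value);
    }
};
```

A "parser pattern" would then be a call sequence like `push(); match('a'); match('b'); reduce(join);` leaving one combined value on the outer list.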

I'm worried about type safety. Right now I'm using a heterogeneous stack, but that's really restrictive. I guess I could derive all the constructs I need from a single base class, but then I'd have to either blindly trust my ability to write good BNFs or do type checking, which is a lot like parsing the thing twice.

edit: I could have stacks of stacks and let the compiler do the type checking, I guess, since it should all be static.
closed account (1vRz3TCk)
Parsing Techniques: A Practical Guide by Dick Grune would be a good book on the subject of parsing; however, it is not cheap.

You could go to the website[1] and grab the .pdf of the table of contents, and use that as a guide for researching on the net.


_______________
[1] http://www.springer.com/computer/swe/book/978-0-387-20248-8
@hamsterman: I did practically the same (also doing the tokenization thing). I have to add one more layer of "pre-tokenization": I am receiving strings from CGI, so I need to translate things such as
%281%2B1%29%2F3
to
(1+1)/3.
If you think about it, translating a CGI string into a civilized string is done with the same principles as "usual" parsing.
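For the curious, that pre-tokenization step is just percent-decoding. A minimal sketch, with hypothetical names and deliberately no handling of '+'-as-space or malformed input, which a real CGI decoder would also need:

```cpp
#include <cassert>
#include <string>

// Value of one hex digit; assumes the input is a valid hex character.
int hexVal(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return c - 'a' + 10;
}

// Minimal percent-decoder for CGI query strings: "%28" -> '('.
std::string percentDecode(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            out += static_cast<char>(hexVal(in[i + 1]) * 16 + hexVal(in[i + 2]));
            i += 2; // consumed the two hex digits as well
        } else {
            out += in[i];
        }
    }
    return out;
}
```

Like a lexer, it walks the input once, left to right, grouping a multi-character escape into a single output character.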

I think of the tokenization part as "simplifying things": tokens are quite compact (just ints, if I am not mistaken), and so are very easy to store in memory and pass around. Besides the token kind, I also keep the "value" of the token: for example, together with the token "digit", I store a value indicating which digit.
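That (kind, value) pairing can be as small as two ints. A sketch with made-up names and arbitrary kind codes:

```cpp
#include <cassert>

// A token as a small (kind, value) pair, as described above.
struct Token {
    int kind;   // e.g. 0 = digit, 1 = plus, ... (arbitrary codes)
    int value;  // for kind == 0 (digit): which digit was seen
};

// Turn a digit character into a digit token carrying its numeric value.
Token digitToken(char c) { return Token{0, c - '0'}; }
```

The parser then switches on `kind` and only consults `value` when the rule actually needs it.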

[Edit:] I never read any theory of parsing - instead I learned how to write a parser from this post from helios:
http://www.cplusplus.com/forum/lounge/11845/
(it contains a 207 line parser + evaluator of arithmetic expressions - +,*,/,^ and parentheses ).

closed account (S6k9GNh0)
Actually, I think tokenizing is both a simplification and an optimization. When you parse a line using raw rules, the parser goes back to the beginning of the line each time a rule fails. It doesn't have to do this with tokens, and even if it did, it wouldn't be as costly, since tokens are easier to deal with and the rules for a given section have already been applied and don't have to be tested or applied again. So instead of finding patterns in characters directly, you break the characters up into tokens and then look for patterns in the tokens. That's how I'm taking it, anyway.
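To make the backtracking point concrete, here is a sketch (hypothetical `matchSeq`, tokens as plain strings) of trying an alternative over a pre-built token vector: a failed rule just resets an index, and the characters are never lexed again, whereas matching over raw text would redo that character-level work on every retry.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Try to match a fixed sequence of tokens starting at pos.
// On failure, restore pos so another alternative can be tried cheaply.
bool matchSeq(const std::vector<std::string>& toks, std::size_t& pos,
              const std::vector<std::string>& pattern) {
    std::size_t save = pos;                 // cheap backtrack point
    for (const auto& p : pattern) {
        if (pos >= toks.size() || toks[pos] != p) {
            pos = save;                     // rewind over tokens, not chars
            return false;
        }
        ++pos;
    }
    return true;
}
```

With tokens, "go back to the beginning" is one integer assignment; the lexing work done to build the token vector is never repeated.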

That's what someone from ##spirit told me not too long ago...

EDIT: I went and bought "Parsing Techniques". I'm reading it now.
Topic archived. No new replies allowed.