Looking for a YACC grammar for 8085 or any other assembly language

Hi,

I have started writing an assembler and need to write a Lex/YACC grammar for it.

I am looking for an existing grammar for any assembly language that I can use as a reference.

Thanks
sachin
Come on, it's not like Assembly is that hard to parse. It'd probably take you longer to look for the grammar file than to write it yourself.
Since I was a little bored, I decided to write a simple parser, with a few instructions only. I hope it will help you.

scanner (flex):
%{

#include "parser.h"

void yyerror(const char *);   /* defined in the parser file */

%}

%option noyywrap
%option pointer

CT       "//"[^\n]*
WS       [ \n\r\t]+
ID       [_a-zA-Z][_a-zA-Z0-9]*
NUM      [0-9]+

%%

{CT}     {}
{WS}     {}

"mov"    { return INSTR_MOV; }
"add"    { return INSTR_ADD; }
"inc"    { return INSTR_INC; }
"jmp"    { return INSTR_JMP; }

"eax"    { return REG_EAX; }
"ebx"    { return REG_EBX; }
"ecx"    { return REG_ECX; }
"edx"    { return REG_EDX; }

{ID}     { return IDENTIFIER; }
{NUM}    { return NUMBER; }

","      { return COMMA; }
":"      { return COLON; }
";"      { return SEMICOLON; }

.        { yyerror("syntax error"); }


parser (bison):
%{

#include <stdio.h>

int    yylex(void);
void   yyerror(const char *);
extern FILE * yyin;   /* defined by the flex scanner */

%}

%token INSTR_MOV INSTR_ADD INSTR_INC INSTR_JMP
%token REG_EAX REG_EBX REG_ECX REG_EDX
%token IDENTIFIER NUMBER COMMA COLON SEMICOLON

%start program

%%

register_name
   : REG_EAX
   | REG_EBX
   | REG_ECX
   | REG_EDX
   ;

label
   : IDENTIFIER COLON
   ;

instruction_mov
   : INSTR_MOV register_name COMMA register_name SEMICOLON
   | INSTR_MOV register_name COMMA NUMBER SEMICOLON
   ;

instruction_add
   : INSTR_ADD register_name COMMA register_name COMMA register_name SEMICOLON
   | INSTR_ADD register_name COMMA register_name COMMA NUMBER SEMICOLON
   ;

instruction_inc
   : INSTR_INC register_name SEMICOLON
   ;

instruction_jmp
   : INSTR_JMP IDENTIFIER SEMICOLON
   ;

instruction
   : instruction_mov
   | instruction_add
   | instruction_inc
   | instruction_jmp
   ;

program
   : instruction
   | label
   | program instruction
   | program label
   ;

%%

void yyerror(const char * message)
{
   fprintf(stderr, "%s\n", message);
}

int main(int argc, char **argv)
{
   if(argc == 2)
   {
      if((yyin = fopen(argv[1], "rb")) != NULL)
      {
         yyparse();
         fclose(yyin);
      }
   }

   return 0;
}


example source file:
// Some comment
start:
   mov eax, 10;
   mov ebx, 20;
   add ecx, eax, ebx;
   inc ecx;
   jmp start;


EDIT: I agree with helios, it took me like 5 minutes to write this parser.
GNU as supports Z-80 assembler which is a close relative of the 8085. I'd bet you could lift it from there as long as you don't mind licensing your code under the GPL.
Don't you think it'd be better to condense all instruction_* into a single non-terminal instruction that holds the string?
I'm not sure I understood you, helios, but if you are suggesting that the scanner shouldn't detect the exact type of instruction, then I don't agree, because then:

1) I will have to use IDENTIFIER for both instructions and labels, which is ugly,
2) I will have to verify that the arguments of every instruction are correct, and it's better when the parser does it,
3) most probably the scanner would detect the type of instruction much faster than I would.

Basically I think that it's best to do as much work as possible in the scanner/parser. I know that a huge list of instructions doesn't look nice, but you must place this list somewhere.
I guess we just have different opinions on how to write our languages. I prefer to leave the parser as little responsibility as possible.

1. If you think about it, they both are identifiers. I only need to change the syntax to demonstrate it: mov(eax,10)
2. I think the parser should only resolve the structure of the code. Checking that the structure makes sense should be left for whatever comes next.

There's also the problem that changes such as adding a new instruction or adding more operands to an instruction don't (or shouldn't) constitute a language change. For example, you don't need to rebuild a C compiler if you decide to change memset().
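To make the "parser only resolves structure" idea concrete, here is a minimal hand-written sketch in C rather than a regenerated grammar. It assumes the statement syntax of the example above (mnemonic, comma-separated operands, terminating semicolon); the names `struct statement`, `read_word` and `parse_statement` are made up for illustration. Note that it accepts any mnemonic, known or not.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_OPERANDS 3
#define MAX_TEXT     32

/* Hypothetical parse result: a mnemonic plus its operands, all kept as text. */
struct statement {
    char   mnemonic[MAX_TEXT];
    char   operands[MAX_OPERANDS][MAX_TEXT];
    size_t operand_count;
};

/* Copy one word (letters, digits, '_') from *src into dst, advancing *src.
   Returns the number of characters copied. */
static size_t read_word(const char **src, char *dst)
{
    size_t n = 0;
    while (isspace((unsigned char)**src))
        ++*src;
    while ((isalnum((unsigned char)**src) || **src == '_') && n < MAX_TEXT - 1)
        dst[n++] = *(*src)++;
    dst[n] = '\0';
    return n;
}

/* Parse "mnemonic op1, op2, ... ;" without knowing any instruction names.
   Returns 0 on success, -1 on a structural error. */
int parse_statement(const char *line, struct statement *out)
{
    out->operand_count = 0;
    if (read_word(&line, out->mnemonic) == 0)
        return -1;                       /* no mnemonic at all */

    for (;;) {
        while (isspace((unsigned char)*line))
            ++line;
        if (*line == ';')
            return 0;                    /* end of statement */
        if (out->operand_count > 0) {
            if (*line != ',')
                return -1;               /* operands must be comma separated */
            ++line;
        }
        if (out->operand_count == MAX_OPERANDS)
            return -1;                   /* too many operands for this sketch */
        if (read_word(&line, out->operands[out->operand_count]) == 0)
            return -1;                   /* missing operand */
        ++out->operand_count;
    }
}
```

With this, `parse_statement("mov eax, 10;", &s)` and `parse_statement("frobnicate eax;", &s)` both succeed; deciding that `frobnicate` is nonsense is left to a later step.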
1) OK, I agree.
2) But the parser resolves the structure of the code only. I think of assembler instructions as something similar to C++ keywords/operators. When an instruction is missing one of its operands, it's like a C++ operator missing one of its operands. It's a syntax error, which should be detected by the parser.

I agree that the compiler needs to be re-compiled (haha) every time you want to add or change an instruction. But seriously, would you really make an assembler compiler to use an external definition of available instructions?

I think that both approaches have their advantages and disadvantages. I just tend to like mine a little more...
"an assembler compiler"
A what now?

"would you really make an assembler compiler to use an external definition of available instructions?"
Not external. Parsing is just one of the steps in the compilation process. If I was writing an assembler, I'd write the parser as I stated above, and immediately afterwards detect nonsense such as nonexistent opcodes, or incorrect number of operands. Possibly at the same time as I'm generating code.
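A check like that, run right after parsing, could look roughly like the following C sketch. The table contents simply mirror the four instructions of the example grammar, and `struct instruction_info`, `instruction_table` and `check_instruction` are hypothetical names, not part of any existing assembler.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical compiled-in table: mnemonic and required operand count,
   matching the four instructions of the example grammar. */
struct instruction_info {
    const char *mnemonic;
    size_t      operand_count;
};

static const struct instruction_info instruction_table[] = {
    { "mov", 2 },
    { "add", 3 },
    { "inc", 1 },
    { "jmp", 1 },
};

enum check_result {
    CHECK_OK,
    CHECK_UNKNOWN_OPCODE,
    CHECK_WRONG_OPERAND_COUNT
};

/* Look the mnemonic up and compare the operand count found by the parser
   with the count the table requires. */
enum check_result check_instruction(const char *mnemonic, size_t operand_count)
{
    size_t n = sizeof instruction_table / sizeof instruction_table[0];
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(instruction_table[i].mnemonic, mnemonic) == 0)
            return instruction_table[i].operand_count == operand_count
                 ? CHECK_OK
                 : CHECK_WRONG_OPERAND_COUNT;
    }
    return CHECK_UNKNOWN_OPCODE;
}
```

Adding an instruction then means adding one row to the table; the grammar itself is untouched. Operand kinds (register vs. number vs. label) are not checked in this sketch.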
"A what now?"
well, you know what I mean...

"Not external"
So you still need to rebuild the compiler if any instruction changes.
Yes, but I don't need to regenerate the parser, which is what I meant when I said "language" earlier. The syntax.
Although now that you mention it, an external instruction table doesn't sound so bad. Think about it. You could use the same assembler to generate code for any processor.
Yes, you're right. This actually could work pretty well.

Of course, nothing comes without a price. In this case the price would be the need to implement a second parser, which processes the provided table and creates the appropriate rules for instructions. Simple in theory, but probably a lot of work to really support all processors. Think, for example, of the addressing modes in x86: a lot of different combinations, all of which must be handled.

However, the idea seems to be interesting.
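For the simplest possible case, that second parser might be as small as the C sketch below. The table format is invented here purely for illustration (one "mnemonic operand-count" pair per line, `#` for comment lines), as are the names `struct table_entry` and `load_table`. It deliberately ignores addressing modes and operand kinds, which is where the real work would be.

```c
#include <stdio.h>
#include <string.h>

struct table_entry {
    char mnemonic[16];
    int  operand_count;
};

/* Read entries from text in a made-up format: one "mnemonic count" pair
   per line, with '#' starting a comment line. Returns the number of
   entries stored, or -1 on a malformed line or a full table. */
int load_table(const char *text, struct table_entry *table, int capacity)
{
    int count = 0;
    while (*text != '\0') {
        /* copy the next line into a local buffer */
        size_t len = strcspn(text, "\n");
        char line[64];
        if (len >= sizeof line)
            return -1;
        memcpy(line, text, len);
        line[len] = '\0';
        text += len;
        if (*text == '\n')
            ++text;

        /* skip blank and comment lines */
        size_t first = strspn(line, " \t\r");
        if (line[first] == '\0' || line[first] == '#')
            continue;

        if (count == capacity)
            return -1;
        if (sscanf(line, "%15s %d",
                   table[count].mnemonic,
                   &table[count].operand_count) != 2)
            return -1;
        ++count;
    }
    return count;
}
```

Feeding it a text such as "mov 2", "inr 1", "jmp 1" on separate lines would fill three entries; switching to another processor would then mean supplying a different text, not rebuilding the assembler.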