I use ocamllex and ocamlyacc as the parser generator. They are doing fine jobs, although people recommend Menhir a lot. It turns out that there are just a few error patterns that I need to know before working with ocamlyacc effectively. Here is a summary of debugging the lexer and parser.
Empty token
Fatal error: exception Failure("lexing: empty token")
means that some token in the program does not match any of the regular expressions in our lexer.
I don’t have good solution for this. One simple solution is to print out matched tokens as the lexer goes. It shall be also easy to catch this exception and print out the lexbuf when errors occur. Simply printing information works just fine for debugging the lexer.
pares_error exception
parse_error exception seems useless: it only takes a string in its construct; there is no position information in it. How can we provide a better error message with it?
To provide better error message, we need to access the lexbuf for the current position and current reading symbol. Remember the lexbuf we pass to the parser and lexer? We need to catch errors there.
Let’s see the implementation in tiger compiler. What you need to do is
(** [row col token] indicates where the error happens. *)
exception ParseError of int * int * string
(** [parse_string lexbuf] parses the given lexbuf and returns
tiger abstarct syntax. @raise ParseError when parser
fails. *)
let parse_lexbuf lexbuf : S.exp =
try
prog tokenize lexbuf
with
| Parsing.Parse_error ->
let pos = lexbuf.Lexing.lex_start_p in
let row = pos.Lexing.pos_lnum in
let col = pos.Lexing.pos_cnum - pos.Lexing.pos_bol in
raise (ParseError(row, col, Lexing.lexeme lexbuf))
| Failure msg -> failwith msg
prog
is yacc generated entry point of our parser, and tokenize
is the lexer. If parsing raises Parse_error
at any point, we catch it and extract useful position and content information from lexbuf
.
It’s useful to propagate Failure exception of other runtime errors.
Debug the parser
I find the following methods pretty productive during debugging ocamlyacc.
The first is to turn on parser trace when parsing a program
export OCAMLRUNPARAM='p'
After turning this on, parsing
let var x := 12 in x
will output parsing information:
State 0: shift to state 1
State 1: read token LET
State 1: shift to state 9
State 9: read token VAR
State 9: shift to state 24
State 24: read token Id(x)
State 24: shift to state 61
State 61: read token COLON
State 61: shift to state 83
State 83: read token EQ
Discarding state 83
Discarding state 61
Discarding state 24
Discarding state 9
Discarding state 1
No more states to discard
But what are these states? Flag -v
comes to help. This flag should be passed to ocamlyacc. I use ocamlbuild
, so in my makefile, I do ocamlbuild -yaccflag -v main.byte
, and read all the transition states in _build/frontend/parser.output
.
ocamlyacc grammar
how is unary precedence normally solved
If you don’t specify the precedence of the unary operators in your language, you’ll normally get something like this:
shift/reduce conflict
expr : MINUS expr .
expr : expr . AND expr
To solve this, declare another symbol UMINUS
which has the highest precedence.
%left MINUS
...
%nonassoc UMINUS
Then overwrite the unary rule using %prec
.
expr:
| MINUS expr %prec UMINUS { ... }
By the way, there are limited resources about ocamlyacc. You won’t get many if you google ocamlyacc
. But ocamlyacc
is essentially the same as yacc
, except for the different syntax. Reading yacc (or bison) manual definitely helps.
non-terminal may cause shift/reduce conflict
The symptom here is even if you have already specified %left
for binary operators, ocamlyacc still complains about shift/reduce problem for the following rules:
expr:
| expr op expr
op:
| PLUS { ... }
| MINUS { ... }
op is used to match the PLUS, MIINUS, etc. You’ll get lots of conflict
65: shift/reduce conflict (shift 33, reduce 7) on PLUS
65: shift/reduce conflict (shift 34, reduce 7) on MINUS
65: shift/reduce conflict (shift 35, reduce 7) on TIMES
65: shift/reduce conflict (shift 36, reduce 7) on DIV
65: shift/reduce conflict (shift 37, reduce 7) on EQ
65: shift/reduce conflict (shift 38, reduce 7) on NEQ
65: shift/reduce conflict (shift 39, reduce 7) on LT
65: shift/reduce conflict (shift 40, reduce 7) on GT
65: shift/reduce conflict (shift 41, reduce 7) on LE
65: shift/reduce conflict (shift 42, reduce 7) on GE
state 65
expr : expr . op expr (7)
expr : expr op expr . (7)
state 33
op : PLUS .
state 34
op : MINUS .
If you write directly expr PLUS expr
, there should be no problem. The shift/reduce problem here is about shift the PLUS
(to state 33), or reduce expr op expr .
, but here op
is another non-terminal, and the precedence of a rule is given by the last token occurring on the right-hand side of that rule. Instead of write op
, you can write
expr:
| op { ... }
op:
| expr PLUS expr { ... }
| expr MINUS expr { ... }
Understanding the precedence of a rule
Tiger does not have an end mark to its while loop, so it is pretty easy to get a shift/reduce conflict. Take this example which has a relational operation inside the body of a while.
95: shift/reduce conflict (shift 34, reduce 16) on PLUS
expr : WHILE expr DO expr . (16)
op : expr . AND expr (19)
In this example, we want an expr
taking as long as it wants, so we want AND
to shift. There are two methods to solve this problem. The easiest one is to add an END
as part of the syntax of while
. Or we let expr
have higher precedence. The precedence of a rule is the last token occurring on the right-hand side of that rule, so it depends on DO
. Therefore, defining %right
for DO
will solve the problem.