Tiger Compiler Notes: Lexer and Parser

I use ocamllex and ocamlyacc as the parser generator. They are doing fine jobs, although people recommend Menhir a lot. It turns out that there are just a few error patterns that I need to know before working with ocamlyacc effectively. Here is a summary of debugging the lexer and parser.

Empty token

Fatal error: exception Failure("lexing: empty token")

means that some token in the program does not match any of the regular expressions in our lexer.

I don’t have good solution for this. One simple solution is to print out matched tokens as the lexer goes. It shall be also easy to catch this exception and print out the lexbuf when errors occur. Simply printing information works just fine for debugging the lexer.

pares_error exception

parse_error exception seems useless: it only takes a string in its construct; there is no position information in it. How can we provide a better error message with it?

To provide better error message, we need to access the lexbuf for the current position and current reading symbol. Remember the lexbuf we pass to the parser and lexer? We need to catch errors there.

Let’s see the implementation in tiger compiler. What you need to do is

(** [row col token] indicates where the error happens. *)
exception ParseError of int * int * string

(** [parse_string lexbuf] parses the given lexbuf and returns
    tiger abstarct syntax.  @raise ParseError when parser
    fails. *)
let parse_lexbuf lexbuf : S.exp =
  try
    prog tokenize lexbuf
  with
  | Parsing.Parse_error ->
     let pos = lexbuf.Lexing.lex_start_p in
     let row = pos.Lexing.pos_lnum in
     let col = pos.Lexing.pos_cnum - pos.Lexing.pos_bol in
     raise (ParseError(row, col, Lexing.lexeme lexbuf))
  | Failure msg -> failwith msg

prog is yacc generated entry point of our parser, and tokenize is the lexer. If parsing raises Parse_error at any point, we catch it and extract useful position and content information from lexbuf.

It’s useful to propagate Failure exception of other runtime errors.

Debug the parser

I find the following methods pretty productive during debugging ocamlyacc.

The first is to turn on parser trace when parsing a program

export OCAMLRUNPARAM='p'

After turning this on, parsing

let var x := 12 in x

will output parsing information:

State 0: shift to state 1
State 1: read token LET
State 1: shift to state 9
State 9: read token VAR
State 9: shift to state 24
State 24: read token Id(x)
State 24: shift to state 61
State 61: read token COLON
State 61: shift to state 83
State 83: read token EQ
Discarding state 83
Discarding state 61
Discarding state 24
Discarding state 9
Discarding state 1
No more states to discard

But what are these states? Flag -v comes to help. This flag should be passed to ocamlyacc. I use ocamlbuild, so in my makefile, I do ocamlbuild -yaccflag -v main.byte, and read all the transition states in _build/frontend/parser.output.

ocamlyacc grammar

how is unary precedence normally solved

If you don’t specify the precedence of the unary operators in your language, you’ll normally get something like this:

shift/reduce conflict
expr : MINUS expr .
expr : expr . AND expr

To solve this, declare another symbol UMINUS which has the highest precedence.

%left MINUS
...
%nonassoc UMINUS

Then overwrite the unary rule using %prec.

expr:
 | MINUS expr %prec UMINUS { ... }

By the way, there are limited resources about ocamlyacc. You won’t get many if you google ocamlyacc. But ocamlyacc is essentially the same as yacc, except for the different syntax. Reading yacc (or bison) manual definitely helps.

non-terminal may cause shift/reduce conflict

The symptom here is even if you have already specified %left for binary operators, ocamlyacc still complains about shift/reduce problem for the following rules:

expr:
 | expr op expr

op:
 | PLUS { ... }
 | MINUS { ... }

op is used to match the PLUS, MIINUS, etc. You’ll get lots of conflict

65: shift/reduce conflict (shift 33, reduce 7) on PLUS
65: shift/reduce conflict (shift 34, reduce 7) on MINUS
65: shift/reduce conflict (shift 35, reduce 7) on TIMES
65: shift/reduce conflict (shift 36, reduce 7) on DIV
65: shift/reduce conflict (shift 37, reduce 7) on EQ
65: shift/reduce conflict (shift 38, reduce 7) on NEQ
65: shift/reduce conflict (shift 39, reduce 7) on LT
65: shift/reduce conflict (shift 40, reduce 7) on GT
65: shift/reduce conflict (shift 41, reduce 7) on LE
65: shift/reduce conflict (shift 42, reduce 7) on GE
state 65
        expr : expr . op expr  (7)
        expr : expr op expr .  (7)
state 33
        op : PLUS .
state 34
        op : MINUS .

If you write directly expr PLUS expr, there should be no problem. The shift/reduce problem here is about shift the PLUS (to state 33), or reduce expr op expr ., but here op is another non-terminal, and the precedence of a rule is given by the last token occurring on the right-hand side of that rule. Instead of write op, you can write

expr:
  | op { ... }

op:
  | expr PLUS expr { ... }
  | expr MINUS expr { ... }

Understanding the precedence of a rule

Tiger does not have an end mark to its while loop, so it is pretty easy to get a shift/reduce conflict. Take this example which has a relational operation inside the body of a while.

95: shift/reduce conflict (shift 34, reduce 16) on PLUS
        expr : WHILE expr DO expr .  (16)
        op : expr . AND expr  (19)

In this example, we want an expr taking as long as it wants, so we want AND to shift. There are two methods to solve this problem. The easiest one is to add an END as part of the syntax of while. Or we let expr have higher precedence. The precedence of a rule is the last token occurring on the right-hand side of that rule, so it depends on DO. Therefore, defining %right for DO will solve the problem.