Tiger Compiler Notes: Normalize IR

2016-02-15
compiler

Now that we've gotten IR from the surface language, we're going to make it Canonical IR, and prepare for generating the program in assembly language.

Why is Canonical IR important? Canonical IR specifies which IR pattern is allowed and which isn’t. For any compiler implementation that has no canonical IR, it is too easy to introduce bugs for simple changes. Apparently the canonical IR here serves mainly as a way to ease assembly generation.

There are 3 passes in normalize IR phase:

  1. Linearize IR: remove ESEQ and SEQ. CALL is raised to a MOVE or EXP statement.
  2. Create Basic Blocks: create basic blocks from a series of instructions. The first instruction in a block is LABEL. The last instruction in a block is JUMP or CJUMP. There are no other labels or jumps in a basic block. A program becomes a directed graph after this pass.
  3. Trace blocks: rearrange basic blocks to make all CJUMP followed by the block containing the false branch. There are no JUMPs that are immediately followed by the block containing LABEL it jumps to.

It is pretty straightforward to normalize the IR. The philosophy behind the three passes is slightly vague.

The main reason for these passes is that target backends have no equivalent machine instructions, and thus IR have to degrade to its subset that can easily translate to machine instructions. Therefore, Normalize IR is to remove ESEQ and SEQ which has no equivalent machine instructions; trace is to preparing CJUMP to be able to fall through for false condition.

Still wonder why trace in the first place? Doesn’t the IR from basic block pass already follow each other? No. There is no guarentee that CJUMP will be followed by its false branch.

Trace must switch the true-branch and false-branch labels in CJUMP. Wonder why? Can’t you do it in the first place correctly?

No, it is not gaurenteed neither. Consider this:

...
   CJUMP(<=,
         MEM(BINOP(+,r7,-12)),
         MEM(BINOP(+,r7,-16)),
         true8,fi10)
fi10
    MOVE(r9,1)
    CJUMP(=,
          MEM(BINOP(+,r7,-4)),
          6,
          true11,false12)
false12
       MOVE(r9,0)
       JUMP(NAME(true11),true11)
true11
      MOVE(r10,CALL(NAME(assert),
                r9))
      ...
      JUMP(NAME(exit13),exit13)
true8
     JUMP(NAME(while_test6),while_test6)
while_test6
           CJUMP(<>,
                 1,
                 0,
                 while_body7,fi10)
while_body7
           MOVE(MEM(BINOP(+,r7,-4)),BINOP(+,MEM(BINOP(+,r7,-4)),1))
           CJUMP(<,
                 MEM(BINOP(+,r7,-12)),
                 MEM(BINOP(+,r7,-16)),
                 true3,while_done2)

Note that while_test6 is followed by its true-branch block while_body7, because fi10 has already been visited before while_test6, so it is impossible to make fi10 following while_test6. In this situation we have to reverse labels and its conditions of while_test6.

Here is another caveat:

true8
     JUMP(NAME(while_test6),while_test6)
while_test6
           CJUMP(<>,
                 1,
                 0,
                 while_body7,fi10)

This seems stupid. Can we join the two blocks by eliminating JUMP and while_test6 label? No. Because there is another block that also jumps to while_test6. The correct fix is to fix whoever jumps to true8 to jump to while_test6. Be aware that we can also remove the JUMP, but it breaks the invariant of basic blocks.

Note that in my implementation, I choose to break the invariant in the last pass. It’s better to eliminate the JUMP because such pattern will come again and again. For example

fi10
    MOVE(r9,1)
    CJUMP(=, MEM(BINOP(+,r7,-4)), 6, true11,false12)
false12
       MOVE(r9,0)
       JUMP(NAME(true11),true11)   <--- Eliminating this is easier.
true11
      MOVE(r10,CALL(NAME(assert), r9))