Tiger Compiler Notes: I386 Backend

One goal of the tiger compiler is to learn how compiler generates assembly code from its IR. Before working on i386 backend, I’ve made tiger compiler to generate code for Sparc. Generating Sparc code is straightforward: I spent roughly one week to study the architecture before writing the code. I386 is way complicated (measured by the time I’ve spent on architecture manuals), so the compiler’s i386 backend deserves another blog post.

In this post, I’ll briefly walk through how gcc generates assembly code for i386 architecture, and then name a few subtle problems that I’ve encountered when I wrote the backend, and what I’ve learned along the way.

Study How GCC Generates Code

The most interesting part is probably generating code for function body, so let’s first look at how gcc does these things:

Preserve previous stack information
Reserve a new stack frame
Refer parameters in the function body
Pop current stack frame

Here is a sample code:

int sum3(int a, int b, int c)
{
        return a + b + c;
}

int main()
{
        sum3(1, 2, 3);
}

Generating assembly with gcc -m32 -O0 -S test.c:

sum3:
.LFB0:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $16, %esp
        movl    8(%ebp), %edx
        movl    12(%ebp), %eax
        addl    %eax, %edx
        movl    16(%ebp), %eax
        addl    %edx, %eax
        movl    %eax, -4(%ebp)
        movl    -4(%ebp), %eax
        leave
        ret
main:
.LFB1:
        pushl	%ebp
        movl	%esp, %ebp
        pushl	$3
        pushl	$2
        pushl	$1
        call	sum3
        addl	$12, %esp
        movl	$0, %eax
        leave
        ret

How is stacks information preserved

The actual control transfer occurs on call sum3. Before the call, %ebp points to some address ADDR that defines the beginning of current stack frame.

+---------+
ADDR |  ...    |  <--- EBP
     +---------+
     |    3    |
     +---------+
     |    2    |
     +---------+
     |    1    |  <--- ESP
     +---------+

Because call sum3 is equivalent to

push eip
jump sum3

Following instructions in sum3, we have the following traces

high                call sum3       pushl %ebp      movl %esp, %ebp     subl $16, %esp
 |      +---+          +---+           +---+           +---+               +---+
 |   ADR|...|<-EBP     |...|<-EBP      |...|<-EBP      |...|               |...|
 |      +---+          +---+           +---+           +---+               +---+
 |      | 3 |          | 3 |           | 3 |           | 3 |               | 3 | EBP+16
 |      +---+          +---+           +---+           +---+               +---+-
 |      | 2 |          | 2 |           | 2 |           | 2 |               | 2 | EBP+12
 |      +---+          +---+           +---+-          +---+               +---+
 |      | 1 |<-ESP     | 1 |           | 1 |           | 1 |               | 1 | EBP+8
 |      +---+          +---+           +---+           +---+               +---+-
 |                     |EIP|<-ESP      |EIP|           |EIP|               |EIP| EBP+4
 |                     +---+           +---+           +---+               +---+
 |                                     |ADR|<-ESP      |ADR|<-ESP,EBP      |ADR|<-EBP
 |                                     +---+           +---+               +---+
 |                                                                         |   | EBP-4
 |                                                                         +---+
 |                                                                         |   | EBP-8
 |                                                                         +---+
 |                                                                         |   | EBP-12
 v                                                                         +---+
low                                                                        |   |<-ESP
                                                                           +---+

Notice that

Function arguments reside in previous stack frame
Previous EBP is pushed on the stack, pointed by current EBP
Function local variables are between EBP and ESP

At this point, it is pretty clear that movl 8(%ebp), %edx is to move 1 to %edx, and movl 12(%ebp), %eax is to move 2 to edx. What about movl -4(%ebp), %eax? That’s to move the final result of a+b+c from a local variable (generated by the compiler) into EAX.

Pop the Stack Frame

After the computation, we’re about to return to previous location:

leave
ret

leave instruction is to release the stack space used as local variables in a procedure, so it can be effectively interpreted as

movl %ebp, %esp
popl %ebp

After the leave instruction, the stack state reverts back to the same state as call sum3, and then ret is equivalent to popl %eip, which unwinds the stack state to what it was before the call sum3, which has 3 arguments on the stack. To remove those arguments, it adds 12 to %esp: addl $12, %esp.

Summary

So you can tell the calling process is just like playing a game with stack pointers (EBP and ESP):

Caller pushes arguments onto its own frame: 3 arguments are pushed in our code
Compiler generates code to reserves spaces for local variables in new frame: reserve 16 bytes in our code
Because caller pushes arguments, its caller’s responsibility to remove these arguments: add 12 to ESP in our code

Compiler has God’s view of all local variables and arguments, so it knows how to refer to function arguments in previous stack frame.

Comparing with the SPARC architecture, the i386 style that to pass everything on stack looks slow.

Stack Alignment on macOS

What we’ve been talking in previous section is still not complete, but it’s sufficient for me to roll my own code generator at this stage. Now it’s pretty easy to roll your own code generator. Just pick up one calling convention and implement it faithfully.

In this section, I want to point out one subtle bug that takes me several hours to figure out.

Take the following assembly code as an example. The code is generated by the following program:

print("hello")

_tigermain is the entry point of the program, _L3 is the function body, _exit2 is the epilogue. Function _print is provided by the runtime that is linked into the final binary.

.global _tigermain
_tigermain:
    push %ebp
    movl %esp, %ebp
_L3:
    subl $32, %esp
    leal _str1, %eax
    movl %eax, -4(%ebp)
    mov -4(%ebp), %eax
    pushl %eax
    call _print
    xor %eax, %eax
    add $4, %esp
    addl $32, %esp
    jmp _exit2
_exit2:
    popl %ebp
    retl
.data
.globl _str1
_str1:
    .long 5
    .ascii "hello"

Notice it preserves 32 bytes on the stack, and before calling _print, it pushes one argument on the stack, and restores ESP by adding 4 bytes afterwards. Then it adds 32 bytes to ESP and restores ESP.

All looks good. However, the same program works on Linux but doesn’t run on macOS. When I compile and run it on macOS, the following error occurs:

* thread #1: tid = 0x33d7ba, 0x960342f0 libdyld.dylib`misaligned_stack_error_, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x960342f0 libdyld.dylib`misaligned_stack_error_
libdyld.dylib`misaligned_stack_error_:
->  0x960342f0 <+0>:  movdqa %xmm0, 0x10(%esp)
    0x960342f6 <+6>:  movdqa %xmm1, 0x20(%esp)
    0x960342fc <+12>: movdqa %xmm2, 0x30(%esp)
    0x96034302 <+18>: movdqa %xmm3, 0x40(%esp)

Uhh, misaligned stack error?

Let’s see what assembly code LLVM would generate from similar functions?

extern void bar(int, int);
void foo(int x, int y) {
      bar(x, y);
}

.section    __TEXT,__text,regular,pure_instructions
  .macosx_version_min 10, 11
  .globl  _foo
  .align  4, 0x90
_foo:                                   ## @foo
## BB#0:
  pushl   %ebp
  movl    %esp, %ebp
  subl    $24, %esp
  movl    12(%ebp), %eax
  movl    8(%ebp), %ecx
  movl    %ecx, -4(%ebp)
  movl    %eax, -8(%ebp)
  movl    -4(%ebp), %eax
  movl    -8(%ebp), %ecx
  movl    %eax, (%esp)
  movl    %ecx, 4(%esp)
  calll   _bar
  addl    $24, %esp
  popl    %ebp
  retl

Looks similar. No? It took me a while to see the difference. If you’re not familiar with the 16-bytes align rule, the only difference you may see here is probably using movl to push arguments, and subtract 24 from esp. Uh, why 24? and it seems using pushl would also do the same work. At least that’s what I was thinking.

Of courses there are reasons for using movl.

As I mentioned, function call requires stack 16-byte aligned, upon the entry of a function, before pushl %ebp, %eip has been pushed (by call instruction), so %esp is always 0xbffff...c (28 aligned). Then we push %ebp, %esp is always 0xbffff...8 (24 aligned). then subtract 24 from esp, we get 32-bytes aligned, and thus 16-bytes aligned, stack.

Because the frame starts out with esp aligned, so it can never use pushl to push arguments for callee.

How can we decide ahead of time the size of the stack?

Say we have function foo calls bar which takes 10 integer arguments. Previously, we just need to use pushl to these arguments on the stack and ESP would increase as needed. Now as we can’t use pushl to automatically increase stack (to prevent unaligned ebp and esp), we have to decide how large the stack would be when entering function foo.

The size of stack was previously decided by the number of local variables (and spills). Now we have to consider all the call sites. An easy way to do is to just scan through all the call sites and reserve the maximum stack. So if a caller calls three functions like this

bar2(1, 2);
bar3(1, 2, 3);
bar4(1, 2, 3, 4);

esp needs to reserve 16 bytes for 4 int arguments, which is the minimum size. Actually when you compile the program you’ll see

pushl   %esi
subl    $52, %esp
movl    $1, %eax
movl    $2, %ecx
movl    $1, (%esp)
movl    $2, 4(%esp)
movl    %eax, -8(%ebp)          ## 4-byte Spill
movl    %ecx, -12(%ebp)         ## 4-byte Spill
calll   _bar2
movl    $1, %eax
movl    $2, %ecx
movl    $3, %edx
movl    $1, (%esp)
movl    $2, 4(%esp)
movl    $3, 8(%esp)
movl    %eax, -16(%ebp)         ## 4-byte Spill
movl    %ecx, -20(%ebp)         ## 4-byte Spill
movl    %edx, -24(%ebp)         ## 4-byte Spill
calll   _bar3
movl    $1, %eax
movl    $2, %ecx
movl    $3, %edx
movl    $4, %esi
movl    $1, (%esp)
movl    $2, 4(%esp)
movl    $3, 8(%esp)
movl    $4, 12(%esp)
movl    %eax, -28(%ebp)         ## 4-byte Spill
movl    %ecx, -32(%ebp)         ## 4-byte Spill
movl    %edx, -36(%ebp)         ## 4-byte Spill
movl    %esi, -40(%ebp)         ## 4-byte Spill

9 spills, each taking 4 bytes, 36 bytes in total. The largest number of arguments is from bar4, each argument taking 4 bytes, 16 bytes in total. Thus total 52 bytes ((9+4)x4) are reserved on stack. Again stack is aligned because eip, ebp, esi has been pushed and the stack size sums up to 16x4 bytes, which is 16-byte aligned.

By the way, another thing you probably have seen now is that the caller doesn’t remove arguments from the stack for the callee, well, simply because we don’t do push anymore and %esp won’t change. We’re saving instructions!

Summary

Writing a native code generation is a rewarding project! By understanding how stack is manipulated by the architecture, I could do some bizarre things using the C compiler.

To actually make tiger language a real functional language, I need to make the language to pass a function as a value. But before that, I need a garbage collector to manage the memory. This would be a long journey, but it would be definitely a fun journey!