One goal of the tiger compiler is to learn how compiler generates assembly code from its IR. Before working on i386 backend, I’ve made tiger compiler to generate code for Sparc. Generating Sparc code is straightforward: I spent roughly one week to study the architecture before writing the code. I386 is way complicated (measured by the time I’ve spent on architecture manuals), so the compiler’s i386 backend deserves another blog post.
In this post, I’ll briefly walk through how gcc generates assembly code for i386 architecture, and then name a few subtle problems that I’ve encountered when I wrote the backend, and what I’ve learned along the way.
Study How GCC Generates Code
The most interesting part is probably generating code for function body, so let’s first look at how gcc does these things:
- Preserve previous stack information
- Reserve a new stack frame
- Refer parameters in the function body
- Pop current stack frame
Here is a sample code:
int sum3(int a, int b, int c)
{
return a + b + c;
}
int main()
{
sum3(1, 2, 3);
}
Generating assembly with gcc -m32 -O0 -S test.c
:
sum3:
.LFB0:
pushl %ebp
movl %esp, %ebp
subl $16, %esp
movl 8(%ebp), %edx
movl 12(%ebp), %eax
addl %eax, %edx
movl 16(%ebp), %eax
addl %edx, %eax
movl %eax, -4(%ebp)
movl -4(%ebp), %eax
leave
ret
main:
.LFB1:
pushl %ebp
movl %esp, %ebp
pushl $3
pushl $2
pushl $1
call sum3
addl $12, %esp
movl $0, %eax
leave
ret
How is stacks information preserved
The actual control transfer occurs on call sum3
. Before the call, %ebp
points to some address ADDR
that defines the beginning of current stack frame.
+---------+
ADDR | ... | <--- EBP
+---------+
| 3 |
+---------+
| 2 |
+---------+
| 1 | <--- ESP
+---------+
Because call sum3
is equivalent to
push eip
jump sum3
Following instructions in sum3, we have the following traces
high call sum3 pushl %ebp movl %esp, %ebp subl $16, %esp
| +---+ +---+ +---+ +---+ +---+
| ADR|...|<-EBP |...|<-EBP |...|<-EBP |...| |...|
| +---+ +---+ +---+ +---+ +---+
| | 3 | | 3 | | 3 | | 3 | | 3 | EBP+16
| +---+ +---+ +---+ +---+ +---+-
| | 2 | | 2 | | 2 | | 2 | | 2 | EBP+12
| +---+ +---+ +---+- +---+ +---+
| | 1 |<-ESP | 1 | | 1 | | 1 | | 1 | EBP+8
| +---+ +---+ +---+ +---+ +---+-
| |EIP|<-ESP |EIP| |EIP| |EIP| EBP+4
| +---+ +---+ +---+ +---+
| |ADR|<-ESP |ADR|<-ESP,EBP |ADR|<-EBP
| +---+ +---+ +---+
| | | EBP-4
| +---+
| | | EBP-8
| +---+
| | | EBP-12
v +---+
low | |<-ESP
+---+
Notice that
- Function arguments reside in previous stack frame
- Previous EBP is pushed on the stack, pointed by current EBP
- Function local variables are between EBP and ESP
At this point, it is pretty clear that movl 8(%ebp), %edx
is to move 1 to %edx
, and movl 12(%ebp), %eax
is to move 2 to edx
. What about movl -4(%ebp), %eax
? That’s to move the final result of a+b+c
from a local variable (generated by the compiler) into EAX.
Pop the Stack Frame
After the computation, we’re about to return to previous location:
leave
ret
leave
instruction is to release the stack space used as local variables in a procedure, so it can be effectively interpreted as
movl %ebp, %esp
popl %ebp
After the leave
instruction, the stack state reverts back to the same state as call sum3
, and then ret
is equivalent to popl %eip
, which unwinds the stack state to what it was before the call sum3
, which has 3 arguments on the stack. To remove those arguments, it adds 12 to %esp
: addl $12,
%esp
.
Summary
So you can tell the calling process is just like playing a game with stack pointers (EBP and ESP):
- Caller pushes arguments onto its own frame: 3 arguments are pushed in our code
- Compiler generates code to reserves spaces for local variables in new frame: reserve 16 bytes in our code
- Because caller pushes arguments, its caller’s responsibility to remove these arguments: add 12 to ESP in our code
Compiler has God’s view of all local variables and arguments, so it knows how to refer to function arguments in previous stack frame.
Comparing with the SPARC architecture, the i386 style that to pass everything on stack looks slow.
Stack Alignment on macOS
What we’ve been talking in previous section is still not complete, but it’s sufficient for me to roll my own code generator at this stage. Now it’s pretty easy to roll your own code generator. Just pick up one calling convention and implement it faithfully.
In this section, I want to point out one subtle bug that takes me several hours to figure out.
Take the following assembly code as an example. The code is generated by the following program:
print("hello")
_tigermain
is the entry point of the program, _L3
is the function body, _exit2
is the epilogue. Function _print
is provided by the runtime that is linked into the final binary.
.global _tigermain
_tigermain:
push %ebp
movl %esp, %ebp
_L3:
subl $32, %esp
leal _str1, %eax
movl %eax, -4(%ebp)
mov -4(%ebp), %eax
pushl %eax
call _print
xor %eax, %eax
add $4, %esp
addl $32, %esp
jmp _exit2
_exit2:
popl %ebp
retl
.data
.globl _str1
_str1:
.long 5
.ascii "hello"
Notice it preserves 32 bytes on the stack, and before calling _print
, it pushes one argument on the stack, and restores ESP by adding 4 bytes afterwards. Then it adds 32 bytes to ESP and restores ESP.
All looks good. However, the same program works on Linux but doesn’t run on macOS. When I compile and run it on macOS, the following error occurs:
* thread #1: tid = 0x33d7ba, 0x960342f0 libdyld.dylib`misaligned_stack_error_, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x960342f0 libdyld.dylib`misaligned_stack_error_
libdyld.dylib`misaligned_stack_error_:
-> 0x960342f0 <+0>: movdqa %xmm0, 0x10(%esp)
0x960342f6 <+6>: movdqa %xmm1, 0x20(%esp)
0x960342fc <+12>: movdqa %xmm2, 0x30(%esp)
0x96034302 <+18>: movdqa %xmm3, 0x40(%esp)
Uhh, misaligned stack error?
Let’s see what assembly code LLVM would generate from similar functions?
extern void bar(int, int);
void foo(int x, int y) {
bar(x, y);
}
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 11
.globl _foo
.align 4, 0x90
_foo: ## @foo
## BB#0:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
movl 12(%ebp), %eax
movl 8(%ebp), %ecx
movl %ecx, -4(%ebp)
movl %eax, -8(%ebp)
movl -4(%ebp), %eax
movl -8(%ebp), %ecx
movl %eax, (%esp)
movl %ecx, 4(%esp)
calll _bar
addl $24, %esp
popl %ebp
retl
Looks similar. No? It took me a while to see the difference. If you’re not familiar with the 16-bytes align rule, the only difference you may see here is probably using movl
to push arguments, and subtract 24 from esp
. Uh, why 24? and it seems using pushl would also do the same work. At least that’s what I was thinking.
Of courses there are reasons for using movl.
As I mentioned, function call requires stack 16-byte aligned, upon the entry of a function, before pushl %ebp
, %eip
has been pushed (by call
instruction), so %esp
is always 0xbffff...c
(28 aligned). Then we push %ebp
, %esp
is always 0xbffff...8
(24 aligned). then subtract 24 from esp, we get 32-bytes aligned, and thus 16-bytes aligned, stack.
Because the frame starts out with esp aligned, so it can never use pushl
to push arguments for callee.
How can we decide ahead of time the size of the stack?
Say we have function foo calls bar which takes 10 integer arguments. Previously, we just need to use pushl
to these arguments on the stack and ESP would increase as needed. Now as we can’t use pushl
to automatically increase stack (to prevent unaligned ebp and esp), we have to decide how large the stack would be when entering function foo.
The size of stack was previously decided by the number of local variables (and spills). Now we have to consider all the call sites. An easy way to do is to just scan through all the call sites and reserve the maximum stack. So if a caller calls three functions like this
bar2(1, 2);
bar3(1, 2, 3);
bar4(1, 2, 3, 4);
esp needs to reserve 16 bytes for 4 int arguments, which is the minimum size. Actually when you compile the program you’ll see
pushl %esi
subl $52, %esp
movl $1, %eax
movl $2, %ecx
movl $1, (%esp)
movl $2, 4(%esp)
movl %eax, -8(%ebp) ## 4-byte Spill
movl %ecx, -12(%ebp) ## 4-byte Spill
calll _bar2
movl $1, %eax
movl $2, %ecx
movl $3, %edx
movl $1, (%esp)
movl $2, 4(%esp)
movl $3, 8(%esp)
movl %eax, -16(%ebp) ## 4-byte Spill
movl %ecx, -20(%ebp) ## 4-byte Spill
movl %edx, -24(%ebp) ## 4-byte Spill
calll _bar3
movl $1, %eax
movl $2, %ecx
movl $3, %edx
movl $4, %esi
movl $1, (%esp)
movl $2, 4(%esp)
movl $3, 8(%esp)
movl $4, 12(%esp)
movl %eax, -28(%ebp) ## 4-byte Spill
movl %ecx, -32(%ebp) ## 4-byte Spill
movl %edx, -36(%ebp) ## 4-byte Spill
movl %esi, -40(%ebp) ## 4-byte Spill
9 spills, each taking 4 bytes, 36 bytes in total. The largest number of arguments is from bar4
, each argument taking 4 bytes, 16 bytes in total. Thus total 52 bytes ((9+4)x4) are reserved on stack. Again stack is aligned because eip, ebp, esi has been pushed and the stack size sums up to 16x4 bytes, which is 16-byte aligned.
By the way, another thing you probably have seen now is that the caller doesn’t remove arguments from the stack for the callee, well, simply because we don’t do push
anymore and %esp
won’t change. We’re saving instructions!
Summary
Writing a native code generation is a rewarding project! By understanding how stack is manipulated by the architecture, I could do some bizarre things using the C compiler.
To actually make tiger language a real functional language, I need to make the language to pass a function as a value. But before that, I need a garbage collector to manage the memory. This would be a long journey, but it would be definitely a fun journey!