Let’s start with the following example that yields the control in the middle of a function, and resume the control afterwards: main yields to plus10, passing in 99, and plus10 yields and passes back the result of 99 + 10. main prints out the result before it yields to plus10 again.
The interesting part is that the control can resume to the middle of a function, which differentiates between a function call and a non-preemptive control transfer.
#include <boost/context/detail/fcontext.hpp>
#include <iostream>
#include <memory>
namespace ctx = boost::context::detail;
void plus10(ctx::transfer_t transfer) {
int64_t v = (int64_t)transfer.data + 10;
ctx::jump_fcontext(transfer.fctx, (void *)v);
std::cout << "f1: return with transfer.data " << (int64_t)transfer.data
<< std::endl;
}
int main(int argc, char *argv[]) {
const int size = 64;
void *stackTop = std::malloc(size);
void *stackBase = (unsigned char *)stackTop + size;
auto fiberContext = ctx::make_fcontext(stackBase, size, &plus10);
auto transfer = ctx::jump_fcontext(fiberContext, (void *)99);
std::cout << "answer: " << (int64_t)transfer.data << std::endl;
ctx::jump_fcontext(transfer.fctx, (void *)77);
std::cout << "main: return" << std::endl;
return EXIT_SUCCESS;
}
If you save the program to test.cc, you can comiple it with g++ -std=c++17 test.cc -lboost_context
Run it:
$ ./a.out answer: 109 f1: return with transfer.data 99
Quick Overview
Boost definitions
In boost 1.65.1, the relative fcontext definitions are
typedef void* fcontext_t;
struct transfer_t {
fcontext_t fctx;
void * data;
};
extern "C" BOOST_CONTEXT_DECL
transfer_t BOOST_CONTEXT_CALLDECL jump_fcontext( fcontext_t const to, void * vp);
extern "C" BOOST_CONTEXT_DECL
fcontext_t BOOST_CONTEXT_CALLDECL make_fcontext( void * sp, std::size_t size, void (* fn)( transfer_t) );
make_fcontext prepares the environment in the buffer sp to later transfer control to fn, returns fcontext_t for later use by jump_fcontext.
jump_fcontext uses the information saved in fcontext_t to transfer control to the point associated with fcontext_t. The point could be the beginning of the function, or a point where jump_fcontext was previously called. jump_fcontext passes in data vp to fn, storing inside the data field of transfer_t, which also contains the source context in transfer_t.fctx.
Make Context and Jump to Context
Let’s take a look at the program and see the control transfer flow.
auto fiberContext = ctx::make_fcontext(stackBase, size, &plus10);
make_fcontext prepares a context to transfer the control to function plus10, and fiber_context stores the context.
auto transfer = ctx::jump_fcontext(fiberContext, (void *)99);
jump_fcontext saves information of current environment, and then jumps into plus10, passing 99 as an extra data.
int64_t v = (int64_t)transfer.data + 10;
ctx::jump_fcontext(transfer.fctx, (void *)v);
Inside plus10, its argument transfer carries 99 in data field, and fctx field saves the point where the jump_fcontext was called. Then plus10 does the math before jumping back to main:
std::cout << "answer: " << (int64_t)transfer.data << std::endl;
ctx::jump_fcontext(transfer.fctx, (void *)77);
main function prints out 110 before initiating another jump. This jump resumes the execution of plus10, and then the program exits as plus10 returns.
Details
We know that it’s easy to ask a program to jump to a function: simply changing %rbp to point to the base of a stack, %rsp to the top of the stack and %rip to the text area would make it happen (call call or jmp). What needs a bit more work is to save and restore registers, and to decide where to save registers.
Let’s take a look at how make_fcontext is implemented in the macho platform.
In main, compiler as usual generates all code for calling make_fcontext, where %rdi is used to pass the first argument (sp) into callee, and %rsi to pass the second argument (size). See the commented assembly below:
callq 0x555555554980 <malloc@plt>
mov %rax,-0x28(%rbp)
mov -0x28(%rbp),%rax # stackTop = std::malloc(size)
add $0x40,%rax # stackBase = stackTop + 64
mov %rax,-0x20(%rbp)
mov -0x20(%rbp),%rax
lea -0xb0(%rip),%rdx # rdx third argument: plus10 address
mov $0x40,%esi # rsi second argument: size
mov %rax,%rdi # rdi first argument: stackBase
callq 0x555555554970 <make_fcontext@plt>
What does make_fcontext look like?
_make_fcontext:
/* first arg of make_fcontext() == top of context-stack */
movq %rdi, %rax
/* shift address in RAX to lower 16 byte boundary */
andq $-16, %rax
/* reserve space for context-data on context-stack */
/* on context-function entry: (RSP -0x8) % 16 == 0 */
leaq -0x40(%rax), %rax
/* third arg of make_fcontext() == address of context-function */
/* stored in RBX */
movq %rdx, 0x28(%rax)
/* save MMX control- and status-word */
stmxcsr (%rax)
/* save x87 control-word */
fnstcw 0x4(%rax)
/* compute abs address of label trampoline */
leaq trampoline(%rip), %rcx
/* save address of trampoline as return-address for context-function */
/* will be entered after calling jump_fcontext() first time */
movq %rcx, 0x38(%rax)
/* compute abs address of label finish */
leaq finish(%rip), %rcx
/* save address of finish as return-address for context-function */
/* will be entered after context-function returns */
movq %rcx, 0x30(%rax)
ret /* return pointer to context-data */
trampoline:
/* store return address on stack */
/* fix stack alignment */
push %rbp
/* jump to context-function */
jmp *%rbx
finish:
/* exit code is zero */
xorq %rdi, %rdi
/* exit application */
call __exit
hlt
Let’s take a look at each line:
- movq %rdi, %rax: %rax refers to the first argument sp.
- andq $-16, %rax: Align %rax because function call requires stack 16-byte aligned. I discussed the 16-byte alignment in Tiger Compiler Notes: I386 Backend too.
- leaq -0x40(%rax), %rax: this is equivalent to %rax -= 0x40 in other high level languages. This is 64 bytes (or eight 64bit) space reserved.
- movq %rdx, 0x28(%rax): saves address of plus10 at sp-3
- stmxcsr and fnstcw: uses the first 8bytes, which is sp-8
- movq %rcx, 0x38(%rax): saves the trampoline to sp-1
- movq %cx, 0x(%rax): saves the finish to sp-2
If we follow a debugging session, the memory layout looks like
High
|(+0x40) 0x555555768eb0 | # stackBase
|(+0x38) 0x555555768ea8 | save trampoline addr
|(+0x30) 0x555555768ea0 | save finish addr
|(+0x28) 0x555555768e98 | save plus10's address
|(+0x20) 0x555555768e90 |
|(+0x18) 0x555555768e88 |
|(+0x10) 0x555555768e80 |
|(+0x08) 0x555555768e78 |
| %rax 0x555555768e70 | stmxcsr; fnstcw # stackTop
v
Low
Or to show it concretely, we can print out the address in gdb
(gdb) p stackTop $3 = (void *) 0x555555768e70 (gdb) p stackBase $4 = (void *) 0x555555768eb0 (gdb) p plus10 $5 = {void (boost::context::detail::transfer_t)} 0x555555554aca <plus10(boost::context::detail::transfer_t)> (gdb) x/8g stackTop 0x555555768e70: 0x0000037f00001f80 0x0000000000000000 0x555555768e80: 0x0000000000000000 0x0000000000000000 0x555555768e90: 0x0000000000000000 0x0000555555554aca <= &plus10 0x555555768ea0: 0x00007ffff7bd2e5f 0x00007ffff7bd2e5c
What about the empty space between %rax+0x08 to %rax+0x20? We’ll see when it comes to jump_fcontext:
_jump_fcontext:
leaq -0x38(%rsp), %rsp /* prepare stack */
#if !defined(BOOST_USE_TSX)
stmxcsr (%rsp) /* save MMX control- and status-word */
fnstcw 0x4(%rsp) /* save x87 control-word */
#endif
movq %r12, 0x8(%rsp) /* save R12 */
movq %r13, 0x10(%rsp) /* save R13 */
movq %r14, 0x18(%rsp) /* save R14 */
movq %r15, 0x20(%rsp) /* save R15 */
movq %rbx, 0x28(%rsp) /* save RBX */
movq %rbp, 0x30(%rsp) /* save RBP */
/* store RSP (pointing to context-data) in RAX */
movq %rsp, %rax
/* restore RSP (pointing to context-data) from RDI */
movq %rdi, %rsp
movq 0x38(%rsp), %r8 /* restore return-address */
#if !defined(BOOST_USE_TSX)
ldmxcsr (%rsp) /* restore MMX control- and status-word */
fldcw 0x4(%rsp) /* restore x87 control-word */
#endif
movq 0x8(%rsp), %r12 /* restore R12 */
movq 0x10(%rsp), %r13 /* restore R13 */
movq 0x18(%rsp), %r14 /* restore R14 */
movq 0x20(%rsp), %r15 /* restore R15 */
movq 0x28(%rsp), %rbx /* restore RBX */
movq 0x30(%rsp), %rbp /* restore RBP */
leaq 0x40(%rsp), %rsp /* prepare stack */
/* return transfer_t from jump */
/* RAX == fctx, RDX == data */
movq %rsi, %rdx
/* pass transfer_t as first arg in context function */
/* RDI == fctx, RSI == data */
movq %rax, %rdi
/* indirect jump to context */
jmp *%r8
jump_context needs to store all information of the calling environment before jumping into plus10, that’s why we have
leaq -0x38(%rsp), %rsp
movq %r12, 0x8(%rsp) /* save R12 */
movq %r13, 0x10(%rsp) /* save R13 */
movq %r14, 0x18(%rsp) /* save R14 */
movq %r15, 0x20(%rsp) /* save R15 */
movq %rbx, 0x28(%rsp) /* save RBX */
movq %rbp, 0x30(%rsp) /* save RBP */
/* store RSP (pointing to context-data) in RAX */
movq %rsp, %rax
jump_fcontext saves registers on the stack of the caller; in this case, it expands main‘s stack by 0x38 bytes, and saves registers on the stack!
The next line is important
/* restore RSP (pointing to context-data) from RDI */
movq %rdi, %rsp
movq 0x38(%rsp), %r8 /* restore return-address */
....
jmp *%r8
Remember %rdi stores the first argument, which is fcontext_t for jump_fcontext . If we print it out
(gdb) p/x $rdi $30 = 0x555555768e70
This is exactly the bottom address of what we’ve seen in make_fcontext. Now %rsp points to that address! What is 0x38(%rsp) there? The trampoline address!
Alright, %rsp is in place.
Before jumping to the trampoline (and later plus10), what else do we need to do? We need to restore all other saved registers
movq 0x8(%rsp), %r12 /* restore R12 */
movq 0x10(%rsp), %r13 /* restore R13 */
movq 0x18(%rsp), %r14 /* restore R14 */
movq 0x20(%rsp), %r15 /* restore R15 */
movq 0x28(%rsp), %rbx /* restore RBX */
movq 0x30(%rsp), %rbp /* restore RBP */
leaq 0x40(%rsp), %rsp /* prepare stack */
/* return transfer_t from jump */
/* RAX == fctx, RDX == data */
movq %rsi, %rdx
/* pass transfer_t as first arg in context function */
/* RDI == fctx, RSI == data */
movq %rax, %rdi
Given the above code, now let’s look at the memory layout of the moment when make_fcontext returns:
High
|(+0x40) 0x555555768eb0 | # stackBase
|(+0x38) 0x555555768ea8 | save trampoline addr
|(+0x30) 0x555555768ea0 | save finish addr # rbp
|(+0x28) 0x555555768e98 | save foo's address # rbx
|(+0x20) 0x555555768e90 |
|(+0x18) 0x555555768e88 |
|(+0x10) 0x555555768e80 |
|(+0x08) 0x555555768e78 |
| %rax 0x555555768e70 | stmxcsr; fnstcw # stackTop
v
Low
%rbx points to foo address. %rbp points to finish. %rsp points to stackTop. %rdi points to transfer_t.
When jump_fcontext jumps to trampoline
/* store return address on stack */
/* fix stack alignment */
push %rbp
/* jump to context-function */
jmp *%rbx
trampoline pushes %rbp, which points to finish so when ret executes later, finish will be restored into %rip, and jumps to %rbx, which points to plus10.
Now we know why the program just exits when plus10 exits because the ret pops into %rip the address of finish, which calls __exit.
Discussion
You may have noticed that
- We pass in stackTop + size instead of stackTop in make_fcontext
- The program doesn’t exit in main; it doesn’t print “main: return”
- The second jump tries to pass in 77, but plus10 resumes to print 99 in transfer.data
Stack pointer
make_fcontext has the following instruction to reserve the stack space:
leaq -0x40(%rax), %rax
The %rax is the argument sp.
Remember the memory grows from high to low? The %rax here is actually the base of the stack (high address), not the top of the stack (low address).
That’s why the code should use sp+size to reach the high address instead of passing in sp.
auto sp = std::malloc(size);
auto transfer = ctx::make_fcontext(sp+size, size, fn);
Stack Size
In our example, the stack size is only 64bytes, because plus10 doesn’t need a large stack. However, it is possible to overflow the preallocated heap if the function that the control transfer to has a lot of local variables.
Compiler will generate code to reserve space inside plus10 for all local variables of plus10. If stackBase is not large enough, the generated plus10 will cause overflow on the artifical stack stackBase.
Why do I call it artifical stack? Look at the stacktrace when jump_fcontext is called in plus10:
(gdb) bt #0 0x00007ffff7bd2e70 in jump_fcontext () from /usr/lib/x86_64-linux-gnu/libboost_context.so.1.65.1 #1 0x0000555555554b02 in plus10 (transfer=...) at simple.cc:9 #2 0x00007ffff7bd2e5f in make_fcontext () from /usr/lib/x86_64-linux-gnu/libboost_context.so.1.65.1 #3 0x0000000000000000 in ?? ()
Do you know why the bottom of the stack is not main? Usually %rbp points to previous %rbp so recursively lookup the %rbp in memory can find all stacks. Since we passed in make_fcontext an address on the heap, %rbp is pointing to the heap, and on heap the memory is all 0. That’s why you see the bottom of the stack is 0x0000000000000000.
Resume to Previous jump_fcontext
When we passed in 77, why did plus10 exit with the message:
f1: return with transfer.data 99
The explanation is easy, because the reference of transfer.data is on the stack.
Compiler has transformed this code
std::cout << "f1: return with transfer.data " << (int64_t)transfer.data
<< std::endl;
to
0x0000555555554b02 <+56>: lea 0x21f(%rip),%rsi # 0x555555554d28
0x0000555555554b09 <+63>: lea 0x201510(%rip),%rdi # 0x555555756020 <_ZSt4cout@@GLIBCXX_3.4>
0x0000555555554b10 <+70>: callq 0x555555554950 <_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@plt>
0x0000555555554b15 <+75>: mov %rax,%rdx
0x0000555555554b18 <+78>: mov -0x18(%rbp),%rax
0x0000555555554b1c <+82>: mov %rax,%rsi
0x0000555555554b1f <+85>: mov %rdx,%rdi
0x0000555555554b22 <+88>: callq 0x5555555549a0 <_ZNSolsEl@plt>
where -0x18(%rbp) is the reference for transfer.data. Data on the stack stays the same between jumps.
Speed
You may be like me, wondering how fast this would be at this moment.
Transfering the control 1 billion times using the following program takes about 7s (roughly 150 million times per second) on my old laptop which has 2.5GHz i5-3210M Dual-core processor.
#include <boost/context/detail/fcontext.hpp>
#include <iostream>
#include <memory>
namespace ctx = boost::context::detail;
void plus10(ctx::transfer_t transfer) {
uint64_t count = 0;
while (count < 1000000000L) {
count += 1;
ctx::jump_fcontext(transfer.fctx, nullptr);
}
std::cout << "Count is " << count << std::endl;
}
int main(int argc, char *argv[]) {
const int size = 64;
void *stackTop = std::malloc(size);
void *stackBase = (unsigned char *)stackTop + size;
auto fiberContext = ctx::make_fcontext(stackBase, size, &plus10);
auto transfer = ctx::jump_fcontext(fiberContext, nullptr);
while (true) {
transfer = ctx::jump_fcontext(transfer.fctx, nullptr);
}
std::cout << "main: return" << std::endl;
return EXIT_SUCCESS;
}
Summary
The implementation of non-preempty context switch is a well designed memory layout manipulation.
During make_fcontext, the memory is laid out such that jump_fcontext saves all registers on the stack of caller (main in our case) and jumps to trampoline, which then jumps to the target function plus10. When jumps back from the target function, jump_fcontext saves all registers on the stack of plus10, restores previous registers to continue to run main.