Notes on Folly: Cooperative Context Switch

2019-09-12
c++

Folly implements fibers using boost fcontext. To understand folly::fibers, we'll need to understand boost fcontext first. This post explores how boost implements non-preemptive control transfers.

Let’s start with the following example that yields the control in the middle of a function, and resume the control afterwards: main yields to plus10, passing in 99, and plus10 yields and passes back the result of 99 + 10. main prints out the result before it yields to plus10 again.

The interesting part is that the control can resume to the middle of a function, which differentiates between a function call and a non-preemptive control transfer.

#include <boost/context/detail/fcontext.hpp>
#include <iostream>
#include <memory>

namespace ctx = boost::context::detail;

void plus10(ctx::transfer_t transfer) {
  int64_t v = (int64_t)transfer.data + 10;
  ctx::jump_fcontext(transfer.fctx, (void *)v);
  std::cout << "f1: return with transfer.data " << (int64_t)transfer.data
            << std::endl;
}

int main(int argc, char *argv[]) {
  const int size = 64;
  void *stackTop = std::malloc(size);
  void *stackBase = (unsigned char *)stackTop + size;

  auto fiberContext = ctx::make_fcontext(stackBase, size, &plus10);

  auto transfer = ctx::jump_fcontext(fiberContext, (void *)99);
  std::cout << "answer: " << (int64_t)transfer.data << std::endl;

  ctx::jump_fcontext(transfer.fctx, (void *)77);

  std::cout << "main: return" << std::endl;
  return EXIT_SUCCESS;
}

If you save the program to test.cc, you can comiple it with g++ -std=c++17 test.cc -lboost_context

Run it:

$ ./a.out
answer: 109
f1: return with transfer.data 99

Quick Overview

Boost definitions

In boost 1.65.1, the relative fcontext definitions are

typedef void*   fcontext_t;

struct transfer_t {
    fcontext_t  fctx;
    void    *   data;
};

extern "C" BOOST_CONTEXT_DECL
transfer_t BOOST_CONTEXT_CALLDECL jump_fcontext( fcontext_t const to, void * vp);
extern "C" BOOST_CONTEXT_DECL
fcontext_t BOOST_CONTEXT_CALLDECL make_fcontext( void * sp, std::size_t size, void (* fn)( transfer_t) );

make_fcontext prepares the environment in the buffer sp to later transfer control to fn, returns fcontext_t for later use by jump_fcontext.

jump_fcontext uses the information saved in fcontext_t to transfer control to the point associated with fcontext_t. The point could be the beginning of the function, or a point where jump_fcontext was previously called. jump_fcontext passes in data vp to fn, storing inside the data field of transfer_t, which also contains the source context in transfer_t.fctx.

Make Context and Jump to Context

Let’s take a look at the program and see the control transfer flow.

auto fiberContext = ctx::make_fcontext(stackBase, size, &plus10);

make_fcontext prepares a context to transfer the control to function plus10, and fiber_context stores the context.

auto transfer = ctx::jump_fcontext(fiberContext, (void *)99);

jump_fcontext saves information of current environment, and then jumps into plus10, passing 99 as an extra data.

int64_t v = (int64_t)transfer.data + 10;
ctx::jump_fcontext(transfer.fctx, (void *)v);

Inside plus10, its argument transfer carries 99 in data field, and fctx field saves the point where the jump_fcontext was called. Then plus10 does the math before jumping back to main:

std::cout << "answer: " << (int64_t)transfer.data << std::endl;
ctx::jump_fcontext(transfer.fctx, (void *)77);

main function prints out 110 before initiating another jump. This jump resumes the execution of plus10, and then the program exits as plus10 returns.

Details

We know that it’s easy to ask a program to jump to a function: simply changing %rbp to point to the base of a stack, %rsp to the top of the stack and %rip to the text area would make it happen (call call or jmp). What needs a bit more work is to save and restore registers, and to decide where to save registers.

Let’s take a look at how make_fcontext is implemented in the macho platform.

You can find the source code for your platform at https://github.com/boostorg/context/tree/boost-1.65.1/src/asm

In main, compiler as usual generates all code for calling make_fcontext, where %rdi is used to pass the first argument (sp) into callee, and %rsi to pass the second argument (size). See the commented assembly below:

callq  0x555555554980 <malloc@plt>
mov    %rax,-0x28(%rbp)
mov    -0x28(%rbp),%rax        # stackTop = std::malloc(size)
add    $0x40,%rax              # stackBase = stackTop + 64
mov    %rax,-0x20(%rbp)
mov    -0x20(%rbp),%rax
lea    -0xb0(%rip),%rdx        # rdx third argument: plus10 address
mov    $0x40,%esi              # rsi second argument: size
mov    %rax,%rdi               # rdi first argument: stackBase
callq  0x555555554970 <make_fcontext@plt>
The calling convention is: first six arguments are in rdi, rsi, rdx, rcx, r8d, r9d. Remaining arguments are on the stack.

What does make_fcontext look like?

_make_fcontext:
    /* first arg of make_fcontext() == top of context-stack */
    movq  %rdi, %rax

    /* shift address in RAX to lower 16 byte boundary */
    andq  $-16, %rax

    /* reserve space for context-data on context-stack */
    /* on context-function entry: (RSP -0x8) % 16 == 0 */
    leaq  -0x40(%rax), %rax

    /* third arg of make_fcontext() == address of context-function */
    /* stored in RBX */
    movq  %rdx, 0x28(%rax)

    /* save MMX control- and status-word */
    stmxcsr  (%rax)
    /* save x87 control-word */
    fnstcw   0x4(%rax)

    /* compute abs address of label trampoline */
    leaq  trampoline(%rip), %rcx
    /* save address of trampoline as return-address for context-function */
    /* will be entered after calling jump_fcontext() first time */
    movq  %rcx, 0x38(%rax)

    /* compute abs address of label finish */
    leaq  finish(%rip), %rcx
    /* save address of finish as return-address for context-function */
    /* will be entered after context-function returns */
    movq  %rcx, 0x30(%rax)

    ret /* return pointer to context-data */

trampoline:
    /* store return address on stack */
    /* fix stack alignment */
    push %rbp
    /* jump to context-function */
    jmp *%rbx

finish:
    /* exit code is zero */
    xorq  %rdi, %rdi
    /* exit application */
    call  __exit
    hlt

Let’s take a look at each line:

  • movq %rdi, %rax: %rax refers to the first argument sp.
  • andq $-16, %rax: Align %rax because function call requires stack 16-byte aligned. I discussed the 16-byte alignment in Tiger Compiler Notes: I386 Backend too.
  • leaq -0x40(%rax), %rax: this is equivalent to %rax -= 0x40 in other high level languages. This is 64 bytes (or eight 64bit) space reserved.
  • movq %rdx, 0x28(%rax): saves address of plus10 at sp-3
  • stmxcsr and fnstcw: uses the first 8bytes, which is sp-8
  • movq %rcx, 0x38(%rax): saves the trampoline to sp-1
  • movq %cx, 0x(%rax): saves the finish to sp-2

If we follow a debugging session, the memory layout looks like

High
 |(+0x40) 0x555555768eb0  |                     # stackBase
 |(+0x38) 0x555555768ea8  | save trampoline addr
 |(+0x30) 0x555555768ea0  | save finish addr
 |(+0x28) 0x555555768e98  | save plus10's address
 |(+0x20) 0x555555768e90  |
 |(+0x18) 0x555555768e88  |
 |(+0x10) 0x555555768e80  |
 |(+0x08) 0x555555768e78  |
 |  %rax  0x555555768e70  | stmxcsr; fnstcw     # stackTop
 v
Low

Or to show it concretely, we can print out the address in gdb

(gdb) p stackTop
$3 = (void *) 0x555555768e70
(gdb) p stackBase
$4 = (void *) 0x555555768eb0
(gdb) p plus10
$5 = {void (boost::context::detail::transfer_t)} 0x555555554aca <plus10(boost::context::detail::transfer_t)>
(gdb) x/8g stackTop
0x555555768e70: 0x0000037f00001f80      0x0000000000000000
0x555555768e80: 0x0000000000000000      0x0000000000000000
0x555555768e90: 0x0000000000000000      0x0000555555554aca   <= &plus10
0x555555768ea0: 0x00007ffff7bd2e5f      0x00007ffff7bd2e5c

What about the empty space between %rax+0x08 to %rax+0x20? We’ll see when it comes to jump_fcontext:

_jump_fcontext:
    leaq  -0x38(%rsp), %rsp /* prepare stack */

#if !defined(BOOST_USE_TSX)
    stmxcsr  (%rsp)     /* save MMX control- and status-word */
    fnstcw   0x4(%rsp)  /* save x87 control-word */
#endif

    movq  %r12, 0x8(%rsp)  /* save R12 */
    movq  %r13, 0x10(%rsp)  /* save R13 */
    movq  %r14, 0x18(%rsp)  /* save R14 */
    movq  %r15, 0x20(%rsp)  /* save R15 */
    movq  %rbx, 0x28(%rsp)  /* save RBX */
    movq  %rbp, 0x30(%rsp)  /* save RBP */

    /* store RSP (pointing to context-data) in RAX */
    movq  %rsp, %rax

    /* restore RSP (pointing to context-data) from RDI */
    movq  %rdi, %rsp

    movq  0x38(%rsp), %r8  /* restore return-address */

#if !defined(BOOST_USE_TSX)
    ldmxcsr  (%rsp)     /* restore MMX control- and status-word */
    fldcw    0x4(%rsp)  /* restore x87 control-word */
#endif

    movq  0x8(%rsp), %r12  /* restore R12 */
    movq  0x10(%rsp), %r13  /* restore R13 */
    movq  0x18(%rsp), %r14  /* restore R14 */
    movq  0x20(%rsp), %r15  /* restore R15 */
    movq  0x28(%rsp), %rbx  /* restore RBX */
    movq  0x30(%rsp), %rbp  /* restore RBP */

    leaq  0x40(%rsp), %rsp /* prepare stack */

    /* return transfer_t from jump */
    /* RAX == fctx, RDX == data */
    movq  %rsi, %rdx
    /* pass transfer_t as first arg in context function */
    /* RDI == fctx, RSI == data */
    movq  %rax, %rdi

    /* indirect jump to context */
    jmp  *%r8

jump_context needs to store all information of the calling environment before jumping into plus10, that’s why we have

leaq  -0x38(%rsp), %rsp

movq  %r12, 0x8(%rsp)  /* save R12 */
movq  %r13, 0x10(%rsp)  /* save R13 */
movq  %r14, 0x18(%rsp)  /* save R14 */
movq  %r15, 0x20(%rsp)  /* save R15 */
movq  %rbx, 0x28(%rsp)  /* save RBX */
movq  %rbp, 0x30(%rsp)  /* save RBP */

/* store RSP (pointing to context-data) in RAX */
movq  %rsp, %rax

jump_fcontext saves registers on the stack of the caller; in this case, it expands main‘s stack by 0x38 bytes, and saves registers on the stack!

The next line is important

/* restore RSP (pointing to context-data) from RDI */
movq  %rdi, %rsp

movq  0x38(%rsp), %r8  /* restore return-address */
....
jmp *%r8
First six arguments are in rdi, rsi, rdx, rcx, r8d, r9d. Remaining arguments are on the stack.

Remember %rdi stores the first argument, which is fcontext_t for jump_fcontext . If we print it out

(gdb) p/x $rdi
$30 = 0x555555768e70

This is exactly the bottom address of what we’ve seen in make_fcontext. Now %rsp points to that address! What is 0x38(%rsp) there? The trampoline address!

Alright, %rsp is in place.

Before jumping to the trampoline (and later plus10), what else do we need to do? We need to restore all other saved registers

movq  0x8(%rsp), %r12  /* restore R12 */
movq  0x10(%rsp), %r13  /* restore R13 */
movq  0x18(%rsp), %r14  /* restore R14 */
movq  0x20(%rsp), %r15  /* restore R15 */
movq  0x28(%rsp), %rbx  /* restore RBX */
movq  0x30(%rsp), %rbp  /* restore RBP */

leaq  0x40(%rsp), %rsp /* prepare stack */

/* return transfer_t from jump */
/* RAX == fctx, RDX == data */
movq  %rsi, %rdx
/* pass transfer_t as first arg in context function */
/* RDI == fctx, RSI == data */
movq  %rax, %rdi

Given the above code, now let’s look at the memory layout of the moment when make_fcontext returns:

High
 |(+0x40) 0x555555768eb0  |                     # stackBase
 |(+0x38) 0x555555768ea8  | save trampoline addr
 |(+0x30) 0x555555768ea0  | save finish addr    # rbp
 |(+0x28) 0x555555768e98  | save foo's address  # rbx
 |(+0x20) 0x555555768e90  |
 |(+0x18) 0x555555768e88  |
 |(+0x10) 0x555555768e80  |
 |(+0x08) 0x555555768e78  |
 |  %rax  0x555555768e70  | stmxcsr; fnstcw     # stackTop
 v
Low

%rbx points to foo address. %rbp points to finish. %rsp points to stackTop. %rdi points to transfer_t.

When jump_fcontext jumps to trampoline

/* store return address on stack */
/* fix stack alignment */
push %rbp
/* jump to context-function */
jmp *%rbx

trampoline pushes %rbp, which points to finish so when ret executes later, finish will be restored into %rip, and jumps to %rbx, which points to plus10.

Now we know why the program just exits when plus10 exits because the ret pops into %rip the address of finish, which calls __exit.

Discussion

You may have noticed that

  1. We pass in stackTop + size instead of stackTop in make_fcontext
  2. The program doesn’t exit in main; it doesn’t print “main: return”
  3. The second jump tries to pass in 77, but plus10 resumes to print 99 in transfer.data

Stack pointer

make_fcontext has the following instruction to reserve the stack space:

leaq -0x40(%rax), %rax

The %rax is the argument sp.

Remember the memory grows from high to low? The %rax here is actually the base of the stack (high address), not the top of the stack (low address).

That’s why the code should use sp+size to reach the high address instead of passing in sp.

auto sp = std::malloc(size);
auto transfer = ctx::make_fcontext(sp+size, size, fn);

Stack Size

In our example, the stack size is only 64bytes, because plus10 doesn’t need a large stack. However, it is possible to overflow the preallocated heap if the function that the control transfer to has a lot of local variables.

Compiler will generate code to reserve space inside plus10 for all local variables of plus10. If stackBase is not large enough, the generated plus10 will cause overflow on the artifical stack stackBase.

Why do I call it artifical stack? Look at the stacktrace when jump_fcontext is called in plus10:

(gdb) bt
#0  0x00007ffff7bd2e70 in jump_fcontext () from /usr/lib/x86_64-linux-gnu/libboost_context.so.1.65.1
#1  0x0000555555554b02 in plus10 (transfer=...) at simple.cc:9
#2  0x00007ffff7bd2e5f in make_fcontext () from /usr/lib/x86_64-linux-gnu/libboost_context.so.1.65.1
#3  0x0000000000000000 in ?? ()
If you have forgotten the memory layout of a generated function, a quick skim of Tiger Compiler Notes: I386 Backend might ring a bell for you.

Do you know why the bottom of the stack is not main? Usually %rbp points to previous %rbp so recursively lookup the %rbp in memory can find all stacks. Since we passed in make_fcontext an address on the heap, %rbp is pointing to the heap, and on heap the memory is all 0. That’s why you see the bottom of the stack is 0x0000000000000000.

Resume to Previous jump_fcontext

When we passed in 77, why did plus10 exit with the message:

f1: return with transfer.data 99

The explanation is easy, because the reference of transfer.data is on the stack.

Compiler has transformed this code

std::cout << "f1: return with transfer.data " << (int64_t)transfer.data
          << std::endl;

to

0x0000555555554b02 <+56>:    lea    0x21f(%rip),%rsi        # 0x555555554d28
0x0000555555554b09 <+63>:    lea    0x201510(%rip),%rdi        # 0x555555756020 <_ZSt4cout@@GLIBCXX_3.4>
0x0000555555554b10 <+70>:    callq  0x555555554950 <_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@plt>
0x0000555555554b15 <+75>:    mov    %rax,%rdx
0x0000555555554b18 <+78>:    mov    -0x18(%rbp),%rax
0x0000555555554b1c <+82>:    mov    %rax,%rsi
0x0000555555554b1f <+85>:    mov    %rdx,%rdi
0x0000555555554b22 <+88>:    callq  0x5555555549a0 <_ZNSolsEl@plt>

where -0x18(%rbp) is the reference for transfer.data. Data on the stack stays the same between jumps.

Speed

You may be like me, wondering how fast this would be at this moment.

Transfering the control 1 billion times using the following program takes about 7s (roughly 150 million times per second) on my old laptop which has 2.5GHz i5-3210M Dual-core processor.

#include <boost/context/detail/fcontext.hpp>
#include <iostream>
#include <memory>

namespace ctx = boost::context::detail;

void plus10(ctx::transfer_t transfer) {
  uint64_t count = 0;
  while (count < 1000000000L) {
          count += 1;
          ctx::jump_fcontext(transfer.fctx, nullptr);
  }
  std::cout << "Count is " << count << std::endl;
}

int main(int argc, char *argv[]) {
  const int size = 64;
  void *stackTop = std::malloc(size);
  void *stackBase = (unsigned char *)stackTop + size;

  auto fiberContext = ctx::make_fcontext(stackBase, size, &plus10);

  auto transfer = ctx::jump_fcontext(fiberContext, nullptr);

  while (true) {
          transfer = ctx::jump_fcontext(transfer.fctx, nullptr);
  }

  std::cout << "main: return" << std::endl;
  return EXIT_SUCCESS;
}

Summary

The implementation of non-preempty context switch is a well designed memory layout manipulation.

During make_fcontext, the memory is laid out such that jump_fcontext saves all registers on the stack of caller (main in our case) and jumps to trampoline, which then jumps to the target function plus10. When jumps back from the target function, jump_fcontext saves all registers on the stack of plus10, restores previous registers to continue to run main.