I run all kinds of home automation services in Docker. One of them is an old-school FTP server (vsftpd) that my scanner uploads files to. In every running container I also start a tmux daemon, and I attach to its session whenever I need a shell.
One day, while scanning some letters, I attached to the tmux session inside the FTP container, and soon the scanner started complaining that the FTP server was unreachable.
The container was dead. Weird.
After some experimenting, I could reliably reproduce the unexpected container exit by running tmux attach and then immediately exiting the bash running in tmux.
How did the bash logout cause the container to exit?
The container exits only when vsftpd, its pid 1, exits, so the only option I had was to strace vsftpd.
# strace -p 1
strace: Process 1 attached
accept(3, ...
The strace output shows the FTP server blocked in an accept call. As soon as bash exits, vsftpd receives SIGCHLD:
accept(3, 0x7ffe6f971450, [28]) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=12, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
alarm(1) = 0
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
alarm(0) = 1
wait4(-1, NULL, WNOHANG, NULL) = 12
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
From the signal’s siginfo fields, which process exited and caused the SIGCHLD? si_pid=12, and si_code=CLD_EXITED with si_status=0 says it exited normally rather than being killed. What is pid 12? It’s the tmux session’s daemon:
# ps -afx -o pid,ppid,user,args
PID PPID USER COMMAND
32 0 root bash
46 32 root \_ strace -p 1
27 0 root tmux new-session -A -s ftptest
1 0 root vsftpd
12 1 root tmux new-session -s ftptest -d
13 12 root \_ -bash
48 13 root \_ ps -afx -o pid,ppid,user,args
And the tmux daemon’s parent pid is 1, which is vsftpd.
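Why is the tmux daemon a child of pid 1 rather than of the shell that started it? Most likely because the tmux server detaches from the process that launched it, and the kernel reparents an orphaned process to pid 1 of its pid namespace (or to a subreaper, if one is set). Here is a minimal Python sketch of that reparenting. It is not tmux’s actual code, and the name orphan_parent_pid is made up for illustration.

```python
import os
import time


def orphan_parent_pid(timeout=5.0):
    """Fork twice, let the middle process exit, and report who adopts the orphan.

    Returns (middle_pid, adopter_pid) as seen from the original process.
    """
    read_fd, write_fd = os.pipe()
    middle = os.fork()
    if middle == 0:
        # Middle process: start the "daemon", then exit right away.
        try:
            os.close(read_fd)
            me = os.getpid()
            if os.fork() == 0:
                # Grandchild: wait until the kernel has reparented us.
                deadline = time.monotonic() + timeout
                while os.getppid() == me and time.monotonic() < deadline:
                    time.sleep(0.01)
                os.write(write_fd, str(os.getppid()).encode())
        finally:
            os._exit(0)  # never fall back into the caller's code
    # Original process: reap the middle process, then read the orphan's report.
    os.close(write_fd)
    os.waitpid(middle, 0)
    data = os.read(read_fd, 64)
    os.close(read_fd)
    return middle, int(data)
```

Calling orphan_parent_pid() inside a container where vsftpd is pid 1 should report 1 as the adopter. On a desktop Linux system it may instead be a subreaper such as a session manager.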
Well, one thing I noticed is that when the last bash exits, the tmux daemon exits too. And when a child process exits, the kernel sends SIGCHLD to its parent. This is usually not a problem, because the default disposition of SIGCHLD is to ignore the signal. However, the vsftpd source shows that it installs a handler for SIGCHLD. Apparently vsftpd assumes that one of its own children has exited and starts its cleanup, which then hits the SIGSEGV that crashes the program.
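To see the mechanism in isolation, here is a small Python sketch (not vsftpd’s code) of a parent that installs a SIGCHLD handler and reaps whichever child has exited with a non-blocking wait, the same wait4(-1, NULL, WNOHANG, NULL) pattern visible in the trace. The names reaped, on_sigchld and reap_demo are made up for illustration, and the sketch assumes it runs in the main thread on a Unix system with Python 3.9 or later.

```python
import os
import signal
import time

reaped = []  # (pid, exit code) pairs collected by the handler


def on_sigchld(signum, frame):
    # Reap every child that has exited, without blocking.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append((pid, os.waitstatus_to_exitcode(status)))


def reap_demo(timeout=5.0):
    """Fork a child that exits at once and return (child_pid, reaped list)."""
    reaped.clear()
    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        child = os.fork()
        if child == 0:
            os._exit(0)  # the child exits immediately, like the tmux daemon did
        deadline = time.monotonic() + timeout
        while not reaped and time.monotonic() < deadline:
            time.sleep(0.05)  # the handler runs when SIGCHLD arrives
        return child, list(reaped)
    finally:
        # Put back whatever disposition was there before the demo.
        signal.signal(signal.SIGCHLD,
                      previous if previous is not None else signal.SIG_DFL)
```

The handler fires for any child of the process, including ones it never knowingly started, which is the situation vsftpd appears to have been in with the tmux daemon.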
We have two options regardless of how vsftpd chooses to handle SIGCHLD:
- Let’s not send the SIGCHLD signal in the first place. However, the kernel does this itself, so we can’t really change it.
- Let’s run something other than vsftpd as pid 1.
For the latter, Docker can run the command in a shell. Here is a quote from the Docker reference page:
You can specify a plain string for the ENTRYPOINT and it will execute in /bin/sh -c
So the simple fix is:
-ENTRYPOINT ["vsftpd"]
+ENTRYPOINT /usr/sbin/vsftpd
After the fix, vsftpd and tmux no longer have a parent-child relationship:
root@services# ps auxf
USER PID COMMAND
root 32 bash
root 46 \_ ps auxf
root 1 /bin/sh -c /usr/sbin/vsftpd
root 6 /usr/sbin/vsftpd
root 14 tmux new-session -s ftp -d
root 15 \_ -bash
Now exiting tmux doesn’t cause vsftpd to crash. Yay!
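One thing to note is that docker stop sends SIGTERM to pid 1, which is now /bin/sh rather than vsftpd, and the shell simply goes back to waiting: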
# strace -p 1
strace: Process 1 attached
wait4(-1, 0x7ffe10cb9adc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=0, si_uid=0} ---
wait4(-1, <unfinished ...>) = ?
This means docker stop has to wait out its 10-second grace period before it kills the container.
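If the slow stop bothers you, pid 1 needs to be something that both shields vsftpd from stray SIGCHLDs and passes SIGTERM on. Below is a rough Python sketch of such a wrapper, under the assumption that the server stays in the foreground as a direct child. The name run_as_init is hypothetical, and this is an illustration of the idea rather than a hardened init.

```python
import os
import signal


def run_as_init(argv):
    """Run argv as a child process, forward stop signals to it, and reap children.

    Returns the child's exit code (negative signal number if a signal killed it).
    """
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)  # replace the forked copy with the real server
        finally:
            os._exit(127)  # only reached if exec failed

    def forward(signum, frame):
        # Hand the stop request to the server instead of swallowing it.
        try:
            os.kill(child, signum)
        except ProcessLookupError:
            pass  # the server is already gone

    # Installed after the fork so the forked copy never runs this handler.
    forwarded = (signal.SIGTERM, signal.SIGINT)
    previous = {sig: signal.signal(sig, forward) for sig in forwarded}
    try:
        while True:
            # Blocks until any child exits; this also reaps adopted orphans.
            pid, status = os.wait()
            if pid == child:
                return os.waitstatus_to_exitcode(status)
    finally:
        for sig, handler in previous.items():
            signal.signal(sig, handler if handler is not None else signal.SIG_DFL)
```

Something like run_as_init(["/usr/sbin/vsftpd"]) would then take the place of the shell as pid 1. Docker also offers an --init flag on docker run that puts a small ready-made init in front of the entrypoint, which may be the simpler route.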