Debugging: Signals and Subprocesses

[This article was first published on Posts on R-hub blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is a short story about a non-trivial bug in the processx package,
and how I fixed it. It is a good showcase of the some debugging tools.

The bug

processx is an R package to start and manage
external processes. It is used by the callr
package to run code in another R session. The
original bug report
by Will Landau has a nice, clean
reproducible example:

https://github.com/r-lib/processx/pull/237 seems to solve
https://github.com/r-lib/processx/issues/236 for processx,
but not for callr.

> fun <- function() {
>   parallel::mclapply(1:2, function(x) x)
> }
> env <- c(
>   callr::rcmd_safe_env(),
>   PROCESSX_NOTIFY_OLD_SIGCHLD = "true"
> )
> tmp <- callr::r(fun, env = env, show = TRUE)
> #> Error while shutting down parallel: unable to terminate some child processes

A red herring

The references to PR #237 and
issue #236 are a
red herring, and a tricky one!
The error message is exactly the same there, but the reasons are very
different. That issue is about an interference between processx and the
parallel package. This one must be something else, because callr does not
load processx (or any other R package) in the R subprocess it creates:

callr::r(function() loadedNamespaces())

## [1] "compiler"  "graphics"  "utils"     "grDevices"
## [5] "stats"     "datasets"  "methods"   "base"

About SIGCHLD signals

To understand what is going on here, we need to know a bit about signals
and subprocesses. A signal is an asynchronous notification, sent to a
process by a(nother) process or the operating system. A signal means
that an important event has happened, and the process’s normal execution
is interrupted. For some signal types, the process gets a chance to handle
the signal gracefully and then continue execution. The SIGCHLD signal is
such a signal. It is sent by the operating system when a subprocess of
the (parent) process has finished its execution. When the parent process
receives SIGCHLD, the subprocess has already finished, but it is still
in the OS’s process table. It is dead but still hanging around: it is a
zombie process.

To completely eliminate the subprocess, the parent process needs to either
read out its exit status, or tell the OS that it is not interested in the
exit status. The parent process can also pre-emptively tell the OS that
it is not interested in the exit status of its subprocesses, and in this
case no SIGCHLD signals are delivered to it at all.

Of course it is my bad

To make sure that this issue is specific to processx or callr, I tried if
a subprocess started with system() has the same issue, and it does not:

system("R -q -e 'parallel::mclapply(1:2, function(x) x)'")

Whereas processx/callr does:

tmp <- callr::r(
  function() parallel::mclapply(1:2, function(x) x),
  show = TRUE
)

## Error while shutting down parallel: unable to terminate some child processes

Still, it is possible (however unlikely 😉) that the bug
is in the parallel package, e.g. because it does not set up the signal
handler for SIGCHLD properly, when the R session was started by
processx/callr.

The parallel package sets up a
finalizer, that runs when R exits.
This finalizer tries to eliminate all subprocesses that were started by
parallel, and if it fails to do that, it emits the error message that we
see above.

When eliminating subprocesses, parallel sends them a SIGKILL signal, which
is not possible to catch and handle, so it is sure that their execution
has finished, and they are in a zombie state. In fact, mclapply() already
tries to clean up the subprocesses it has started, so they are probably
already in the zombie state when mclapply() returns. This is
easy to check with the ps package:

tmp <- callr::r(
  function() {
    parallel::mclapply(1:2, function(x) x)
    print(sapply(ps::ps_children(ps::ps_handle()), ps::ps_status))
  },
  show = TRUE
)

## [1] "zombie" "zombie"
## Error while shutting down parallel: unable to terminate some child processes

Indeed, both subprocesses of the callr R process are zombies.
Clearly, the callr R process did not receive or did not handle their
SIGCHLD signals. To make sure that the SIGCHLD signal handler is
properly set up in parallel, we need to debug parallel’s C code, in the
callr subprocess.

We will use the lldb debugger here, as that is
the default on macOS. gdb has similar functionality, you probably want to
use gdb for gcc and Linux.

To debug the callr subprocess, we tell lldb to wait for a process called R:

❯ lldb -w -n R
(lldb) process attach --name "R" --waitfor

Then in another terminal, we run callr::r():

callr::r(function() parallel::mclapply(1:2, function(x) x))

As soon as the callr subprocess starts, lldb will stop it (using a SIGSTOP
signal that cannot be caught):

❯ lldb -w -n R
(lldb) process attach --name "R" --waitfor
Process 4196 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007fff74b75a4b libsystem_info.dylib`xdrmem_getlong_aligned + 75
libsystem_info.dylib`xdrmem_getlong_aligned:
->  0x7fff74b75a4b <+75>: callq  0x7fff74b759f0            ; _OSSwapInt32
    0x7fff74b75a50 <+80>: movq   -0x18(%rbp), %rdx
    0x7fff74b75a54 <+84>: movl   %eax, (%rdx)
    0x7fff74b75a56 <+86>: movq   -0x10(%rbp), %rdx
Target 0: (R) stopped.

Executable module set to "/Library/Frameworks/R.framework/Resources/bin/exec/R".
Architecture set to: x86_64h-apple-macosx-.
(lldb)

Now we can set break points on
parallel C functions.
For a better debugging experience, it makes sense to recompile R without
optimization, and with debug symbols, so we can actually see the original C
source code in the debugger. But for now we can get away without that.

(lldb) b mc_fork
Breakpoint 1: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.
(lldb) b mc_cleanup
Breakpoint 2: no locations (pending).
WARNING:  Unable to resolve breakpoint to any actual locations.

Ideally, we would set a break point on the setup_sig_handler() function,
that sets up the signal handler, but this function is optimized out by
the compiler. Lets continue running the process:

(lldb) c
Process 4196 resuming
1 location added to breakpoint 1
1 location added to breakpoint 2
Process 4196 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x000000010e89ea10 parallel.so`mc_fork
parallel.so`mc_fork:
->  0x10e89ea10 <+0>: pushq  %rbp
    0x10e89ea11 <+1>: movq   %rsp, %rbp
    0x10e89ea14 <+4>: pushq  %r15
    0x10e89ea16 <+6>: pushq  %r14
Target 0: (R) stopped.
(lldb)

lldb stops the process at mc_fork(). parallel is just about to start a
subprocess here. We step throught mc_fork() a bit (there is not easy way
to step until the end of a function in lldb, AFAICT). We go on until we
see that sigaction(2) is called, this should set up the signal handler:

(lldb) n
Process 4196 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step over
    frame #0: 0x000000010e89eac9 parallel.so`mc_fork + 185
parallel.so`mc_fork:
->  0x10e89eac9 <+185>: callq  0x10e8a089a               ; symbol stub for: sigaction
    0x10e89eace <+190>: movl   $0x80000, -0x58(%rbp)     ; imm = 0x80000
    0x10e89ead5 <+197>: leaq   -0x58(%rbp), %rsi
    0x10e89ead9 <+201>: leaq   -0x44(%rbp), %rdx
Target 0: (R) stopped.
(lldb) n
Process 4196 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step over
    frame #0: 0x000000010e89eace parallel.so`mc_fork + 190
parallel.so`mc_fork:
->  0x10e89eace <+190>: movl   $0x80000, -0x58(%rbp)     ; imm = 0x80000
    0x10e89ead5 <+197>: leaq   -0x58(%rbp), %rsi
    0x10e89ead9 <+201>: leaq   -0x44(%rbp), %rdx
    0x10e89eadd <+205>: movl   $0x1, %edi
Target 0: (R) stopped.

If you look at the source code, sigaction(2) is called from
setup_sig_handler(), but the compiler inlined that function.
Now that lldb has resolved the break points to the loaded parallel.so
shared lib, we can double check that:

(lldb) image lookup -n setup_sig_handler
(lldb)

Never mind, sigaction(2) was already called, so the signal handler
should be set up. Let’s check it. We can call sigaction(2) to query the
current SIGCHLD handler:

(lldb) call (void*) malloc(sizeof(struct sigaction))                                                               (void *) $0 = 0x00007fe6c7e305f0
(lldb) call (int) __sigaction(20, NULL, $0)
(int) $1 = 0
(lldb) p ((struct sigaction *)$0)->__sigaction_u                                                                   (__sigaction_u) $2 = {
  __sa_handler = 0x000000010e8a0750 (parallel.so`parent_sig_handler)
  __sa_sigaction = 0x000000010e8a0750 (parallel.so`parent_sig_handler)
}

20 is the number of SIGCHLD, see man signal, and for the (platform
dependent) structure of struct sigaction see man sigaction.
In any case, parallel.so’s parent_sig_handler is indeed set up properly.
Now we tell lldb to stop on receiving a SIGCHLD signal:

(lldb) process handle SIGCHLD --notify true --pass true --stop true                                                NAME         PASS   STOP   NOTIFY
===========  =====  =====  ======
SIGCHLD      true   true   true

and we are ready to continue the process. It will stop again at mc_fork(),
because we are starting two subprocesses in mclapply(). Then we continue
again:

(lldb) c
Process 4196 resuming
Process 4196 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x000000010e89ea10 parallel.so`mc_fork
parallel.so`mc_fork:
->  0x10e89ea10 <+0>: pushq  %rbp
    0x10e89ea11 <+1>: movq   %rsp, %rbp
    0x10e89ea14 <+4>: pushq  %r15
    0x10e89ea16 <+6>: pushq  %r14
Target 0: (R) stopped.
(lldb) c
Process 4196 resuming
Process 4196 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x000000010e89e680 parallel.so`mc_cleanup
parallel.so`mc_cleanup:
->  0x10e89e680 <+0>: pushq  %rbp
    0x10e89e681 <+1>: movq   %rsp, %rbp
    0x10e89e684 <+4>: pushq  %r15
    0x10e89e686 <+6>: pushq  %r14
Target 0: (R) stopped.
(lldb)

Interestingly, we got to mc_cleanup, even though the SIGCHLD signal(s)
should have arrived first. From another R session, we can check that the
subprocesses of parallel are zombies already:

❯ sapply(ps::ps_children(ps::ps_handle(4196L)), ps::ps_status)
[1] "zombie" "zombie"

We can continue still, and hope that the SIGCHLDs will still arrive,
but they won’t:

(lldb) c
Process 4196 resuming
Process 4196 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x000000010e89e680 parallel.so`mc_cleanup
parallel.so`mc_cleanup:
->  0x10e89e680 <+0>: pushq  %rbp
    0x10e89e681 <+1>: movq   %rsp, %rbp
    0x10e89e684 <+4>: pushq  %r15
    0x10e89e686 <+6>: pushq  %r14
Target 0: (R) stopped.
(lldb) c
Process 4196 resuming
Process 4196 exited with status = 0 (0x00000000)
(lldb)

Clearly, the signal handler is properly set up in parallel, but the
signals are not delivered to the process. No doubt, this is a bug in
the processx or callr code.

Some hypotheses

There are very few possible reasons for the OS not sending out SIGCHLD
signals. The first is that the parent process explicitly tells the OS that
it is not interested. This is done by setting the signal handler to
SIG_IGN with sigaction(2). This is unlikely to be the case for us,
since we saw that parallel did set up a signal handler properly. By
searching the R source code and the source code of processx and callr for
SIG_IGN, it is obvious that none of them does this.

The other reason is that the SIGCHLD signal is blocked by the process.
Blocking a signal means that the process tells the OS, that it is currently
not ready to process it, and it should be delivered later, when the process
has unblocked the signal. If SIGCHLD was blocked in the callr subprocess,
that would be a good explanation for the OS not sending it.

Validating the hypothesis

The sigprocmask(2) system call can be used to query or manipulate the
set of blocked signals. So we can re-run our callr::r() reprex, with
lldb on the callr subprocess again, and examine the state of the
SIGCHLD signal:

(lldb) call (void*) malloc(sizeof(sigset_t))
(void *) $3 = 0x00007f92704c4300
(lldb) call (int) sigprocmask(0, NULL, $3)
(int) $4 = 0
(lldb) p (int) sigismember((const sigset_t*) $3, 20)
(int) $7 = 1

SIGCHLD is indeed blocked! We got it!

Fixing the bug

Clearly, if base::system() works and callr::r() and processx::run()
do not, that means that processx causes the SIGCHLD to be blocked in the
subprocess.

When processx starts a subprocess, it first calls fork(2) to create a
copy of the current process, and then execvp(3) to replace that with
another executable. fork() creates an identical copy, i.e. the signal
handlers and the blocked signals are the same in the subprocess.
execvp(3) resets the signal handlers to their defaults, which is great,
but apparently, it does not reset the set of blocked signals. So if
SIGCHLD was blocked when processx called fork(2) then it will be blocked
in the subprocess as well. Indeed,
this is how processx started the subprocess
before fixing this bug:

  processx__block_sigchld();

  pid = fork();

  if (pid == -1) {		/* ERROR */
    err = -errno;
    if (signal_pipe[0] >= 0) close(signal_pipe[0]);
    if (signal_pipe[1] >= 0) close(signal_pipe[1]);
    if (cpty) close(pty_master_fd);
    processx__unblock_sigchld();
    R_THROW_SYSTEM_ERROR_CODE(err, "Cannot fork when running '%s'",
                              ccommand);
  }

  /* CHILD */
  if (pid == 0) {
    if (cpty) close(pty_master_fd);
    processx__child_init(handle, pipes, num_connections, ccommand, cargs,
                         signal_pipe[1], cstdin, cstdout, cstderr,
                         pty_name, cenv, &options, ctree_id);
    R_THROW_SYSTEM_ERROR("Cannot start child process when running '%s'",
                         ccommand);
  }

processx blocks SIGCHLD before forking, for simplicity, so it does not
have to worry about concurrency. processx__child_init() does a number
of things, but it does not unblock this signal, so it stays blocked.

The fix
is clearly to unblock the signal in the child process, before calling
execvp(3) in processx__child_init():

  /* CHILD */
  if (pid == 0) {
    /* LCOV_EXCL_START */
    if (cpty) close(pty_master_fd);
    processx__unblock_sigchld();
    processx__child_init(handle, pipes, num_connections, ccommand, cargs,
			 signal_pipe[1], cstdin, cstdout, cstderr,
                         pty_name, cenv, &options, ctree_id);
    R_THROW_SYSTEM_ERROR("Cannot start child process when running '%s'",
                         ccommand);
    /* LCOV_EXCL_STOP */
  }

Conclusion

Debugging is hard, but debuggers like lldb help you greatly. It is worth
to invest some time into learning what they can do: wait for a process to
attach to, call functions and system calls, catch signals, and a lot more.

The fact that the signal mask is not reset at execvp(3) is of course
documented in the manual page of execve(2) (on macOS). This is the system
call that execvp(3) uses internally. I guess another lesson is to
read the manual…

If you enjoyed reading this, you might like similar debugging stories by
Jim Hester
and
Rich FitzJohn.

Resources

Some resources for debugging C/C++ code in R packages:

To leave a comment for the author, please follow the link and comment on their blog: Posts on R-hub blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)