contact kkd on freenode
Eliminating notification races in systemd
Last updated: Jun 15 2020

systemd is the most established init system on Linux today. The project started almost 10 years ago and has, sooner or later, made its way into almost all distributions. There are some holdouts, but those are mostly exceptions. However, for some time now, one of the most useful features systemd provides has suffered from race conditions when used from an unprivileged context.

The service manager supports reliable startup notifications through a notification socket placed in /run. There are various switches to set the access level for a service, and the protocol is general purpose and extensible enough to support plenty of other message types. This is not a novel concept. In fact, other service managers like s6 support it (through a simpler protocol of inheriting an fd and doing a single byte write to it). systemd's inspiration, launchd, also supports something similar.
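
To make the simpler fd-based scheme concrete, here is a minimal sketch of the daemon side. It assumes the supervisor has already handed us a writable descriptor and that a single newline byte is the readiness message. The function name is made up for illustration, and how the fd number reaches the daemon is left out.

```python
import os


def notify_ready_via_fd(fd: int) -> None:
    """Signal readiness on an inherited descriptor, then close it."""
    # Assumption: one newline byte is the entire readiness message.
    os.write(fd, b"\n")
    # Closing the descriptor tells the supervisor nothing else will follow.
    os.close(fd)
```

There is no socket, no environment lookup and no message parsing, which is the appeal of this approach. The trade-off is that it cannot carry other message types.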

The sd_notify(3) API is used by services supporting the Type=notify startup notification scheme to tell systemd that they're up, so that precise scheduling of dependent jobs can be achieved. systemd also provides a convenient systemd-notify(1) tool for use in scripts and such, or for programs that don't want to link to libsystemd or write their own socket code. Such programs can still notify systemd of their startup by forking off the binary.
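
For programs that would rather write their own socket code, the client side is small. The following is a rough sketch, not the libsystemd implementation: it reads the socket address from the NOTIFY_SOCKET environment variable and sends one datagram such as READY=1. The function name `notify` is my own, and error handling is kept to a minimum.

```python
import os
import socket


def notify(state: str) -> bool:
    """Send one notification datagram to the socket named in $NOTIFY_SOCKET.

    Returns False when no notification socket is configured.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading '@' denotes a Linux abstract socket (NUL-prefixed name).
    address = "\0" + path[1:] if path.startswith("@") else path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), address)
    return True
```

A service would call `notify("READY=1")` once it has finished initializing. Note that the call returns as soon as the datagram is queued, not when systemd has processed it.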

However, the notification protocol was at some point switched to being asynchronous in nature, as deadlocks between D-Bus, PID 1, and journald manifested due to cyclic dependencies between the three, and the synchronous nature of notifications only added to the problem. This also meant that unless the process that sent the original notification message stayed around until the message was processed, associating the PID obtained from SCM_CREDENTIALS with the cgroup (and hence the unit) was not possible. This has been known since at least 2016.
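
The race is easier to see from the receiving side. Below is a simplified sketch of what a receiver has to do, not systemd's actual code: enable SO_PASSCRED so the kernel attaches the sender's credentials, pull the PID out of the SCM_CREDENTIALS control message, then look up the cgroup in /proc. All three helper names are invented for this example.

```python
import socket
import struct

# struct ucred: pid_t pid; uid_t uid; gid_t gid
UCRED = struct.Struct("iII")


def enable_credentials(sock: socket.socket) -> None:
    """Ask the kernel to attach sender credentials to incoming messages."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)


def recv_with_pid(sock: socket.socket):
    """Receive one datagram plus the sender PID from SCM_CREDENTIALS."""
    data, ancdata, _flags, _addr = sock.recvmsg(
        4096, socket.CMSG_SPACE(UCRED.size))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            pid, _uid, _gid = UCRED.unpack(cdata[:UCRED.size])
            return data, pid
    return data, None


def cgroup_of(pid: int):
    """Return the contents of /proc/<pid>/cgroup, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/cgroup") as f:
            return f.read()
    except OSError:
        # The sender is already gone: the PID cannot be mapped to a unit.
        return None
```

The message and the PID arrive together, but the cgroup lookup happens later. If the sender has exited in between, `cgroup_of` returns None and the notification cannot be attributed to any unit.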

BTW, this is also one of the reasons why systemd developers have been trying to convince kernel developers to accept some form of modern IPC primitive in the kernel (with strong global order, multicast and peer discovery support, as those are among the primary reasons why you'd choose D-Bus), as systemd itself is a user of a service it is responsible for starting up. Other perks like trusted and comprehensive credential metadata for messages are also nice to have.

An attempt to solve this was made by proposing a kernel patch that attached task metadata (including the cgroup) to a socket message, but it was shot down due to efficiency concerns. It would also have solved a similar race in journald (which still remains non-trivial to fix). Perhaps the biggest issue was that systemd-notify was in most cases useless for unprivileged users, which is exactly where it is most likely to be used, especially for ad-hoc scripts running as user services.

The good news is that with PR15547, the issue has been fixed for good. The solution agreed upon in the end was to add a new BARRIER=1 notification, but one must attach a file descriptor to this message, and then systemd will close this fd on its end. systemd already supports uploading file descriptors to the manager using the sd_notify(3) interface, so the modifications needed turned out to be minimal. You must send this notification separately, only send a single descriptor, and mix no other notifications with it.

The cover letter goes into more detail, but the above is the gist of it all. If one sends the write end of a pipe to the manager, then we'll receive POLLHUP on the read fd when the manager closes the descriptor it received. It is guaranteed that messages are processed in order, so anything sent before the barrier must have been processed by the time the write end is closed, allowing us to synchronize in a lightweight manner. The first user of this new sd_notify_barrier(3) call is systemd-notify itself, which previously suffered from the race.
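
The pipe trick described above can be sketched in a few lines. This is an illustration of the idea rather than what sd_notify_barrier(3) does internally. The two function names are my own, and the socket path is passed in explicitly instead of being read from the environment.

```python
import array
import os
import select
import socket


def send_barrier(notify_path: str) -> int:
    """Send BARRIER=1 with the write end of a fresh pipe; return the read end."""
    rfd, wfd = os.pipe()
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
            sock.connect(notify_path)
            # One message, one descriptor, no other notifications mixed in.
            fds = array.array("i", [wfd])
            sock.sendmsg([b"BARRIER=1"],
                         [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])
    except OSError:
        os.close(rfd)
        raise
    finally:
        # Drop our copy so the receiver ends up holding the only write end.
        os.close(wfd)
    return rfd


def wait_barrier(rfd: int, timeout_ms: int) -> bool:
    """Return True once every write end of the pipe has been closed."""
    poller = select.poll()
    poller.register(rfd, select.POLLIN)  # POLLHUP is reported regardless
    events = poller.poll(timeout_ms)
    return any(ev & select.POLLHUP for _fd, ev in events)
```

A short-lived script would send its normal notifications first, then call `send_barrier`, then `wait_barrier` on the returned descriptor, and exit only after that returns True. The caller is responsible for closing the read end afterwards. Closing our own copy of the write end right after sending is what makes the hangup meaningful: once the manager closes its copy, no writer is left.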

This API isn't very useful for daemons, since they're long-running anyway and stay around while their notifications are processed. Still, it wouldn't hurt to call it once at startup, because the wait is usually short: by the time we call poll, systemd has often already closed the write end.