Eliminating notification race conditions in systemd

systemd is the most established init system on Linux today. The project started almost 10 years ago and has, sooner or later, made its way into almost every distribution; the few holdouts are exceptions. It is one of the projects I follow very closely, and there is a lot to learn from reading its code if you're interested in lower-level Linux userspace. However, almost since its inception, one of the most useful features systemd provides has suffered from race conditions.

The service manager supports reliable startup notifications through a notification socket placed in /run. There are various switches to set the access level for a service, and the protocol is general purpose and extensible enough to support plenty of other message types. See sd_notify(3) for more.

Services that use the Type=notify startup scheme call sd_notify(3) to tell systemd that they have started, so that dependencies can be scheduled precisely at boot. systemd also provides a convenient systemd-notify(1) tool for scripts, and for programs that don't want to link against libsystemd or write their own socket code: they can still notify systemd of their startup by forking off the binary.
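To make the "write their own socket code" option concrete, here is a minimal Python sketch of what such a service might do. It relies on the behaviour documented in sd_notify(3): the manager passes the socket address in the NOTIFY_SOCKET environment variable, and a leading "@" marks an abstract-namespace socket. The function name notify and its optional path parameter are my own choices for illustration, not part of any systemd API.

```python
import os
import socket
from typing import Optional


def notify(state: str, path: Optional[str] = None) -> bool:
    """Send one notification datagram, e.g. "READY=1".

    Returns False when no notification socket is configured, which is
    the normal case when the program is not started by systemd.
    """
    path = path or os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False

    # A leading "@" denotes a Linux abstract-namespace socket, whose
    # real address starts with a NUL byte instead.
    if path.startswith("@"):
        address = b"\0" + path[1:].encode()
    else:
        address = path.encode()

    # The protocol is datagram based: one message per datagram.
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), address)
    return True
```

A service would call notify("READY=1") once its initialization is complete. Note that the call returns as soon as the datagram is queued, not when the manager has processed it.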

However, the notification protocol was at some point switched to being asynchronous, because deadlocks between dbus, PID 1, and journald manifested due to cyclic dependencies between the three, and the synchronous nature of notifications only added to the problem. This also meant that unless the process that sent the notification stayed around until the message was processed, associating the PID obtained from SCM_CREDENTIALS with its cgroup (and hence the unit) was not possible. This has been known since at least 2016. An attempt was made to solve this with a patch that attached task metadata (including the cgroup) to socket messages, but it was shot down over efficiency concerns. That patch would also have solved a similar race in journald (which remains non-trivial to fix).

Now, this is where PR#11547 comes in. The solution I went with is to add a new BARRIER=1 notification message. The sender must attach a file descriptor to this message, and systemd closes that fd on its end once it has processed the message. The cover letter goes into more detail, but that is the gist of the approach. If one sends the write end of a pipe to the manager (and closes one's own copy of it), one receives POLLHUP on the read end when the manager closes the descriptor it received. Messages are guaranteed to be processed in order, so anything sent before the barrier must have been processed by the time the write end is closed, which lets us synchronize in a lightweight manner. The first user of this sd_notify_barrier(3) call is systemd-notify itself, which previously suffered from the race.
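The pipe trick is easier to follow in code. What follows is a rough Python sketch of the idea, not the libsystemd implementation (which is written in C and exposed as sd_notify_barrier(3)). The function name notify_barrier, the explicit path argument, and the default timeout are assumptions made for illustration; abstract-namespace addresses are left out for brevity.

```python
import array
import os
import select
import socket


def notify_barrier(path: str, timeout_ms: int = 5000) -> bool:
    """Send BARRIER=1 with a pipe's write end attached, then wait for HUP.

    Returns True once the receiver has closed the descriptor it was
    sent, False if that did not happen within timeout_ms.
    """
    read_fd, write_fd = os.pipe()
    try:
        try:
            # Attach the write end to the datagram as SCM_RIGHTS data.
            fds = array.array("i", [write_fd])
            ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)]
            with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
                sock.sendmsg([b"BARRIER=1"], ancillary, 0, path)
        finally:
            # Drop our own copy; from now on only the in-flight copy
            # (and later the receiver's copy) keeps the pipe open.
            os.close(write_fd)

        # POLLHUP is reported once every write end has been closed.
        poller = select.poll()
        poller.register(read_fd, select.POLLHUP)
        events = poller.poll(timeout_ms)
        return any(event & select.POLLHUP for _, event in events)
    finally:
        os.close(read_fd)
```

A short-lived helper would send its regular notification datagrams first, call notify_barrier() last, and exit only once it returns True. Because the messages are processed in order, a True result implies the earlier messages were handled while the sending process was still alive.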

This API isn't very useful for daemons, since they're long-lived anyway, but it wouldn't hurt to call it once at startup: usually, by the time we call poll, systemd has already closed the write end.