レナート   TBFKAYIBYNYAAYB   ﻟﻴﻨﺎﺭﺕ

Fri, 30 Apr 2010

Rethinking PID 1

If you are well connected or good at reading between the lines you might already know what this blog post is about. But even then you may find this story interesting. So grab a cup of coffee, sit down, and read what's coming.

This blog story is long, so even though I can only recommend reading the long story, here's the one sentence summary: we are experimenting with a new init system and it is fun.

Here's the code. And here's the story:

Process Identifier 1

On every Unix system there is one process with the special process identifier 1. It is started by the kernel before all other processes and is the parent process for all those other processes that have nobody else to be child of. Due to that it can do a lot of stuff that other processes cannot do. And it is also responsible for some things that other processes are not responsible for, such as bringing up and maintaining userspace during boot.

Historically on Linux the software acting as PID 1 was the venerable sysvinit package, though it had been showing its age for quite a while. Many replacements have been suggested, but only one of them really took off: Upstart, which has by now found its way into all major distributions.

As mentioned, the central responsibility of an init system is to bring up userspace. And a good init system does that fast. Unfortunately, the traditional SysV init system was not particularly fast.

For a fast and efficient boot-up two things are crucial: to start less, and to start more in parallel.

What does that mean? Starting less means starting fewer services or deferring the starting of services until they are actually needed. There are some services where we know that they will be required sooner or later (syslog, D-Bus system bus, etc.), but for many others this isn't the case. For example, bluetoothd does not need to be running unless a bluetooth dongle is actually plugged in or an application wants to talk to its D-Bus interfaces. Same for a printing system: unless the machine physically is connected to a printer, or an application wants to print something, there is no need to run a printing daemon such as CUPS. Avahi: if the machine is not connected to a network, there is no need to run Avahi, unless some application wants to use its APIs. And even SSH: as long as nobody wants to contact your machine there is no need to run it, as long as it is then started on the first connection. (And admit it, on most machines where sshd might be listening somebody connects to it only every other month or so.)

Starting more in parallel means that if we have to run something, we should not serialize its start-up (as sysvinit does), but run it all at the same time, so that the available CPU and disk IO bandwidth is maxed out, and hence the overall start-up time minimized.

Hardware and Software Change Dynamically

Modern systems (especially general-purpose OSes) are highly dynamic in their configuration and use: they are mobile, different applications are started and stopped, different hardware is added and removed again. An init system that is responsible for maintaining services needs to listen to hardware and software changes. It needs to dynamically start (and sometimes stop) services as they are needed to run a program or enable some hardware.

Most current systems that try to parallelize boot-up still synchronize the start-up of the various daemons involved: since Avahi needs D-Bus, D-Bus is started first, and only when D-Bus signals that it is ready, Avahi is started too. Similarly for other services: libvirtd and X11 need HAL (well, I am considering the Fedora 13 services here, ignore that HAL is obsolete), hence HAL is started first, before libvirtd and X11 are started. And libvirtd also needs Avahi, so it waits for Avahi too. And all of them require syslog, so they all wait until syslog is fully started up and initialized. And so on.

Parallelizing Socket Services

This kind of start-up synchronization results in the serialization of a significant part of the boot process. Wouldn't it be great if we could get rid of the synchronization and serialization cost? Well, we can, actually. For that, we need to understand what exactly the daemons require from each other, and why their start-up is delayed. For traditional Unix daemons, there's one answer to it: they wait until the socket the other daemon offers its services on is ready for connections. Usually that is an AF_UNIX socket in the file-system, but it could be AF_INET[6], too. For example, clients of D-Bus wait until /var/run/dbus/system_bus_socket can be connected to, clients of syslog wait for /dev/log, clients of CUPS wait for /var/run/cups/cups.sock and NFS mounts wait for /var/run/rpcbind.sock and the portmapper IP port, and so on. And think about it, this is actually the only thing they wait for!

Now, if that's all they are waiting for, if we manage to make those sockets available for connection earlier and only actually wait for that instead of the full daemon start-up, then we can speed up the entire boot and start more processes in parallel. So, how can we do that? Actually quite easily in Unix-like systems: we can create the listening sockets before we actually start the daemon, and then just pass the socket during exec() to it. That way, we can create all sockets for all daemons in one step in the init system, and then in a second step run all daemons at once. If a service needs another, and it is not fully started up, that's completely OK: what will happen is that the connection is queued in the providing service and the client will potentially block on that single request. But only that one client will block and only on that one request. Also, dependencies between services will no longer necessarily have to be configured to allow proper parallelized start-up: if we start all sockets at once and a service needs another it can be sure that it can connect to its socket.
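To make the two-step scheme concrete, here is a minimal Python sketch. It is an illustration under assumptions, not code from any init system: the socket path, the request text and the tiny "daemon" program are all made up. The parent plays the role of the init system and creates the listening socket first; a client connects and sends a request before any daemon exists; only then is the daemon started, inheriting the socket across exec().

```python
import os
import socket
import subprocess
import sys
import tempfile

# Program run by the hypothetical "daemon". It never calls bind() or listen():
# it simply adopts the already-listening socket it inherited across exec().
DAEMON_CODE = """
import socket, sys
listener = socket.socket(fileno=int(sys.argv[1]))
conn, _ = listener.accept()
with conn:
    request = conn.recv(1024)
    conn.sendall(b"handled: " + request)
"""


def recv_all(sock):
    """Read from sock until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(1024)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)


def socket_activation_demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sock")  # made-up socket path

        # Step 1 (the "init system"): create the listening socket up front.
        listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        listener.bind(path)
        listener.listen(8)

        # A client connects and sends its request although no daemon is
        # running yet. The kernel queues both the connection and the data.
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)
        client.sendall(b"hello")

        # Step 2: only now start the daemon, passing the socket along.
        daemon = subprocess.Popen(
            [sys.executable, "-c", DAEMON_CODE, str(listener.fileno())],
            pass_fds=[listener.fileno()],
        )

        reply = recv_all(client)  # blocks only until the daemon catches up
        daemon.wait()
        client.close()
        listener.close()
        return reply


if __name__ == "__main__":
    print(socket_activation_demo())
```

Note that the client never learns that the daemon was started after it connected: it just sees a request that took a little longer to answer.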

Because this is at the core of what is following, let me say this again, with different words and by example: if you start syslog and various syslog clients at the same time, what will happen in the scheme pointed out above is that the messages of the clients will be added to the /dev/log socket buffer. As long as that buffer doesn't run full, the clients will not have to wait in any way and can immediately proceed with their start-up. As soon as syslog itself has finished start-up, it will dequeue all messages and process them. Another example: we start D-Bus and several clients at the same time. If a synchronous bus request is sent and hence a reply expected, what will happen is that the client will have to block, however only that one client and only until D-Bus has managed to catch up and process it.

Basically, the kernel socket buffers help us to maximize parallelization, and the ordering and synchronization is done by the kernel, without any further management from userspace! And if all the sockets are available before the daemons actually start up, dependency management also becomes redundant (or at least secondary): if a daemon needs another daemon, it will just connect to it. If the other daemon is already started, this will immediately succeed. If it isn't started but in the process of being started, the first daemon will not even have to wait for it, unless it issues a synchronous request. And even if the other daemon is not running at all, it can be auto-spawned. From the first daemon's perspective there is no difference, hence dependency management becomes mostly unnecessary or at least secondary, and all of this in optimal parallelization and optionally with on-demand loading. On top of this, this is also more robust, because the sockets stay available regardless of whether the actual daemons might temporarily become unavailable (maybe due to crashing). In fact, you can easily write a daemon with this that can run, and exit (or crash), and run again and exit again (and so on), and all of that without the clients noticing or losing any request.

It's a good time for a pause: go and refill your coffee mug, and be assured, there is more interesting stuff following.

But first, let's clear a few things up: is this kind of logic new? No, it certainly is not. The most prominent system that works like this is Apple's launchd system: on MacOS the listening on the sockets is pulled out of all daemons and done by launchd. The services themselves hence can all start up in parallel and dependencies need not be configured for them. And that is actually a really ingenious design, and the primary reason why MacOS manages to provide the fantastic boot-up times it provides. I can highly recommend this video where the launchd folks explain what they are doing. Unfortunately this idea never really caught on outside of the Apple camp.

The idea is actually even older than launchd. Prior to launchd the venerable inetd worked much like this: sockets were centrally created in a daemon that would start the actual service daemons passing the socket file descriptors during exec(). However the focus of inetd certainly wasn't local services, but Internet services (although later reimplementations supported AF_UNIX sockets, too). It also wasn't a tool to parallelize boot-up or even useful for getting implicit dependencies right.

For TCP sockets inetd was primarily used in a way that for every incoming connection a new daemon instance was spawned. That meant that for each connection a new process was spawned and initialized, which is not a recipe for high-performance servers. However, right from the beginning inetd also supported another mode, where a single daemon was spawned on the first connection, and that single instance would then go on and also accept the follow-up connections (that's what the wait/nowait option in inetd.conf was for, a particularly badly documented option, unfortunately.) Per-connection daemon starts probably gave inetd its bad reputation for being slow. But that's not entirely fair.

Parallelizing Bus Services

Modern daemons on Linux tend to provide services via D-Bus instead of plain AF_UNIX sockets. Now, the question is, for those services, can we apply the same parallelizing boot logic as for traditional socket services? Yes, we can, D-Bus already has all the right hooks for it: using bus activation a service can be started the first time it is accessed. Bus activation also gives us the minimal per-request synchronisation we need for starting up the providers and the consumers of D-Bus services at the same time: if we want to start Avahi at the same time as CUPS (side note: CUPS uses Avahi to browse for mDNS/DNS-SD printers), then we can simply run them at the same time, and if CUPS is quicker than Avahi via the bus activation logic we can get D-Bus to queue the request until Avahi manages to establish its service name.

So, in summary: the socket-based service activation and the bus-based service activation together enable us to start all daemons in parallel, without any further synchronization. Activation also allows us to do lazy-loading of services: if a service is rarely used, we can just load it the first time somebody accesses the socket or bus name, instead of starting it during boot.

And if that's not great, then I don't know what is great!

Parallelizing File System Jobs

If you look at the serialization graphs of the boot process of current distributions, there are more synchronisation points than just daemon start-ups: most prominently there are file-system related jobs: mounting, fscking, quota. Right now, on boot-up a lot of time is spent idling to wait until all devices that are listed in /etc/fstab show up in the device tree and are then fsck'ed, mounted, quota checked (if enabled). Only after that is fully finished do we go on and boot the actual services.

Can we improve this? It turns out we can. Harald Hoyer came up with the idea of using the venerable autofs system for this:

Just like a connect() call shows that a service is interested in another service, an open() (or a similar call) shows that a service is interested in a specific file or file-system. So, in order to improve how much we can parallelize we can make those apps wait only if a file-system they are looking for is not yet mounted and readily available: we set up an autofs mount point, and then when our file-system has finished fsck and quota as part of the normal boot-up we replace it by the real mount. While the file-system is not ready yet, the access will be queued by the kernel and the accessing process will block, but only that one daemon and only that one access. And this way we can begin starting our daemons even before all file systems have been fully made available -- without them missing any files, and maximizing parallelization.

Parallelizing file system jobs and service jobs does not make sense for /, after all that's where the service binaries are usually stored. However, for file-systems such as /home, that usually are bigger, even encrypted, possibly remote and seldom accessed by the usual boot-up daemons, this can improve boot time considerably. It is probably not necessary to mention this, but virtual file systems, such as procfs or sysfs should never be mounted via autofs.

I wouldn't be surprised if some readers might find integrating autofs in an init system a bit fragile and even weird, and maybe more on the "crackish" side of things. However, having played around with this extensively I can tell you that this actually feels quite right. Using autofs here simply means that we can create a mount point without having to provide the backing file system right away. In effect it hence only delays accesses. If an application tries to access an autofs file-system and we take very long to replace it with the real file-system, it will hang in an interruptible sleep, meaning that you can safely cancel it, for example via C-c. Also note that at any point, if the mount point should not be mountable in the end (maybe because fsck failed), we can just tell autofs to return a clean error code (like ENOENT). So, I guess what I want to say is that even though integrating autofs into an init system might appear adventurous at first, our experimental code has shown that this idea works surprisingly well in practice -- if it is done for the right reasons and the right way.

Also note that these should be direct autofs mounts, meaning that from an application perspective there's little effective difference between a classic mount point and one based on autofs.

Keeping the First User PID Small

Another thing we can learn from the MacOS boot-up logic is that shell scripts are evil. Shell is fast and shell is slow. It is fast to hack, but slow in execution. The classic sysvinit boot logic is modelled around shell scripts. Whether it is /bin/bash or any other shell (that was written to make shell scripts faster), in the end the approach is doomed to be slow. On my system the scripts in /etc/init.d call grep at least 77 times. awk is called 92 times, cut 23 and sed 74. Every time those commands (and others) are called, a process is spawned, the libraries searched, some start-up stuff like i18n and so on set up and more. And then after seldom doing more than a trivial string operation the process is terminated again. Of course, that has to be incredibly slow. No other language but shell would do something like that. On top of that, shell scripts are also very fragile, and change their behaviour drastically based on environment variables and suchlike, stuff that is hard to oversee and control.
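Counts like the ones above are easy to approximate. The following Python sketch is not how the numbers in this post were obtained; it is a rough static count under the assumption that every occurrence of a command name as a whole word is one invocation. It will also count mentions in comments and knows nothing about loops, so treat the result as a lower-quality estimate. The function name and the sample script are made up for illustration.

```python
import re


def count_command_uses(script_text, commands):
    """Count whole-word occurrences of each command name in a script."""
    counts = {}
    for command in commands:
        # \b keeps "sed" from matching inside words such as "used".
        pattern = r"\b" + re.escape(command) + r"\b"
        counts[command] = len(re.findall(pattern, script_text))
    return counts


if __name__ == "__main__":
    sample = (
        "#!/bin/sh\n"
        "PID=$(ps ax | grep foo | awk '{print $1}')\n"
        "echo used | sed s/a/b/\n"
        "grep -q bar /etc/conf\n"
    )
    print(count_command_uses(sample, ["grep", "awk", "cut", "sed"]))
```

To survey a real system you would read each file below /etc/init.d and sum the per-file counts.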

So, let's get rid of shell scripts in the boot process! Before we can do that we need to figure out what they are currently actually used for: well, the big picture is that most of the time, what they do is actually quite boring. Most of the scripting is spent on trivial setup and tear-down of services, and should be rewritten in C, either in separate executables, or moved into the daemons themselves, or simply be done in the init system.

It is not likely that we can get rid of shell scripts during system boot-up entirely anytime soon. Rewriting them in C takes time, in a few cases does not really make sense, and sometimes shell scripts are just too handy to do without. But we can certainly make them less prominent.

A good metric for measuring shell script infestation of the boot process is the PID number of the first process you can start after the system is fully booted up. Boot up, log in, open a terminal, and type echo $$. Try that on your Linux system, and then compare the result with MacOS! (Hint, it's something like this: Linux PID 1823; MacOS PID 154, measured on test systems we own.)

Keeping Track of Processes

A central part of a system that starts up and maintains services should be process babysitting: it should watch services. Restart them if they shut down. If they crash it should collect information about them, and keep it around for the administrator, and cross-link that information with what is available from crash dump systems such as abrt, and in logging systems like syslog or the audit system.

It should also be capable of shutting down a service completely. That might sound easy, but is harder than you think. Traditionally on Unix a process that does double-forking can escape the supervision of its parent, and the old parent will not learn about the relation of the new process to the one it actually started. An example: currently, a misbehaving CGI script that has double-forked is not terminated when you shut down Apache. Furthermore, you will not even be able to figure out its relation to Apache, unless you know it by name and purpose.
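The double-fork escape is easy to reproduce. Below is a small Python sketch (Unix only, with made-up function and variable names) in which a process forks a child, the child forks a grandchild and exits at once, and the grandchild then reports who its parent is. The two pipes exist only to make the timing deterministic for the demonstration.

```python
import os


def double_fork_demo():
    """Return (pid of the intermediate child, grandchild's parent afterwards)."""
    result_r, result_w = os.pipe()  # grandchild -> original process
    go_r, go_w = os.pipe()          # original process -> grandchild

    intermediate = os.fork()
    if intermediate == 0:
        # Intermediate child: fork once more, then exit immediately.
        if os.fork() == 0:
            # Grandchild: wait until the intermediate child is gone,
            # then report the parent PID the kernel now shows us.
            os.close(go_w)
            os.close(result_r)
            os.read(go_r, 1)
            os.write(result_w, str(os.getppid()).encode())
            os._exit(0)
        os._exit(0)

    # Original process: reap the intermediate child, then release the grandchild.
    os.close(go_r)
    os.close(result_w)
    os.waitpid(intermediate, 0)
    os.write(go_w, b"x")
    os.close(go_w)
    new_parent = int(os.read(result_r, 32))
    os.close(result_r)
    return intermediate, new_parent


if __name__ == "__main__":
    intermediate, new_parent = double_fork_demo()
    print("started child", intermediate, "grandchild now belongs to", new_parent)
```

On a typical system the grandchild ends up as a child of PID 1, and the process that started the whole chain receives no notification when it eventually exits. (If the starting process happens to be PID 1 of a container, the grandchild is handed back to it, which is precisely the privileged position an init system is in.)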

So, how can we keep track of processes, so that they cannot escape the babysitter, and that we can control them as one unit even if they fork a gazillion times?

Different people came up with different solutions for this. I am not going into much detail here, but let's at least say that approaches based on ptrace or the netlink connector (a kernel interface which allows you to get a netlink message each time any process on the system fork()s or exit()s) that some people have investigated and implemented, have been criticised as ugly and not very scalable.

So what can we do about this? Well, for quite a while now the kernel has supported Control Groups (aka "cgroups"). Basically they allow the creation of a hierarchy of groups of processes. The hierarchy is directly exposed in a virtual file-system, and hence easily accessible. The group names are basically directory names in that file-system. If a process belonging to a specific cgroup fork()s, its child will become a member of the same group. Unless it is privileged and has access to the cgroup file system it cannot escape its group. Originally, cgroups were introduced into the kernel for the purpose of containers: certain kernel subsystems can enforce limits on resources of certain groups, such as limiting CPU or memory usage. Traditional resource limits (as implemented by setrlimit()) are (mostly) per-process. cgroups on the other hand let you enforce limits on entire groups of processes. cgroups are also useful to enforce limits outside of the immediate container use case. You can use them for example to limit the total amount of memory or CPU Apache and all its children may use. Then, a misbehaving CGI script can no longer escape your setrlimit() resource control by simply forking away.

In addition to container and resource limit enforcement cgroups are very useful to keep track of daemons: cgroup membership is securely inherited by child processes, they cannot escape. There's a notification system available so that a supervisor process can be notified when a cgroup runs empty. You can find the cgroups of a process by reading /proc/$PID/cgroup. cgroups hence make a very good choice to keep track of processes for babysitting purposes.
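Reading that file is straightforward. Each line of /proc/$PID/cgroup has the form hierarchy-ID:controller-list:cgroup-path. Here is a minimal Python sketch of a parser; the function names are made up, and the sample lines in the usage part are invented rather than taken from a real system.

```python
def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup into a list of tuples.

    Each tuple is (hierarchy_id, [controllers], cgroup_path).
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # The path itself may contain colons, so split at most twice.
        hierarchy_id, controllers, path = line.split(":", 2)
        controller_list = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), controller_list, path))
    return entries


def cgroups_of(pid="self"):
    """Return the parsed cgroup membership of a process (Linux only)."""
    with open(f"/proc/{pid}/cgroup") as f:
        return parse_proc_cgroup(f.read())


if __name__ == "__main__":
    sample = "2:cpu,cpuacct:/daemons/apache\n1:debug:/daemons/apache\n"
    for entry in parse_proc_cgroup(sample):
        print(entry)
```

Because every forked child inherits the same membership, a supervisor can look up which service any stray process belongs to simply by reading this file for its PID.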

Controlling the Process Execution Environment

A good babysitter should not only oversee and control when a daemon starts, ends or crashes, but also set up a good, minimal, and secure working environment for it.

That means setting obvious process parameters such as the setrlimit() resource limits, user/group IDs or the environment block, but does not end there. The Linux kernel gives users and administrators a lot of control over processes (some of it is rarely used, currently). For each process you can set CPU and IO scheduler controls, the capability bounding set, CPU affinity or of course cgroup environments with additional limits, and more.

As an example, ioprio_set() with IOPRIO_CLASS_IDLE is a great way to minimize the effect of locate's updatedb on system interactivity.
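The more obvious parameters mentioned above, resource limits and the environment block, can be sketched in a few lines of Python. This is an illustration only: the DEMO_ROLE variable and the function names are made up, the example touches just one limit (RLIMIT_CORE), and a real supervisor would more likely build the environment from scratch than extend its own. Python's standard library has no wrapper for ioprio_set(), so that call is not shown.

```python
import os
import resource
import subprocess
import sys

# Tiny stand-in for a service: it prints the limit and variable it was given.
CHILD_CODE = (
    "import os, resource;"
    "print(resource.getrlimit(resource.RLIMIT_CORE)[0], os.environ['DEMO_ROLE'])"
)


def _prepare_child():
    # Runs in the child between fork() and exec(): disable core dumps by
    # lowering the soft limit to 0 while leaving the hard limit untouched.
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (0, hard))


def run_confined():
    """Start the stand-in service with a tweaked limit and environment."""
    env = dict(os.environ, DEMO_ROLE="service")  # DEMO_ROLE is hypothetical
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        preexec_fn=_prepare_child,
        stdout=subprocess.PIPE,
        check=True,
    )
    return completed.stdout.decode().split()


if __name__ == "__main__":
    print(run_confined())
```

The important point is where the settings are applied: in the child, after fork() and before exec(), so the service starts its life inside the environment the babysitter chose and the babysitter itself stays unaffected.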

On top of that certain high-level controls can be very useful, such as setting up read-only file system overlays based on read-only bind mounts. That way one can run certain daemons so that all (or some) file systems appear read-only to them, so that EROFS is returned on every write request. As such this can be used to lock down what daemons can do similar in fashion to a poor man's SELinux policy system (but this certainly doesn't replace SELinux, don't get any bad ideas, please).

Finally logging is an important part of executing services: ideally every bit of output a service generates should be logged away. An init system should hence provide logging to daemons it spawns right from the beginning, and connect stdout and stderr to syslog or in some cases even /dev/kmsg which in many cases makes a very useful replacement for syslog (embedded folks, listen up!), especially in times where the kernel log buffer is configured ridiculously large out-of-the-box.

On Upstart

To begin with, let me emphasize that I actually like the code of Upstart, it is very well commented and easy to follow. It's certainly something other projects should learn from (including my own).

That being said, I can't say I agree with the general approach of Upstart. But first, a bit more about the project:

Upstart does not share code with sysvinit, and its functionality is a super-set of it, and provides compatibility to some degree with the well known SysV init scripts. Its main feature is its event-based approach: starting and stopping of processes is bound to "events" happening in the system, where an "event" can be a lot of different things, such as: a network interface becomes available or some other software has been started.

Upstart does service serialization via these events: if the syslog-started event is triggered this is used as an indication to start D-Bus since it can now make use of Syslog. And then, when dbus-started is triggered, NetworkManager is started, since it may now use D-Bus, and so on.

One could say that this way the actual logical dependency tree that exists and is understood by the admin or developer is translated and encoded into event and action rules: every logical "a needs b" rule that the administrator/developer is aware of becomes a "start a when b is started" plus "stop a when b is stopped". In some way this certainly is a simplification: especially for the code in Upstart itself. However I would argue that this simplification is actually detrimental. First of all, the logical dependency system does not go away, the person who is writing Upstart files must now translate the dependencies manually into these event/action rules (actually, two rules for each dependency). So, instead of letting the computer figure out what to do based on the dependencies, the user has to manually translate the dependencies into simple event/action rules. Also, because the dependency information has never been encoded it is not available at runtime, effectively meaning that an administrator who tries to figure out why something happened, i.e. why a is started when b is started, has no chance of finding that out.

Furthermore, the event logic turns around all dependencies, from the feet onto their head. Instead of minimizing the amount of work (which is something that a good init system should focus on, as pointed out in the beginning of this blog story), it actually maximizes the amount of work to do during operations. Or in other words, instead of having a clear goal and only doing the things it really needs to do to reach the goal, it does one step, and then after finishing it, it does all steps that possibly could follow it.

Or to put it simpler: the fact that the user just started D-Bus is in no way an indication that NetworkManager should be started too (but this is what Upstart would do). It's right the other way round: when the user asks for NetworkManager, that is definitely an indication that D-Bus should be started too (which is certainly what most users would expect, right?).

A good init system should start only what is needed, and that on-demand. Either lazily or parallelized and in advance. However it should not start more than necessary, particularly not everything installed that could use that service.
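The difference between the two directions can be shown with a toy model. The Python sketch below is neither Upstart's nor systemd's algorithm; it is a made-up illustration that assumes a small, acyclic "a needs b" table. start_order() walks from a requested goal down to what it needs, while event_cascade() mimics "start a when b is started" rules and walks the other way.

```python
# Hypothetical "a needs b" table, used only for illustration.
REQUIRES = {
    "NetworkManager": ["dbus"],
    "avahi": ["dbus"],
    "dbus": ["syslog"],
    "syslog": [],
}


def start_order(goal, requires=REQUIRES):
    """Goal-driven: list only what `goal` needs, dependencies first.

    Assumes the dependency table contains no cycles.
    """
    order = []

    def visit(unit):
        if unit in order:
            return
        for dependency in requires.get(unit, []):
            visit(dependency)
        order.append(unit)

    visit(goal)
    return order


def event_cascade(started, requires=REQUIRES):
    """Event-driven: list everything that follows once `started` is up."""
    cascade = []
    frontier = [started]
    while frontier:
        current = frontier.pop(0)
        for unit, dependencies in sorted(requires.items()):
            if current in dependencies and unit not in cascade:
                cascade.append(unit)
                frontier.append(unit)
    return cascade


if __name__ == "__main__":
    print(start_order("NetworkManager"))
    print(event_cascade("syslog"))
```

Asking for NetworkManager pulls in exactly syslog and D-Bus, whereas the cascade that begins with syslog ends up starting Avahi as well, although nobody asked for it.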

Finally, I fail to see the actual usefulness of the event logic. It appears to me that most events that are exposed in Upstart actually are not punctual in nature, but have duration: a service starts, is running, and stops. A device is plugged in, is available, and is plugged out again. A mount point is in the process of being mounted, is fully mounted, or is being unmounted. A power plug is plugged in, the system runs on AC, and the power plug is pulled. Only a minority of the events an init system or process supervisor should handle are actually punctual, most of them are tuples of start, condition, and stop. This information is again not available in Upstart, because it focuses on singular events, and ignores durable dependencies.

Now, I am aware that some of the issues I pointed out above are in some way mitigated by certain more recent changes in Upstart, particularly condition based syntaxes such as start on (local-filesystems and net-device-up IFACE=lo) in Upstart rule files. However, to me this appears mostly as an attempt to fix a system whose core design is flawed.

Besides that Upstart does OK for babysitting daemons, even though some choices might be questionable (see above), and there are certainly a lot of missed opportunities (see above, too).

There are other init systems besides sysvinit, Upstart and launchd. Most of them offer little of substance beyond what Upstart or sysvinit provide. The most interesting other contender is Solaris SMF, which supports proper dependencies between services. However, in many ways it is overly complex and, let's say, a bit academic with its excessive use of XML and new terminology for known things. It is also closely bound to Solaris specific features such as the contract system.

Putting it All Together: systemd

Well, this is another good time for a little pause, because after I have hopefully explained above what I think a good PID 1 should be doing and what the current most used system does, we'll now come to where the beef is. So, go and refill your coffee mug again. It's going to be worth it.

You probably guessed it: what I suggested above as requirements and features for an ideal init system is actually available now, in a (still experimental) init system called systemd, which I hereby want to announce. Again, here's the code. And here's a quick rundown of its features, and the rationale behind them:

systemd starts up and supervises the entire system (hence the name...). It implements all of the features pointed out above and a few more. It is based around the notion of units. Units have a name and a type. Since their configuration is usually loaded directly from the file system, these unit names are actually file names. Example: a unit avahi.service is read from a configuration file by the same name, and of course could be a unit encapsulating the Avahi daemon. There are several kinds of units:

  1. service: these are the most obvious kind of unit: daemons that can be started, stopped, restarted, reloaded. For compatibility with SysV we not only support our own configuration files for services, but also are able to read classic SysV init scripts, in particular we parse the LSB header, if it exists. /etc/init.d is hence not much more than just another source of configuration.
  2. socket: this unit encapsulates a socket in the file-system or on the Internet. We currently support AF_INET, AF_INET6, AF_UNIX sockets of the types stream, datagram, and sequential packet. We also support classic FIFOs as transport. Each socket unit has a matching service unit, that is started if the first connection comes in on the socket or FIFO. Example: nscd.socket starts nscd.service on an incoming connection.
  3. device: this unit encapsulates a device in the Linux device tree. If a device is marked for this via udev rules, it will be exposed as a device unit in systemd. Properties set with udev can be used as configuration source to set dependencies for device units.
  4. mount: this unit encapsulates a mount point in the file system hierarchy. systemd monitors all mount points as they come and go, and can also be used to mount or unmount mount-points. /etc/fstab is used here as an additional configuration source for these mount points, similar to how SysV init scripts can be used as additional configuration source for service units.
  5. automount: this unit type encapsulates an automount point in the file system hierarchy. Each automount unit has a matching mount unit, which is started (i.e. mounted) as soon as the automount directory is accessed.
  6. target: this unit type is used for logical grouping of units: instead of actually doing anything by itself it simply references other units, which thereby can be controlled together. Examples for this are: multi-user.target, which is a target that basically plays the role of run-level 5 on a classic SysV system, or bluetooth.target which is requested as soon as a bluetooth dongle becomes available and which simply pulls in bluetooth related services that otherwise would not need to be started: bluetoothd and obexd and suchlike.
  7. snapshot: similar to target units snapshots do not actually do anything themselves and their only purpose is to reference other units. Snapshots can be used to save/rollback the state of all services and units of the init system. Primarily it has two intended use cases: to allow the user to temporarily enter a specific state such as "Emergency Shell", terminating current services, and provide an easy way to return to the state before, pulling up all services again that got temporarily pulled down. And to ease support for system suspending: still many services cannot correctly deal with system suspend, and it is often a better idea to shut them down before suspend, and restore them afterwards.

All these units can have dependencies between each other (both positive and negative, i.e. 'Requires' and 'Conflicts'): a device can have a dependency on a service, meaning that as soon as a device becomes available a certain service is started. Mounts get an implicit dependency on the device they are mounted from. Mounts also get implicit dependencies on mounts that are their prefixes (i.e. a mount /home/lennart implicitly gets a dependency added to the mount for /home) and so on.
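The prefix rule for mounts is simple enough to sketch. The Python snippet below is an illustration and not systemd's implementation; the function name and the list of mount points are made up. Given all known mount points, it returns those that a particular mount implicitly depends on, shortest prefix first.

```python
from pathlib import PurePosixPath


def implicit_mount_deps(mount_point, all_mounts):
    """Return the mounts in all_mounts that are path prefixes of mount_point."""
    parents = {str(parent) for parent in PurePosixPath(mount_point).parents}
    return sorted((m for m in all_mounts if m in parents), key=len)


if __name__ == "__main__":
    mounts = ["/", "/home", "/home/lennart", "/var"]
    print(implicit_mount_deps("/home/lennart", mounts))
```

Comparing whole path components rather than raw string prefixes matters here: /home is a prefix mount of /home/lennart, but not of /homework.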

A short list of other features:

  1. For each process that is spawned, you may control: the environment, resource limits, working and root directory, umask, OOM killer adjustment, nice level, IO class and priority, CPU policy and priority, CPU affinity, timer slack, user id, group id, supplementary group ids, readable/writable/inaccessible directories, shared/private/slave mount flags, capabilities/bounding set, secure bits, CPU scheduler reset on fork, private /tmp name-space, cgroup control for various subsystems. Also, you can easily connect stdin/stdout/stderr of services to syslog, /dev/kmsg, arbitrary TTYs. If connected to a TTY for input systemd will make sure a process gets exclusive access, optionally waiting or enforcing it.
  2. Every executed process gets its own cgroup (currently by default in the debug subsystem, since that subsystem is not otherwise used and does not do much more than the most basic process grouping), and it is very easy to configure systemd to place services in cgroups that have been configured externally, for example via the libcgroup utilities.
  3. The native configuration files use a syntax that closely follows the well-known .desktop files. It is a simple syntax for which parsers exist already in many software frameworks. Also, this allows us to rely on existing tools for i18n for service descriptions, and similar. Administrators and developers don't need to learn a new syntax.
  4. As mentioned, we provide compatibility with SysV init scripts. We take advantage of LSB and Red Hat chkconfig headers if they are available. If they aren't we try to make the best of the otherwise available information, such as the start priorities in /etc/rc.d. These init scripts are simply considered a different source of configuration, hence an easy upgrade path to proper systemd services is available. Optionally we can read classic PID files for services to identify the main pid of a daemon. Note that we make use of the dependency information from the LSB init script headers, and translate those into native systemd dependencies. Side note: Upstart is unable to harvest and make use of that information. Boot-up on a plain Upstart system with mostly LSB SysV init scripts will hence not be parallelized, a similar system running systemd however will. In fact, for Upstart all SysV scripts together make one job that is executed, they are not treated individually, again in contrast to systemd where SysV init scripts are just another source of configuration and are all treated and controlled individually, much like any other native systemd service.
  5. Similarly, we read the existing /etc/fstab configuration file, and consider it just another source of configuration. Using the comment= fstab option you can even mark /etc/fstab entries to become systemd controlled automount points.
  6. If the same unit is configured in multiple configuration sources (e.g. /etc/systemd/system/avahi.service exists, and /etc/init.d/avahi too), then the native configuration will always take precedence, the legacy format is ignored, allowing an easy upgrade path and packages to carry both a SysV init script and a systemd service file for a while.
  7. We support a simple templating/instance mechanism. Example: instead of having six configuration files for six gettys, we only have one getty@.service file which gets instantiated to getty@tty2.service and suchlike. The interface part can even be inherited by dependency expressions, i.e. it is easy to encode that a service dhcpcd@eth0.service pulls in avahi-autoipd@eth0.service, while leaving the eth0 string wild-carded.
  8. For socket activation we support full compatibility with the traditional inetd modes, as well as a very simple mode that tries to mimic launchd socket activation and is recommended for new services. The inetd mode only allows passing one socket to the started daemon, while the native mode supports passing arbitrary numbers of file descriptors. We also support one instance per connection, as well as one instance for all connections modes. In the former mode we name the cgroup the daemon will be started in after the connection parameters, and utilize the templating logic mentioned above for this. Example: sshd.socket might spawn services sshd@192.168.0.1-4711-192.168.0.2-22.service with a cgroup of sshd@.service/192.168.0.1-4711-192.168.0.2-22 (i.e. the IP address and port numbers are used in the instance names. For AF_UNIX sockets we use PID and user id of the connecting client). This provides a nice way for the administrator to identify the various instances of a daemon and control their runtime individually. The native socket passing mode is very easily implementable in applications: if $LISTEN_FDS is set it contains the number of sockets passed and the daemon will find them sorted as listed in the .service file, starting from file descriptor 3 (a nicely written daemon could also use fstat() and getsockname() to identify the sockets in case it receives more than one). In addition we set $LISTEN_PID to the PID of the daemon that shall receive the fds, because environment variables are normally inherited by sub-processes and hence could confuse processes further down the chain. Even though this socket passing logic is very simple to implement in daemons, we will provide a BSD-licensed reference implementation that shows how to do this. We have ported a couple of existing daemons to this new scheme.
  9. We provide compatibility with /dev/initctl to a certain extent. This compatibility is in fact implemented with a FIFO-activated service, which simply translates these legacy requests to D-Bus requests. Effectively this means the old shutdown, poweroff and similar commands from Upstart and sysvinit continue to work with systemd.
  10. We also provide compatibility with utmp and wtmp. Possibly even to an extent that is far more than healthy, given how crufty utmp and wtmp are.
  11. systemd supports several kinds of dependencies between units. After/Before can be used to fix the order in which units are activated. It is completely orthogonal to Requires and Wants, which express a positive requirement dependency, either mandatory, or optional. Then, there is Conflicts which expresses a negative requirement dependency. Finally, there are three further, less used dependency types.
  12. systemd has a minimal transaction system. Meaning: if a unit is requested to start up or shut down we will add it and all its dependencies to a temporary transaction. Then, we will verify if the transaction is consistent (i.e. whether the ordering via After/Before of all units is cycle-free). If it is not, systemd will try to fix it up, and remove non-essential jobs from the transaction that might remove the loop. Also, systemd tries to suppress non-essential jobs in the transaction that would stop a running service. Non-essential jobs are those which the original request did not directly include but which were pulled in by Wants type of dependencies. Finally we check whether the jobs of the transaction contradict jobs that have already been queued, and optionally the transaction is aborted then. If all worked out and the transaction is consistent and minimized in its impact it is merged with all already outstanding jobs and added to the run queue. Effectively this means that before executing a requested operation, we will verify that it makes sense, fixing it if possible, and only failing if it really cannot work.
  13. We record start/exit time as well as the PID and exit status of every process we spawn and supervise. This data can be used to cross-link daemons with their data in abrtd, auditd and syslog. Think of a UI that will highlight crashed daemons for you, and allows you to easily navigate to the respective UIs for syslog, abrt, and auditd that will show the data generated from and for this daemon on a specific run.
  14. We support reexecution of the init process itself at any time. The daemon state is serialized before the reexecution and deserialized afterwards. That way we provide a simple way to facilitate init system upgrades as well as handover from an initrd daemon to the final daemon. Open sockets and autofs mounts are properly serialized away, so that they stay connectible all the time, in a way that clients will not even notice that the init system reexecuted itself. Also, the fact that a big part of the service state is encoded anyway in the cgroup virtual file system would even allow us to resume execution without access to the serialization data. The reexecution code paths are actually mostly the same as the init system configuration reloading code paths, which guarantees that reexecution (which is probably more seldom triggered) gets similar testing as reloading (which is probably more common).
  15. As a first step towards removing shell scripts from the boot process, we have reimplemented part of the basic system setup in C and moved it directly into systemd. This includes mounting the API file systems (i.e. virtual file systems such as /proc, /sys and /dev) and setting the host-name.
  16. Server state is introspectable and controllable via D-Bus. This is not complete yet but quite extensive.
  17. While we want to emphasize socket-based and bus-name-based activation, and we hence support dependencies between sockets and services, we also support traditional inter-service dependencies. We support multiple ways how such a service can signal its readiness: by forking and having the start process exit (i.e. traditional daemonize() behaviour), as well as by watching the bus until a configured service name appears.
  18. There's an interactive mode which asks for confirmation each time a process is spawned by systemd. You may enable it by passing systemd.confirm_spawn=1 on the kernel command line.
  19. With the systemd.default= kernel command line parameter you can specify which unit systemd should start on boot-up. Normally you'd specify something like multi-user.target here, but another choice could even be a single service instead of a target. For example, out-of-the-box we ship a service emergency.service that is similar in its usefulness to init=/bin/bash, but has the advantage of actually running the init system, hence offering the option to boot up the full system from the emergency shell.
  20. There's a minimal UI that allows you to start/stop/introspect services. It's far from complete but useful as a debugging tool. It's written in Vala (yay!) and goes by the name of systemadm.
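
The consistency check of the transaction system (item 12 above) can be sketched roughly like this. This is a simplified Python illustration under my own assumptions, not the actual systemd code: jobs are plain strings, after[a] is the set of jobs that must be started before a, and the names find_cycle and make_consistent are made up for this example.

```python
def find_cycle(jobs, after):
    """Return a list of jobs forming an ordering cycle, or None if there is none.

    `after[a]` is the set of jobs that have to be started before `a`.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {job: WHITE for job in jobs}
    stack = []

    def visit(job):
        color[job] = GREY
        stack.append(job)
        for dep in sorted(after.get(job, ())):
            if dep not in color:
                continue  # this dependency is not part of the transaction
            if color[dep] == GREY:
                return stack[stack.index(dep):]  # walked back into our own path
            if color[dep] == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        color[job] = BLACK
        return None

    for job in sorted(jobs):
        if color[job] == WHITE:
            cycle = visit(job)
            if cycle:
                return cycle
    return None


def make_consistent(jobs, after, essential):
    """Drop non-essential jobs until the ordering is cycle-free.

    Raises ValueError if a cycle consists of essential jobs only.
    """
    jobs = set(jobs)
    while True:
        cycle = find_cycle(jobs, after)
        if cycle is None:
            return jobs
        droppable = [job for job in cycle if job not in essential]
        if not droppable:
            raise ValueError("ordering cycle among essential jobs: %s" % cycle)
        jobs.discard(droppable[0])


if __name__ == "__main__":
    after = {
        "a.service": {"b.service"},
        "b.service": {"c.service"},
        "c.service": {"a.service"},
    }
    # c.service was only pulled in via Wants, so it may be dropped.
    print(sorted(make_consistent(after.keys(), after, {"a.service", "b.service"})))
```

The sketch only covers the "verify and fix up" part. Suppressing jobs that would stop running services and merging with already queued jobs are left out.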

It should be noted that systemd uses many Linux-specific features, and does not limit itself to POSIX. That unlocks a lot of functionality a system that is designed for portability to other operating systems cannot provide.

Status

All the features listed above are already implemented. Right now systemd can already be used as a drop-in replacement for Upstart and sysvinit (at least as long as there aren't too many native Upstart services yet. Thankfully most distributions don't carry too many native Upstart services yet.)

However, testing has been minimal, our version number is currently at an impressive 0. Expect breakage if you run this in its current state. That said, overall it should be quite stable and some of us already boot their normal development systems with systemd (in contrast to VMs only). YMMV, especially if you try this on distributions we developers don't use.

Where is This Going?

The feature set described above is certainly already comprehensive. However, we have a few more things on our plate. I don't really like speaking too much about big plans but here's a short overview of the direction in which we will be pushing this:

We want to add at least two more unit types: swap shall be used to control swap devices the same way we already control mounts, i.e. with automatic dependencies on the device tree devices they are activated from, and suchlike. timer shall provide functionality similar to cron, i.e. start services based on time events, the focus being both monotonic clock and wall-clock/calendar events (i.e. "start this 5h after it last ran" as well as "start this every Monday at 5 am").
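
To illustrate the difference between the two kinds of time events, here is a small Python sketch. Timer units do not exist yet, so everything here is an assumption for illustration: the function names are made up, and this says nothing about how timers would eventually be configured.

```python
from datetime import datetime, timedelta


def next_after_last_run(last_run_monotonic, interval_seconds):
    """Monotonic event: fire a fixed interval after the last run.

    Both values are seconds on a monotonic clock, so wall-clock jumps
    do not affect the result.
    """
    return last_run_monotonic + interval_seconds


def next_weekly(now, weekday, hour):
    """Calendar event: next `weekday` (Monday == 0) at `hour`:00, strictly after `now`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    # "start this 5h after it last ran"
    print(next_after_last_run(100.0, 5 * 3600))
    # "start this every Monday at 5 am", asked on Fri, 30 Apr 2010, 10:46
    print(next_weekly(datetime(2010, 4, 30, 10, 46), weekday=0, hour=5))
```

The calendar variant depends on the wall clock, which is why it has to be recomputed whenever the wall clock is changed, while the monotonic variant does not.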

More importantly however, it is also our plan to experiment with systemd not only for optimizing boot times, but also to make it the ideal session manager, to replace (or possibly just augment) gnome-session, kdeinit and similar daemons. The problem set of a session manager and an init system are very similar: quick start-up is essential and babysitting processes the focus. Using the same code for both uses hence suggests itself. Apple recognized that and does just that with launchd. And so should we: socket and bus based activation and parallelization is something session services and system services can benefit from equally.

I should probably note that all three of these features are already partially available in the current code base, but not complete yet. For example, you can already run systemd just fine as a normal user, and it will detect that it is run that way; support for this mode has been available since the very beginning, and is in the very core. (It is also exceptionally useful for debugging! This works fine even without having the system otherwise converted to systemd for booting.)

However, there are some things we probably should fix in the kernel and elsewhere before finishing work on this: we need swap status change notifications from the kernel similar to how we can already subscribe to mount changes; we want a notification when CLOCK_REALTIME jumps relative to CLOCK_MONOTONIC; we want to allow normal processes to get some init-like powers; we need a well-defined place where we can put user sockets. None of these issues are really essential for systemd, but they'd certainly improve things.

You Want to See This in Action?

Currently, there are no tarball releases, but it should be straightforward to check out the code from our repository. In addition, to have something to start with, here's a tarball with unit configuration files that allows an otherwise unmodified Fedora 13 system to work with systemd. We have no RPMs to offer you for now.

An easier way is to download this Fedora 13 qemu image, which has been prepared for systemd. In the grub menu you can select whether you want to boot the system with Upstart or systemd. Note that this system is minimally modified only. Service information is read exclusively from the existing SysV init scripts. Hence it will not take advantage of the full socket and bus-based parallelization pointed out above, however it will interpret the parallelization hints from the LSB headers, and hence boots faster than the Upstart system, which in Fedora does not employ any parallelization at the moment. The image is configured to output debug information on the serial console, as well as writing it to the kernel log buffer (which you may access with dmesg). You might want to run qemu configured with a virtual serial terminal. All passwords are set to systemd.

Even simpler than downloading and booting the qemu image is looking at pretty screen-shots. Since an init system usually is well hidden beneath the user interface, some shots of systemadm and ps must do:

[Screenshot: systemadm]

That's systemadm showing all loaded units, with more detailed information on one of the getty instances.

[Screenshot: ps]

That's an excerpt of the output of ps xaf -eo pid,user,args,cgroup showing how neatly the processes are sorted into the cgroup of their service. (The fourth column is the cgroup, the debug: prefix is shown because we use the debug cgroup controller for systemd, as mentioned earlier. This is only temporary.)

Note that both of these screenshots show an only minimally modified Fedora 13 Live CD installation, where services are exclusively loaded from the existing SysV init scripts. Hence, this does not use socket or bus activation for any existing service.

Sorry, no bootcharts or hard data on start-up times for the moment. We'll publish that as soon as we have fully parallelized all services from the default Fedora install. Then, we'll welcome you to benchmark the systemd approach, and provide our own benchmark data as well.

Well, presumably everybody will keep bugging me about this, so here are two numbers I'll tell you. However, they are completely unscientific as they are measured for a VM (single CPU) and by using the stop timer in my watch. Fedora 13 booting up with Upstart takes 27s, with systemd we reach 24s (from grub to gdm, same system, same settings, shorter value of two bootups, one immediately following the other). Note however that this shows nothing more than the speedup effect reached by using the LSB dependency information parsed from the init script headers for parallelization. Socket or bus based activation was not utilized for this, and hence these numbers are unsuitable to assess the ideas pointed out above. Also, systemd was set to debug verbosity levels on a serial console. So again, this benchmark data has barely any value.

Writing Daemons

An ideal daemon for use with systemd does a few things differently than they were traditionally done. Later on, we will publish a longer guide explaining and suggesting how to write a daemon for use with systemd. Basically, things get simpler for daemon developers:

The list above is very similar to what Apple recommends for daemons compatible with launchd. It should be easy to extend daemons that already support launchd activation to support systemd activation as well.

Note that systemd supports daemons not written in this style perfectly as well, already for compatibility reasons (launchd has only limited support for that). As mentioned, this even extends to existing inetd capable daemons which can be used unmodified for socket activation by systemd.

So, yes, should systemd prove itself in our experiments and get adopted by the distributions it would make sense to port at least those services that are started by default to use socket or bus-based activation. We have written proof-of-concept patches, and the porting turned out to be very easy. Also, we can leverage the work that has already been done for launchd, to a certain extent. Moreover, adding support for socket-based activation does not make the service incompatible with non-systemd systems.
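
To show how little is needed on the daemon side, and how a daemon can stay usable without systemd, here is a minimal sketch in Python. It follows the $LISTEN_FDS/$LISTEN_PID convention described earlier in this post; the function names, the constant LISTEN_FDS_START and the fallback behaviour are assumptions for illustration, and this is not the reference implementation mentioned above.

```python
import os
import socket

LISTEN_FDS_START = 3  # passed descriptors start at fd 3, as described above


def passed_fds(environ=None, pid=None):
    """Return the file descriptors handed over by the init system, or [].

    `environ` and `pid` default to the real environment and PID; they are
    parameters only to make the function easy to try out.
    """
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    try:
        if int(environ["LISTEN_PID"]) != pid:
            return []  # variables were inherited; the sockets are not for us
        count = int(environ["LISTEN_FDS"])
    except (KeyError, ValueError):
        return []
    if count < 0:
        return []
    return list(range(LISTEN_FDS_START, LISTEN_FDS_START + count))


def get_listen_socket(host, port, environ=None, pid=None):
    """Use the first passed socket if there is one, otherwise bind our own."""
    fds = passed_fds(environ, pid)
    if fds:
        # Socket-activated: wrap the already listening descriptor.
        return socket.socket(fileno=fds[0])
    # Not socket-activated (e.g. started from a SysV script): classic setup.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(16)
    return sock


if __name__ == "__main__":
    print(passed_fds({"LISTEN_PID": "42", "LISTEN_FDS": "2"}, pid=42))
    print(passed_fds({"LISTEN_PID": "41", "LISTEN_FDS": "2"}, pid=42))
```

The fallback branch is what keeps such a daemon working on non-systemd systems: if nothing was passed, it simply sets up its listening socket the way it always did.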

FAQs

Who's behind this?
Well, the current code-base is mostly my work, Lennart Poettering (Red Hat). However the design in all its details is the result of close cooperation between Kay Sievers (Novell) and me. Other people involved are Harald Hoyer (Red Hat), Dhaval Giani (formerly IBM), and a few others from various companies such as Intel, SUSE and Nokia.
Is this a Red Hat project?
No, this is my personal side project. Also, let me emphasize this: the opinions reflected here are my own. They are not the views of my employer, or Ronald McDonald, or anyone else.
Will this come to Fedora?
If our experiments prove that this approach works out, and discussions in the Fedora community show support for this, then yes, we'll certainly try to get this into Fedora.
Will this come to OpenSUSE?
Kay's pursuing that, so something similar as for Fedora applies here, too.
Will this come to Debian/Gentoo/Mandriva/MeeGo/Ubuntu/[insert your favourite distro here]?
That's up to them. We'd certainly welcome their interest, and help with the integration.
Why didn't you just add this to Upstart, why did you invent something new?
Well, the point of the part about Upstart above was to show that the core design of Upstart is flawed, in our opinion. Starting completely from scratch suggests itself if the existing solution appears flawed in its core. However, note that we took a lot of inspiration from Upstart's code-base otherwise.
If you love Apple launchd so much, why not adopt that?
launchd is a great invention, but I am not convinced that it would fit well into Linux, nor that it is suitable for a system like Linux with its immense scalability and flexibility to numerous purposes and uses.
Is this an NIH project?
Well, I hope that I managed to explain in the text above why we came up with something new, instead of building on Upstart or launchd. We came up with systemd due to technical reasons, not political reasons.
Don't forget that it is Upstart that includes a library called NIH (which is kind of a reimplementation of glib) -- not systemd!
Will this run on [insert non-Linux OS here]?
Unlikely. As pointed out, systemd uses many Linux-specific APIs (such as epoll, signalfd, libudev, cgroups, and numerous more), so a port to other operating systems does not appear to us to make a lot of sense. Also, we, the people involved, are unlikely to be interested in merging possible ports to other platforms and working with the constraints this introduces. That said, git supports branches and rebasing quite well, in case people really want to do a port.
Actually portability is even more limited than just to other OSes: we require a very recent Linux kernel, glibc, libcgroup and libudev. No support for less-than-current Linux systems, sorry.
If folks want to implement something similar for other operating systems, the preferred mode of cooperation is probably that we help you identify which interfaces can be shared with your system, to make life easier for daemon writers to support both systemd and your systemd counterpart. Probably, the focus should be to share interfaces, not code.
I hear [fill one in here: the Gentoo boot system, initng, Solaris SMF, runit, uxlaunch, ...] is an awesome init system and also does parallel boot-up, so why not adopt that?
Well, before we started this we actually had a very close look at the various systems, and none of them did what we had in mind for systemd (with the exception of launchd, of course). If you cannot see that, then please read again what I wrote above.

Contributions

We are very interested in patches and help. It should be common sense that every Free Software project can only benefit from the widest possible external contributions. That is particularly true for a core part of the OS, such as an init system. We value your contributions and hence do not require copyright assignment (Very much unlike Canonical/Upstart!). And also, we use git, everybody's favourite VCS, yay!

We are particularly interested in help getting systemd to work on other distributions, besides Fedora and OpenSUSE. (Hey, anybody from Debian, Gentoo, Mandriva, MeeGo looking for something to do?) But even beyond that we are keen to attract contributors on every level: we welcome C hackers, packagers, as well as folks who are interested in writing documentation, or contributing a logo.

Community

At this time we only have a source code repository and an IRC channel (#systemd on Freenode). There's no mailing list, web site or bug tracking system. We'll probably set something up on freedesktop.org soon. If you have any questions or want to contact us otherwise we invite you to join us on IRC!

Update: our GIT repository has moved.

posted at: 10:46 | path: /projects | 174 comments


Posted by Raphael (esarbee) at Fri Apr 30 13:12:10 2010
Woah, quite a read! While I somewhat dislike the idea of yet another services management solution, I like it coming from you. You keep rocking the boat - as you did with PA, which I like very much - and that's a good thing.

I admit that I didn't read it with the attention it deserved and promise to do so later today. I do hope, however, that creating custom services will remain at least as easy as it is now. ;)

Posted by Alex Murray at Fri Apr 30 13:27:26 2010
This sounds like the kind of innovation Linux needs - a clearly well thought out solution to a problem, not just someone scratching an itch. Great work as always Lennart. The simplicity of the design (using implicit dependencies rather than hard-coding them) is awesome. Sounds perfect for the embedded space as well.

Would be awesome to see this get picked up by the big players (Ubuntu, Fedora, OpenSUSE, Debian etc).

Posted by Marco Barisione at Fri Apr 30 13:46:42 2010
First you broke networking then sound and now booting? :P

Posted by Kay Sievers at Fri Apr 30 14:02:24 2010
Sounds great, nice announcement. It runs well here on my box. Still a looong way to go ...

Happy so far, and good to know that all the many hours we spent on the phone lead to something that matters - however it will look like in the end. :)

Posted by Luiz Augusto von Dentz at Fri Apr 30 14:09:12 2010
Pretty amazing I must say, I just wonder now if systemd would incorporate things like powertop, monitoring processes/detecting process responsiveness and things like that.

Posted by Michael Scherer at Fri Apr 30 14:29:07 2010
This looks nice. But maybe you should have split the article into smaller pieces, and published them one by one.

About swap, do you think systemd could be extended to dynamically create swap files on the fly, as done on OS X, as part of the babysitting? This would allow distributions to have a simpler partitioning step, since users will no longer have to care about this. Of course, some barriers should be added to avoid filling the hard drive with swap files (and of course, we should be sure that swap files are as fast as swap partitions).

Posted by Joshua Pritikin at Fri Apr 30 15:15:04 2010
It sounds like you are fixing real design flaws in upstart. I hope you are aware of runit and bcron. http://smarden.org/runit/ http://untroubled.org/bcron/bcron.html

These are not complete solutions like you are proposing with systemd, but you ought to be familiar with the design of these two tools. I find them exceptionally well engineered.

Posted by John Drinkwater at Fri Apr 30 16:19:07 2010
What was the reason for naming services as /etc/systemd/system/avahi.service rather than /etc/systemd/system/service/avahi (same goes for all units)
Would be more readable (in ps, etc), and get rid of file name extension creep…

Posted by Grahame Bowland at Fri Apr 30 16:21:44 2010
A major advantage of the startup sequence being in shell is that an administrator can easily insert bits of code to track down problems. It sounds like your design will make it quite a bit more difficult to track down odd things.

For example, I had a RHEL5.5 machine the other day with a dodgy autofs setup; whenever 'autofs' started it remounted '/' readonly. Easy to track down at the moment, but it sounds like it might be trickier with systemd.

While there's a bit of a performance hit, I think on servers where you're booting very infrequently bootup speed is worth trading for determinism and transparency, plus the ability to modify and debug the system easily.

So, to be positive: how would you approach figuring out which startup script / service is causing a nasty problem under systemd?

Posted by Damien Thebault at Fri Apr 30 16:22:24 2010
This looks really nice and it removes a lot of problems from the daemon writers.

In addition, since it encourages a certain design of daemons (no fork, error messages on stderr), I think it's then even easier to use those daemons from any init system.

I really think that something like this should be used in many distributions and become the standard init on linux.

Posted by PJ at Fri Apr 30 17:05:41 2010
re: history

This reminds me a little bit of djb's daemontools thing.  And also of Richard Gooch's Bootscripts ca. 2002 ( http://www.safe-mbox.com/~rgooch/linux/boot-scripts/index.html ).

You seem to have taken the next step, however, and got services essentially autoconfiguring their own dependencies, which is awesome.

re: 'shell scripts are slow'
As Grahame Bowland points out above, the advantage of shell scripts is ease-of-debugging/modification.  I see a few options:
* move them to some dynamic language like python or groovy or something where they can be compiled and so run faster
* provide stripped-down versions of the common shell utils (awk, sed, etc) as builtins to the shell that fall back to calling the full version in complicated cases.  So the simple "sed 's/foo/bar/;'" case could be optimized into a shell builtin, thereby saving a process-spawn.

Also re: startup tools, have you looked at start-stop-daemon ?

Posted by nine at Fri Apr 30 17:36:11 2010
It's not an issue for SSD drives which will surely replace disk drives sooner or later, but: doesn't starting a whole bunch of daemons at once end up spawning a lot of IO requests, causing your disk to spend a whole lot of time on seeking overhead rather than actually reading/writing data?

Posted by Davide Repetto at Fri Apr 30 17:38:19 2010
Very interesting stuff indeed. As usual you rock Lennart!

I understand there may be an option to automatically shutdown seldom used services, do you envision a simple time-out or are you going for a self adjusting timeout?

Posted by sjansen at Fri Apr 30 17:50:08 2010
@nine That's the kernel's responsibility. Perhaps it was a valid concern a couple decades ago, but today it makes sense to design a system that takes advantage of Linux's high quality IO schedulers.

Posted by James Mansion at Fri Apr 30 18:30:01 2010
I don't think Solaris SMF is really the only other major system that handles dependencies and on-demand startup.

You want a system where you can say 'always start A' and 'start B,C on demand' and 'C depends on B' so starting C will start B?  Windows does this.

Posted by Paul Jakma at Fri Apr 30 18:53:02 2010
The process group stuff sounds very close to the contracts stuff put in place in Solaris for SMF. Just in case you're interested in looking over there.

Posted by Robert Szalai at Fri Apr 30 19:18:32 2010
As a personal opinion I very much like this idea. Am I right thinking that to use this to full potential one would need to modify the daemons? Also would it imply that properly written daemons won't need any init scripts? I pretty much dislike the idea of having scripts altogether, the daemons could just read their configuration files. Just wondering why would this be unfeasible, could someone enlighten me, in case I'm too hopeful?

Posted by Anonymous at Fri Apr 30 20:29:48 2010
Very impressive architecture.

I agree with your complaint about shell scripts, and at the same time I want to preserve the configurability they provide.  One crazy notion that might work: what if you use a compiled language like Vala to write the startup scripts, keep the Vala source as the canonical location, compile a binary from the source, and use make-style logic to decide if you need to recompile from the source?  With sufficient library support from systemd, vala should prove nearly as comfortable as shell, but you end up with a fast compiled binary to run, and in-process handling of things like string operations.

(If you want to avoid process startup times entirely, you could compile all the Vala configuration files into a single binary with various modules/functions/etc.)

Posted by Peter Lister at Fri Apr 30 20:31:25 2010
Damn good thinking.  As a sys admin I have hated sysvinit (and the crap that app authors and distributions put in it) for years.

Can you expand on what you think should happen at suspend / hibernate? And what happens for hot-swap hardware?

It seems to me that power-up, suspend/resume and discovery/insertion/removal of hardware are all general events that should be reacted to correctly.  The discovery of a storage medium and the filesystem(s) on it, the subsequent mounting and the starting of appropriate services are essentially the same whether it's /home on a SCSI disk detected at start-up and mounted so that logins can happen or just my inserting my MP3 player to have its podcasts updated.

Posted by Eric Moret at Fri Apr 30 21:50:08 2010
I love everything I read so far. There is one thing I ought to mention though! In the same vein as Polyp Audio (later renamed to Pulse Audio), you should be aware that System D has a somewhat negative meaning in French. See the Wikipedia entry on System_D.

Posted by Colin Guthrie at Fri Apr 30 22:01:07 2010
Awesome work. I now forgive you for spending time away from #pulseaudio :p

My two major problems with this article:
1. It's very biased towards coffee. I am a tea drinker you insensitive clod!

2. PID 1 is a silly name. You should have called it PID v2.0 like all the cool kids do on the web!

I can't think of any real/sensible criticism so I'll shut up now.

KUTGW as always :)

Col

Posted by Anon at Fri Apr 30 22:15:19 2010
Just when the last init replacement fell the init replacement war starts back up again!

I'd just like to see people standardise on one ideally but any idea if ChromeOS or MeeGo would benefit from this?

Posted by Ahmed Kamal at Fri Apr 30 22:34:00 2010
Wow, quite the read. Extremely impressive design and analysis. Please keep the informative posts coming, Lennart. And please keep pushing Linux forward :)

Posted by Dieter_be at Fri Apr 30 22:51:33 2010
very interesting read.
I don't think shell scripts are bad though. Sure they are slower and cause bumps in your pids, but they are so easy to hack on.  I think that's the most important.

Posted by Colin McCabe at Fri Apr 30 22:56:29 2010
Looks good so far!

Is the /sbin/service and chkconfig interface going to change with systemd?

Posted by Richi Plana at Fri Apr 30 23:14:09 2010
OMG! Finally!! Several people in the past (myself included) have brought up the idea of implementing system startup in a smarter way (only starting services that are needed and dynamically starting things à la xinetd), but would always get shot down with all sorts of excuses or the infamous "code it yourself" remark.

Thanks for starting this! Hope things go far.

Posted by anonymous at Sat May 1 00:39:14 2010
I've not read through all of this yet, but want to suggest haskell as a possible shell script replacement. haskell is a language with precise semantics - that translates to very tight control of state and could enable very succinct specification of shell script behaviour. you can really use it, it is very easy to understand at its core (lambda calculus). you could connect with that community, they are very clever i suppose and the code is easy to read if what it describes is "boring" or easy. it might just be perfect. speed is in the same league as C, I think it will use LLVM very soon as well. just have a peek and look at some (easy) code examples!

And there are already replacements for grep, regular expressions and stuff like that to be found in the package repository at hackage.haskell.org , albeit maybe not perfectly structured.

this is just a suggestion :=) take it for what it's worth...

Posted by sztanpet at Sat May 1 00:44:40 2010
I was wondering if it would add any value to have Lua as the configuration format. It might be overkill, but having a fully fledged scripting language might come in handy.

Posted by Claes at Sat May 1 01:27:57 2010
Very interesting. When you eventually start to design the scheduling functionality (cron "replacement"), please consider applying iCalendar semantics (RFC2445) to scheduling rules.

Posted by Richard at Sat May 1 06:14:16 2010
You make a good point about shell script inefficiency (repeated calls to grep,awk,cut etc).

Why not have a slightly larger bash (let's call it "busybash" in reference to busybox) that has some of these built in?

Bash already provides builtins such as echo, kill and test - why not expand the range to include grep,sed,ls,mv  and a few others.

(Bash does have support for loading extensions, but that's not really the point here)

Posted by codebeard at Sat May 1 06:16:20 2010
@ People suggesting replacing shell scripts with python/vala/haskell/whatever

If the goal is to retain easy debugging as /bin/sh provides, then replacing the scripts with another language is not going to achieve that. Part of the reason that shell scripts are so easy to debug and understand is that they are written in a very simple language that 90% of unix administrators can read and write. Replacing them with scripts written in your favourite scripting language, no matter how easy it seems to you personally, is bound to reduce the ease of debugging.

Actually, I am confident that most of the shell scripts can be removed without losing easy debugging with systemd. Here's why:

A quick survey of the init scripts on my system shows six main functions (here ordered from most common to least common):
1) Process control (writing/checking PID files, signalling daemons)
2) Setting environment variables and daemon arguments
3) Checking to make sure certain requirements are met (kernel modules, other services, file existence etc)
4) Setting up a working environment (creating special files, setting SELinux contexts and file permissions)
5) Waiting and checking to see if something has completed or is running correctly, handling timeouts etc.
6) Saving/loading states on shutdown/startup

Now, the reason that many scripts can be done away with is that much of this can be handled better by systemd.

Process control (1), the most common function of init scripts, is handled by systemd. And using the systemd utilities, instead of having to hack around with PID files, we will be able to see exactly what's running and what's not and manage all of this in a consistent manner.

As for setting environment variables and daemon arguments (2), I think this one needs to be thought about more. I think it should be possible to handle it for 90% of shell scripts, but I will post another comment about that in a bit.

Checking requirements (3) can be handled in most cases by simply setting the correct dependencies for the service. For dependencies on kernel modules, this should be a defined unit in systemd.

Sometimes a script will check requirements such as certain paths for the service data etc, but I would say a lot of this is simply in the init scripts to be distro-independent and that actually many of these checks can simply be removed. That is, if I am using a modern distro with package management, the data files will always be at location X, which is also more or less guaranteed to exist if the package hasn't been messed with (otherwise all bets are off anyway). If people have moved things to non-standard places or removed config files or something, then they should be responsible for making the appropriate changes to the service definitions.

Many checks are a little unnecessary anyway, in the sense that if something is wrong, the service should die gracefully and give the appropriate error messages, instead of duplicating these checks in the init script. Where checking config file syntax or file permissions may be useful is where you want to restart/reload a service; it's better to get a message that you made a typo in the config file than for the service to shut down and then fail to start up again. So perhaps this case can be handled by having a PreReload/PreRestart parameter in a service definition for running a program/script to check things in this case.

Setting up working environments (4) should really be handled by either the service itself or by post-install scripts of the distro's package. The remainder of cases can still be handled by scripts. Setting the right file permissions and security also falls into the category of should-be-managed-properly-by-distro.

Waiting for things and handling timeouts (5) should ideally be handled by systemd. If configured to, it should try to restart a service that dies, possibly retrying a few times before giving up and putting the service in a maintenance state (like in Solaris).

Saving and loading states (6) should be handled by the service itself.

The scripts that are the real culprits for being inefficient are the ones that don't actually manage daemons but instead set up whole subsystems such as networking and file systems, with all the hacky config file grepping. It is nice to see that at least file systems will be handled natively by systemd. Maybe networks could also be handled, or perhaps systemd can be integrated with networkmanager or whatever.
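The restart behaviour described under (5) above, retrying a few times and then parking the service in a maintenance state, can be sketched roughly as follows. This is only an illustration in Python: the function name supervise, its parameters and the state strings are invented here, and nothing in it reflects how systemd actually implements restarts.

```python
import subprocess
import time

def supervise(argv, max_restarts=3, delay=0.0):
    """Run argv, restarting it on failure, and report the final state.

    Returns (state, restarts). The state is "stopped" if the command
    eventually exits cleanly, or "maintenance" if it kept failing after
    max_restarts restarts (name borrowed from the Solaris comparison).
    """
    restarts = 0
    while True:
        returncode = subprocess.call(argv)
        if returncode == 0:
            return "stopped", restarts
        if restarts >= max_restarts:
            # Give up instead of looping forever on a broken service.
            return "maintenance", restarts
        restarts += 1
        time.sleep(delay)  # optional pause before the next attempt
```

For instance, supervise(["/usr/sbin/foo"], max_restarts=2) would run a (hypothetical) foo daemon at most three times in total before reporting "maintenance".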

Posted by codebeard at Sat May 1 06:18:50 2010
As for the exact way of debugging things with systemd, I don't know how it currently works, but I assume that the following will be possible:
- A log of services that started, commands that were run, etc and what chain of dependencies, events or other relationships caused them to be started
- Trace what happened to the sockets that systemd made for a service (did the service ever take control of it, etc)
- Force the serialisation of starting some or all of the services for tracking down race conditions
- Set an arbitrary script or command to run before or after a certain service or every service. This should satisfy those people wanting to be able to hack shell scripts.
- It would be completely awesome if you could "connect to" a service which hasn't forked itself into the background, with the ability to read the program's stdout and stderr in real-time as well as possibly interact with the service through its stdin. It would be even more awesome if you could set an option in the service definition which would start the service with its own pts so that you could connect to it and interact with it using a screen-like program -- some services give you a nice debug console when you run them in a tty so this would be great.

Posted by codebeard at Sat May 1 06:20:33 2010
I think it is important that systemd have some understanding of providing things in a timely fashion. For example, if it sets up an AF_INET socket unit but the service never manages to start properly (for example, if it hangs somewhere during startup), then eventually the buffers for a UDP socket are going to fill up and incoming connections will time out for a TCP socket. On slow systems this may actually mean that trying to start every service at once will lead to intermittent failures with services trying to connect to another daemon but timing out.

For example, let's say I have a web application (in apache) that needs to connect to a mysql socket (AF_UNIX). So, systemd creates the AF_INET socket for apache, and the AF_UNIX socket for mysql, then starts both services simultaneously. Let's say my database is pretty hefty, and mysql takes 40 seconds to get everything started (keep in mind that maybe another 10 or 20 services are also trying to start at the same time, so this isn't unreasonable). In the meantime, apache has only taken 3 seconds to start, and it takes control of the AF_INET socket that systemd made for it and users are now able to connect. However, a user that connects to it just after this will get a messed up webpage with errors about a MySQL timeout since the timeout was set in PHP to 30 seconds.

Can anything be done to avoid these issues?
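The mechanics behind this question can be tried out in a few lines. The sketch below (Python, loopback only, all names invented for the example) shows that once a listening socket exists, the kernel accepts and queues a connection and its data even though no process has called accept() yet. It does not address the timeout concern raised above: the client's own timeout keeps running while the request sits in the queue.

```python
import socket

def demo_backlog_queueing():
    """Show that a client can connect and send before anyone calls accept()."""
    # The "init system" creates the listening socket up front.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(8)

    # A client connects while the "daemon" is still starting up.
    client = socket.create_connection(listener.getsockname(), timeout=5)
    client.sendall(b"hello")

    # Later, the daemon takes over the socket and finds the request waiting.
    conn, _ = listener.accept()
    conn.settimeout(5)
    data = b""
    while len(data) < 5:
        chunk = conn.recv(5 - len(data))
        if not chunk:
            break
        data += chunk

    for s in (conn, client, listener):
        s.close()
    return data
```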

Posted by anonymous at Sat May 1 06:27:48 2010
Seconding lua.

It's the ideal scripting language for booting utils:

1. Fast and portable. Shorter startup times than shell with more functionality, and fewer worries about bashisms/kshisms/sticking to POSIX.

2. Easily augmented via its C API... closer to the metal than Python and Vala.

It'd be great to run this from a stripped down initrd with only lua, glibc+libudev+etc. and a fallback dash shell.

At the very least it should be seriously considered as the config language, instead of plain text freedesktopish files that aren't as easily augmented.

Posted by codebeard at Sat May 1 06:36:14 2010
Okay, to add one final comment for now, I wanted to ask this:
Instead of rewriting daemons to use some extra file descriptors given to it, wouldn't it be possible to create the socket and then transparently hand it over to the daemon when the daemon tries to create it? It may require a kernel patch, but wouldn't that be a lot more elegant? Even legacy or closed source daemons (or open source daemons with uncooperative developers) could be made to use a socket from systemd this way.

For example, let's say that modifying service foo to use a socket from systemd is not practical. So, systemd sees that the kernel supports this feature, and creates a socket /var/run/foo/foo.sock before starting the foo service and informs the kernel about it. As the foo service initialises, it makes the syscall to create /var/run/foo/foo.sock, and instead of receiving an "already exists" error, it will transparently be given the socket already made for it by systemd. As far as the foo service is concerned, it had just made the socket, when really it was made by systemd. They all live happily ever after.

Is there some reason that this couldn't work? Surely something like this would greatly reduce the amount of work that needs to be done to get systemd doing useful things on current systems.

Posted by codebeard at Sat May 1 06:52:05 2010
Looks like I missed posting one of the comments I had written earlier. Oops. Really this is my last comment for now.

There needs to be an easy way to set environment variables and daemon arguments that avoids having to run any kind of script, in any language, if possible.

In the current init script system, there is usually a file like /etc/sysconfig/blah for the blah service which is included in the init script. It will define environment variables for the service and may also be used to set certain parameters on the command line for the daemon. It would be great if systemd could understand some of this. Setting these things in the service definition is not really enough, or at least it needs to be possible to override options in a file made for end-user modification. Users should not need to modify the service definitions for routine configuration. In other words, there needs to be a place that users can look in to change options for a program (e.g. which port a program runs on) without having to mess with the service definitions.

To facilitate this, perhaps systemd needs to understand a basic kind of variable, so that it can be used in a service definition.

That is, you might have foo.service:
[Service]
ExecStart=/usr/sbin/foo ${domain!} -n ${connections:10} ${debug?-d:-f} ${extra_args}

Where systemd would consult some /etc/sysdconfig/foo or something and read in any values before parsing the foo.service file.

It might look something like:
# comments blah
domain=example.org
connections=5
extra_args=--cores ${connections}

Above I use some possible syntax for these, such as ! for saying that the domain variable is not optional, the : for giving a default value if it is not set, the ? for treating the variable as a boolean (yes/1/true and no/0/false/unset) and inserting one value or another.

Of course the exact details of this would need to be considered carefully to try and cover a large range of cases (the goal might be to be able to supersede 80% of the init scripts in the default installation of some distro).
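To make the proposed syntax concrete, here is a small Python sketch of how such an expansion could behave. It implements only the commenter's hypothetical notation (! for required, : for a default, ? for a boolean choice), not anything systemd supports; the function names parse_config and expand are invented, and treating yes/1/true as "true" follows the description above.

```python
import re

_TRUE = {"yes", "1", "true"}
# ${name}, ${name!}, ${name:default} or ${name?if_true:if_false}
_VAR = re.compile(r"\$\{(\w+)(?:(!)|:([^}]*)|\?([^:}]*):([^}]*))?\}")

def parse_config(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values

def expand(template, values, depth=0):
    """Substitute ${...} references in template using values."""
    if depth > 8:
        raise ValueError("variable references nested too deeply")

    def repl(match):
        name, required, default, if_true, if_false = match.groups()
        raw = values.get(name)
        if if_true is not None:
            # Boolean form: pick one of the two alternatives.
            return if_true if (raw or "").lower() in _TRUE else if_false
        if raw is None:
            if required:
                raise KeyError(name)
            return default or ""
        # Values may themselves refer to other variables.
        return expand(raw, values, depth + 1)

    return _VAR.sub(repl, template)
```

With the example config above, the ExecStart line would expand to "/usr/sbin/foo example.org -n 5 -f --cores 5": debug is unset, so the -f alternative is chosen, and extra_args picks up the value of connections.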

Posted by Christoph at Sat May 1 10:26:31 2010
Lennart, once again you have proven to be a genius! This is a pretty long article, but everything is well explained, logical and easy to read.

I am surprised how far systemd has come by now and I think it has great potential. Looking forward to reading more about it!

Posted by Peter Lister at Sat May 1 12:36:34 2010
@codebeard

Rewriting the daemons is a good thing!

Too many daemons still have stupid amounts of command line config, or require coddling with bash.

Daemons should just start, find their configs and get on with things without holding everyone else up.  I certainly do NOT want a kernel hack just because software authors can't be bothered to improve...

Posted by Richard Brooks at Sat May 1 21:02:52 2010
Excellent work. I hope the Upstart developers see the superiority of your solution and will help adopt systemd as the new standard PID 1.

Posted by codebeard at Sat May 1 21:15:27 2010
@ Peter Lister

How is having to rewrite daemons a good thing? Most of them are perfectly fine as they are, as well as being written well for portability. Most do not require stupid amounts of command line config or huge bash scripts. If we can have all of the benefits and keep the simplicity of the system, without having to patch every daemon, then systemd can be adopted much more easily.

My proposal to have the kernel copy the already created socket when a daemon bind()'s is really no different from mounting a file system when a program does an open(). So, I wouldn't call it any more of a hack.

Here's what I envision:
systemd:
fork();
sock = socket(); [e.g. 3]
bind(sock, addr);
fcntl(sock, F_INHERIT);
exec();

daemon:
...
sock = socket(); [e.g. 4]
bind(sock, addr);
...


Now, if addr matches the addr from a previously defined socket of the same type and with F_INHERIT, then the kernel copies the appropriate data structures (including any connections already made to the socket) from the socket 3 into the socket 4, and removes socket 3. This process of searching for a matching socket is only done for processes which are marked with having at least one F_INHERIT file descriptor.
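For comparison, this is roughly what the alternative looks like when the daemon cooperates and simply uses a descriptor it inherits across exec, with no kernel changes. A sketch in Python under stated assumptions: the environment variable DEMO_LISTEN_FD is a made-up name for this example, and the "daemon" is just an inline child script.

```python
import os
import socket
import subprocess
import sys

# The "daemon": adopts the inherited listening socket instead of calling bind().
CHILD = r"""
import os, socket
fd = int(os.environ["DEMO_LISTEN_FD"])   # hypothetical variable name
listener = socket.socket(fileno=fd)      # family/type are detected from the fd
conn, _ = listener.accept()
conn.sendall(b"served by child")
conn.close()
"""

def demo_inherited_socket():
    # The "init system" creates and binds the socket before starting the daemon.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    fd = listener.fileno()
    env = dict(os.environ, DEMO_LISTEN_FD=str(fd))
    # pass_fds keeps this descriptor open (same number) in the child.
    child = subprocess.Popen([sys.executable, "-c", CHILD], env=env, pass_fds=(fd,))

    client = socket.create_connection(listener.getsockname(), timeout=10)
    listener.close()  # the parent no longer needs its copy
    client.settimeout(10)
    reply = b""
    while True:
        chunk = client.recv(64)
        if not chunk:
            break
        reply += chunk
    client.close()
    child.wait(timeout=10)
    return reply
```

The cost, as discussed in this thread, is that the daemon has to be changed to look for the inherited descriptor rather than creating its own socket.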

Posted by Adam York at Sun May 2 00:50:08 2010
Sounds great.  My only worry is that this will take you away from Pulseaudio development.  Are my worries justified?

Posted by sam at Sun May 2 01:25:58 2010
To the commenter who suggested systemd was a bad name:

From the wikipedia entry:

System D [in French, Système D] is a shorthand term that refers back to the French word débrouillard[1]. The verb se débrouiller means "to untangle." The basic theory of System D is that it is a manner of responding to challenges that requires one to have the ability to think fast, to adapt, and to improvise when getting the job done.

That sounds just about perfect to be honest "untangling the boot process" - yes please.

Posted by horse at Sun May 2 06:20:48 2010
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." --Brian Kernighan

Posted by Dude at Sun May 2 07:39:37 2010
WOW! Let's get this baby mainstream. Hopefully it will be easy to adapt to any distro.

Posted by Charles at Sun May 2 17:11:24 2010
@codebeard

Such a thing could be accomplished without resorting to a kernel hack, by using an optionally enabled LD_PRELOAD hook instead. Interesting idea...

Posted by David Björkevik at Mon May 3 01:06:26 2010
If the session manager would start using cgroups to kill off all the users' processes on session end, will this not break screen(1)?

Posted by Diego Calleja at Mon May 3 01:50:35 2010
Shell scripts are slow, yes, but nobody has been able to prove they are a big bottleneck when booting. Maybe systemd will be so fast that bash will become the bottleneck, who knows. But until then, this "shell scripts are bad" attitude doesn't really make a lot of sense. There are more important things to do than rewriting bash scripts in C, IMHO.

Posted by Peter Götz at Mon May 3 02:49:43 2010
Lennart,
this looks quite interesting! I downloaded, built and installed your code on Ubuntu 10.04. It starts systemd as I can see, but I get the following error:

Failed to mount /cgroup/debug: No such file directory.

I'm new to control groups. Any obvious hints for what I'm doing wrong? Thanks in advance!

Posted by codebeard at Mon May 3 03:40:25 2010
@ Charles

Actually, an LD_PRELOAD hook wouldn't work because bind()/etc are system calls, not part of a library.

Posted by deitarion/SSokolow at Mon May 3 04:23:53 2010
@Dude: Careful. Aside from the whole "fix a broken part by wrapping it rather than replacing it" aspect, PulseAudio's biggest problem was distros adopting it before it was mature enough.

Posted by vitaly at Mon May 3 10:14:06 2010
For lean, clean, portable and reliable service initialization, see perp, "the perpetrator":

  http://b0llix.net/perp/

Posted by Sherman T Potter at Mon May 3 13:45:39 2010
I used to think  PID 1 was always init. Things change.  We have new Solaris servers at work. I found out the init process on Solaris virtual zones could be ANY PID number.

Posted by Chirs Carpenter at Mon May 3 15:13:24 2010
Aren't we basically heading toward a microkernel here? We're abstracting it to where all the services are controlled by one central process (kernel?) that watches everything and reboots anything that crashes, etc. It definitely sounds an awful lot like a microkernel (not saying this is a bad thing). However, maybe we should start taking another look at GNU Hurd?

Posted by owczi at Mon May 3 15:14:35 2010
This is definitely the way forward. You once helped to sort out the mess the Linux sound servers were, now it's time to clean up another area - way to go! Of course - as long as the final solution is well balanced (running services vs. on demand services) and we don't get into the situation Windows has been in for years now: you log on and you see your desktop - which gives you the impression that it's ready to work, while it will be loading services for the next minute or so before you can actually use the system. This is the GUI I'm talking about, but it does rely on system services that need to be running.

Posted by Tel at Mon May 3 15:45:41 2010
You are perfectly correct about upstart's event driven system being the exact opposite approach to what it should be. One of the big problems with upstart is that if something is wrong with the system (e.g. some important component in the chain is missing) then upstart can't tell you anything useful.

For example, you expect that FOO should be running so you type:

# initctl status FOO

And all it can tell you is that FOO is stopped or waiting. Won't tell you WHY, which is what you really need to know. This gets even worse because some of the events can be given arbitrary names that have nothing to do with the package that provides them, and nothing to do with the program that might be running. As a consequence, if the thing is waiting for one of these type of events, even when you know exactly what event you need, you still don't know how to make that event happen, or which extra package you might need to install in order to make it happen.

Even when the upstart system is working correctly, you still cannot ask for status information about events that have been emitted in the past. That's because events are not actually jobs, you can only ask for status about jobs.

----

Finally, on a completely different issue, if you get a script-heavy system starting all sorts of daemons and you boot that system on VirtualBox a few times, you will see that it boots amazingly fast. To me, this suggests that CPU is not even slightly the bottleneck in the boot process, so you can forget about looking for optimizations in running grep fewer times. I'm 80% sure that the reason VirtualBox can reboot a machine so fast is disk cache, implying that what really matters for fast boots is how many disk files you touch -- doing stuff in parallel is pretty much a waste of time when disk is throttling you.

Shell scripts are very bad (especially nested scripts) for forcing a lot of seeks because the shell just reads a line at a time (presuming your system is not smart enough to actively readahead).

Posted by mario at Mon May 3 18:48:42 2010
Sounds serious. Upstart is a nice start, but not really easy to understand. Dependencies are never visualized, and if it's broken (mostly dbus and udev), then a typical user is lost.

With systemd standardizing a lot more features, this might be less of a problem. But still there seems to be much wiggle room for even more fragility of the Linux boot process. Automatisms are welcome, but not at the price of complexity and transparency. If there is too much self-righteous "intelligence" in any system service and hence superimposed user control restrictions, this is detrimental to usefulness.

So, I look forward to giving this a shot once available. However I have this bad feeling it will come at a price as well. (At least if Ubuntu developers package it.)
So give us somewhat more helpful man pages, and proper control to work around builtin tool "intelligence" when necessary. Don't hide more inner system workings in the opaque dbus realm. Don't want to debug boot failures anymore. TNX

Posted by Enrico Tassi at Tue May 4 00:12:06 2010
You should definitively call your software "pid1".

Posted by Anonymous at Tue May 4 16:03:22 2010
This is a sick trend. These developers with their "innovations" are not only trying to ruin stuff that works, but also to steal my precious time.

They think I have nothing to do but to learn their heroin trips every year.

There is stable knowledge and a stable tool. Such a big gain! There is literature, there are courses, a great amount of material about these stable tools. And now another moron tries to cancel it all. What does he expect from us? A "thank you"?

Posted by alex jumba at Tue May 4 16:59:18 2010
great work @lennart, systemd seems poised to solve some fundamental problems.

Just an observation/comment. By your admission, hardware and software are dynamic and changing constantly during runtime. Does it make sense to add usage patterns to that list? Given that, does it seem appropriate to allocate resources (e.g. I/O nice) to processes during spawning alone and let them maintain those resource levels during their lifetime, even though usage patterns, just like the hardware and software, change? If the overall goal is to let the machine determine stuff by itself without much manual intervention (e.g. as with what systemd does with dependency graphs), wouldn't it be appropriate for it not only to monitor hardware and daemons as it does currently (ingeniously BTW), but also to do dynamic resource management (resource = I/O/CPU nice etc)? This is what I understand an INIT 1 system to be: a process manager (which the kernel has "outsourced" to), much like pulseaudio/jackd is for sound and X/kwin for display.

What am I getting at? There has been discussion lately about interactivity of (recent) Linux kernels, stirred mostly by BFS/BFQ. Your explanation about wanting other apps to get init-like powers brought up an idea. Since it is these "servers" (pulseaudio/X etc) which know which apps are in focus and thus are actively being used by the user, wouldn't it be great if some of these "powers", e.g. renicing processes, were exercised by these servers (either directly or through giving advice/hints to systemd)? This way, even resource allocation becomes dynamic (e.g. determined by the user's usage patterns), which would solve the problems with latency and such (I remember Windows has a setting for giving priority to foreground/background processes). This way, you don't have to always set slocate's I/O priority to be lowest, you don't even have to set it AT ALL, and the system will automagically adjust itself to the workload.

Posted by Nazo at Wed May 5 00:24:26 2010
IMHO all major daemons will be in kernel space in the future, for the sake of throughput, latency and low power consumption. I believe userspace daemons are completely replaceable by kernelspace daemons.

IIRC the kernel can use SIMD (the RAID driver uses it) and floating point (kernel_fpu_begin/end) with a slight task-switch slowdown, like a userspace application. The kernel can also use userspace memory and preemptible threads, and can execute userspace applications.

IIRC bypassing the MMU is about 1.5x faster on x86. Some minor architectures may have problems because they have different operation sets in kernel and user space. I don't know whether this is critical or not. But there are already some in-kernel daemons like knfsd.

I want to see the fastest possible in-kernel implementations of init, udevd, modprobe, mount, fsck, dbus and... anything!

References:
Unleashing SSL Acceleration and Reverse-Proxying with Kernel SSL (KSSL)
ttp://www.coresecuritypatterns.com/blogs/?p=1389
Kernel  D-Bus
ttp://www.mnementh.co.uk/home/projects/collabora/kdbus
TUX web server
ttp://en.wikipedia.org/wiki/TUX_web_server
[RFC] Unify  KVM kernel-space and user-space code  into a single project
ttp://www.gossamer-threads.com/lists/linux/kernel/1202521
Kernel APIs, Part 1: Invoking user-space applications from the kernel
ttp://www.ibm.com/developerworks/linux/library/l-user-space-apps/index.html
Re: ABI change for device drivers using future AVX instruction set
ttp://kerneltrap.org/mailarchive/linux-kernel/2008/6/28/2285574/thread

Posted by PaulWay at Wed May 5 05:42:22 2010
This looks great!  I really like the idea of deferred start-up of services, combined with an xinetd-style socket holder.

One thing that would be interesting to look into further would be to then shut down services after they go through some period of inactivity, or when pressure from other services for resources goes over a threshold.  If we've got a machine that only gets SSH connections once a week or so, why not shut down SSH after an hour or a day and give that memory to the database or web server.

Obviously this is a different use case from the initial purpose of systemd as you've stated - which is to speed up startup by not starting things until we really need them.  But I see it as an equally valid purpose for systemd, and it already takes care of some of this stuff for suspend and resume.

Alternately, maybe when the system is idle after starting up we could start those services that had been deferred?  If we want to be able to SSH into the machine with little delay, we could actually start up SSH so that it's ready to go.  Then swapping can handle the memory pressure problem above, as it currently does.  We've allowed critical things to start up as fast as possible, but we've still got the current level of responsiveness after the whole process is done.

(Which reminds me unpleasantly of Windows' way of allowing the user to log in while the system processes are still starting up, providing the illusion of quick start-up with the pain of discovering that it's a terrible lie for those users that don't leave their computer to boot for five minutes.  But I think the tactics above are better than that.)

Have fun,

Paul

Posted by Aaron at Wed May 5 08:32:30 2010
Awesome work! I've been envisioning something like this for a long time... ever since I first sat down with Red Hat 6, actually.

I think a project like this is one more step toward the unified Linux Desktop that was supposed to happen so many times in the recent years.

In the coming weeks I plan to get this working on funtoo, and maybe gentoo also. I'll provide updates for anyone else who is interested in trying this as well.

-Aaron

Posted by Tim Waugh at Wed May 5 11:15:37 2010
Sounds really great.  One minor point:

"unless the machine physically is connected to a printer, or an application wants to print something, there is no need to run a printing daemon such as CUPS"

This isn't completely true.  You might well have a CUPS server on the local network providing discoverable queues and PostScript/PDF drivers for a network printer.

But really I suppose this falls into the 'physically connected to a printer' category, so would be adjusted somehow by whoever configures the system?

Posted by Anonymous at Wed May 5 16:07:58 2010
at first I really liked the idea of an inetd-like startup system. but then I got really suspicious, when reading the following points:

- udev-support in pid 1? really? can't you just make a socket with simple text-io for control and introspection?

- different unit types, templates, extended dependency and ordering support: sounds all overly complex to me

- include mounting and setting hostname. so it's not a startup system, it's a startup-mount-hostname system. what else do you want to include? what happened to the unix philosophy? the beauty of shell scripts is that they are generic. you don't need one central monolithic system to support everything. and if you claim this would speed up booting, prove it or it's not true. it sounds more like you have a hammer looking for a nail.

- someone already mentioned HURD. that's how to do it in a clean and consistent way: whenever a resource is requested, a translator is started to provide it. as linux doesn't support this, all you can do is create a huge and complex hack. in that case i think i'll stick with sysv. it has its weaknesses, but at least it's simple.

Posted by Lennart at Wed May 5 17:53:37 2010
Luiz: I don't see how powertop could be of any use in this system.

Michael: I am not sure such a swap logic really belongs in systemd. There already is an external daemon for this (http://sourceforge.net/projects/swapd/) and I am not convinced that there is reason enough to do that inside the init daemon.

Joshua: yes, we had a closer look on most other init systems, see our comments about them in the text.

John: I think that is a matter of taste. We think it it is nicer to use file name suffixes for this, as then an "ls" can give you a better overview about the units defined.

Grahame: Keeping shell in the boot process just because it can be used for debugging doesn't strike me as a good idea. Shell is not a debugging tool, it's a scripting language. We should provide proper debugging tools for the boot process instead. Example: the interactive boot systemd already provides (look for confirm_spawn= in the article above) is a very useful debugging tool since it allows you to single step through the entire boot process.

Posted by Lennart at Wed May 5 22:59:47 2010
PJ: If we move the startup logic that currently exists in the various init scripts into the init daemon or the service daemons themselves, then this will actually remove a lot of the fragility of the boot process completely. Hence I see little need to replace shell by any other language. And even with systemd it is still easy to hook some shell script into one of the services being run, should you really need to (Just add ExecStartPre=/foo/bar/waldo.sh to the .service file and this script will be run before the main daemon. You can have as many of those scripts as you wish). So summarizing this: there will be less to debug if we have this in robust C code, we should provide actual init debugging tools instead of just a shell for the purpose of debugging (and we are already doing that), and finally, even with systemd you can still hook in a shell script should you feel the need to.

Posted by Lennart at Wed May 5 23:12:41 2010
nine: systems like Fedora's "readahead" already linearize the disk seeks during startup. That is a problem orthogonal (though certainly not unrelated) to systemd.

Davide: automatic shut downs should only be done when the service itself thinks it is idle, and that is kinda hard to properly deduce from the outside. That said I tend to believe that we should not do work we don't really know is necessary. And that means that we don't do the work of shutting down something unless we have a really good reason for it. And that something is "idle" is usually not a good reason. In the end a correctly written daemon that is not being used should have a minimal impact on the system: it would be swapped out and sleep in a poll(), hence not influence the system measurably. So in summary: when doing stop-on-idle, then the daemons must do that themselves, and in many cases I'd not even bother.
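
To make the "sleep in a poll()" point concrete, here is a minimal C sketch of a daemon deciding for itself whether it has been idle. The function name wait_for_activity and the timeout are hypothetical and not taken from any real daemon:

```c
#include <poll.h>

/* Block until fd becomes readable or idle_timeout_ms elapses.
 * Returns 1 if there is work to do, 0 if nothing arrived during the
 * whole timeout, -1 on error (including interruption by a signal).
 * While blocked in poll() the process consumes no CPU time. */
int wait_for_activity(int fd, int idle_timeout_ms) {
  struct pollfd p = { .fd = fd, .events = POLLIN };
  int r = poll(&p, 1, idle_timeout_ms);

  if (r < 0)
    return -1;
  return r > 0 ? 1 : 0;
}
```

A daemon that wants stop-on-idle could call this in its main loop and exit when it returns 0; one that does not care simply passes -1 as the timeout and sleeps until a client shows up.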

James: indeed SMF is not the only system that does proper dependency management. A few systems do that. However, neither Upstart nor sysvinit do, and that's why I mentioned this.

Paul: yes, the contract stuff is very much like cgroups, however cgroups are in many way nicer, since you can name them in an fs and so on. (But I guess Solaris people disagree with that...)

Robert: yes, we'd need to patch daemons. That is explicitly mentioned in the text (look for Writing Daemons)

Anonymous: recompiling things whenever you change a bit of configuration is slow and cumbersome and requires a lot of dependencies installed. To build a vala program you need glibc, a lot of the gnome stack, gcc and more installed, something you certainly don't want to have around on a small system, just because you want to patch one configuration line. I mean, I like Vala (in fact systemd includes client tools written in Vala), but I don't think it has any place in an init system, sorry.

Posted by Lennart at Wed May 5 23:19:19 2010
Peter: the plan for system suspend is to create a snapshot, activate the unit "suspend.target" which shuts down some services via "Conflicts" dependencies and then afterwards we activate the saved snapshot again.

And hot-swap hardware should be handled like any other hardware being plugged in or pulled out: .device units are activated and deactivated for them.

Eric: Well, I am sure that everything has some meaning in some language of this world. Also, reading the Wikipedia article I got the idea that the term wasn't negative at all?

Colin: I'll keep that in mind for my next init system ;-)

Anon: see the FAQ section: we welcome every distribution that is interested in this.

Dieter_be: as pointed out above getting rid of shell scripts by no means means loss of debugging capabilities or that we make it impossible to hook in shell scripts when the admin wants to.

Posted by Lennart at Wed May 5 23:25:57 2010
Colin: we'll probably provide similar calls in systemd.

anonymous: I am not convinced that Haskell would be good in the boot process. Also see my recent comments here that we don't need a replacement for the shell in the boot process. Having good debugging tools and most of the code in the daemons themselves or the init system is a much better choice.

sztanpet: The same applies for Lua.

Claes: thanks for the pointer, I'll keep that in mind and investigate that.

Richard: bash is already slow enough, adding even more stuff into it won't make things any better. The whole approach of shell is just slow.  Whatever you do, the focus of shell is always to defer operations to subprocesses spawned off frequently. And that's just wrong. Also see my notes above regarding replacements for shells.

Posted by Lennart at Wed May 5 23:55:11 2010
codebeard: very good ideas and I agree with most of them. A few comments:

1), 2), 4), 5) are already covered by systemd.

The idea regarding exposing kernel modules as units is interesting, I need to think about that a little more. The first thing that comes to my mind though is that I am a bit afraid of creating the illusion we'd know the same dependencies between modules that modprobe itself knows. I am also not really interested in duplicating that dependency tree in any way. But yepp, I need to think about this more.

We have most of the debugging functionality in place already. There are logs, and we store away a lot of information what happened. We also offer a serialized, single-stepping, interactive boot, to track down issues. And you can hook your own shell scripts into services if you want to (see my comments above).

Your ideas regarding that screen-like pty handling would probably mean that we'd have to implement our own virtual terminal (i.e. parsing of VT100 terminal sequences and such). I am not convinced I want to have that in an init system. I think interactive services like that are the wrong approach.

What we however already support is that you can connect a service to an existing tty, such as a virtual console terminal.

Regarding the Apache/MySQL issue: If people want to avoid that they should probably just add a dependency. i.e. instead of having apache.service just depend on mysql.socket, it could be changed to depend on mysql.service, if you understand what i mean. But they should do that only locally, of course.

Regarding your suggestions to fix the kernels so that we don't have to patch the daemons: we actually investigated that in much detail, however this turns out to be really hairy to do. Because at the time of the socket() call in the daemon we don't know that a listening socket is already existing in systemd. We figure that out only at the time of the bind(), and that complicates things considerably, since we'd have two sockets existing by then which would have to become one, in all their properties, i.e. sockopts and suchlike. And that is far from easy. That said, this is certainly something we'd be happy to have in the kernel, even if we don't see that we ourselves will hack that up any time soon.
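
For comparison, the patched-daemon side can stay small. The following C sketch assumes the LISTEN_PID/LISTEN_FDS environment variable convention documented for sd_listen_fds(3), with inherited descriptors starting at 3; the function name inherited_listen_fds is made up for this illustration:

```c
#include <limits.h>
#include <stdlib.h>
#include <unistd.h>

#define LISTEN_FDS_START 3  /* first inherited descriptor */

/* Return how many listening sockets the parent passed to us, or 0 if
 * none were passed or they were meant for a different process. */
int inherited_listen_fds(void) {
  const char *pid_s = getenv("LISTEN_PID");
  const char *fds_s = getenv("LISTEN_FDS");
  char *end;
  long pid, n;

  if (!pid_s || !fds_s)
    return 0;

  /* The PID check keeps children of the daemon from reusing the fds. */
  pid = strtol(pid_s, &end, 10);
  if (*pid_s == '\0' || *end != '\0' || pid != (long) getpid())
    return 0;

  n = strtol(fds_s, &end, 10);
  if (*fds_s == '\0' || *end != '\0' || n <= 0 || n > INT_MAX)
    return 0;
  return (int) n;
}
```

If this returns a positive count, the daemon would skip its own socket()/bind()/listen() and call accept() on descriptor LISTEN_FDS_START directly, which is exactly the step that cannot be done transparently without kernel help.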

Regarding your comments about command line parameters read from /etc/sysconfig: I think daemons that rely on cmdline configuration like that are broken, and should probably be fixed to have a proper configuration. That said should it turn out that many daemons work like that we could probably add something similar to what you suggest. We'll have to see.

Posted by Lennart at Thu May 6 00:11:01 2010
anonymous: see my other comments on lua/haskell/vala in the boot process.

Richard: unfortunately they haven't yet... But I hope this too ;-)

Adam: RH pays me primarily for PA, not systemd.

Charles: the right place for that is the kernel. LD_PRELOAD hacks will always be just that ... hacks.

David: Yes, screen is an interesting point. It probably would have to be patched to get its own session which is then treated separately from the session it was created from.

Peter: You need a newer kernel probably, that enables the "debug" cgroup controller. Building systemd is not easy probably.

deitarion/SSokolow: I disagree with your assessment on PA, see my other recent blog post about that.

Chris: this has nothing to do with a micro kernel. We just pull a few things together that have previously been done at various separate places, such as init, the init scripts, inetd, mount(8) or even cron.

Anonymous: well, you are welcome to continue using a systemd-less system if you are this conservative and think this approach is so wrong.

alex: changing process properties like that from the outside at runtime is always racy. If something like this is desirable then it should probably be done in the kernel or in the daemons themselves.

Posted by Lennart at Thu May 6 00:33:54 2010
Nazo: well, I am pretty sure not many people would agree with your thoughts.

Paul: as mentioned above I don't think that shutting down sshd in the case you describe really is advisable. We should minimize the work we do, and that includes not shutting down anything we don't have to shut down. A properly written daemon that is swapped out and otherwise just hangs in poll() is not measurable in the system otherwise, and certainly doesn't take away much RAM from other processes. (and sshd is a properly written daemon like this)

And regarding your suggestions about delaying some daemon startups until the CPU is idle: that would basically mean that we'd add another CPU scheduler on top of the kernel scheduler, which I don't think is a wise idea. Hence: what you want to do we should do with the existing kernel CPU scheduler: by using nice levels and scheduling modes like SCHED_IDLE/SCHED_BATCH we can tell the kernel that some job should be delayed as long as there is something else to do. It might make sense to utilize that to prioritize things when we start things in parallel. We'll have to investigate that further.
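
As a rough sketch of what that could look like in C, assuming Linux: the function name mark_as_background_job and the nice value of 10 are arbitrary choices for illustration, not something systemd is known to do.

```c
#include <sched.h>
#include <sys/resource.h>

#ifndef SCHED_BATCH
#define SCHED_BATCH 3  /* Linux value; the header may hide it unless _GNU_SOURCE is set */
#endif

/* Ask the kernel to treat the calling process as low-priority batch
 * work. Returns 0 on success, -1 if the kernel refused. */
int mark_as_background_job(void) {
  struct sched_param param = { .sched_priority = 0 };

  /* Batch policy: the scheduler assumes the job is not interactive. */
  if (sched_setscheduler(0, SCHED_BATCH, &param) < 0)
    return -1;

  /* Additionally raise the nice value so other jobs win the CPU. */
  if (setpriority(PRIO_PROCESS, 0, 10) < 0)
    return -1;
  return 0;
}
```

An init system would call something like this in the forked child, just before exec()ing a service that is not needed early, and leave the actual ordering to the kernel scheduler.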

Tim: I know that at least for the mDNS case if we browse for a printer the replies should be available in less than a second. Also, I believe the gnome printing dialog has a live view on the printers found, right? If that is the case and all other browsing protocols are as quick as mDNS then it should be OK to start cups only when the printing dialog is opened: it might show no printer in the beginning, but after a second it should be populated fully. I'd argue that this user experience would be acceptable to the user, if he even would notice at all.

Anonymous: hmm? udev uses a very simple protocol that is mostly text-based. Or did you mean "dbus" when you typed "udev"? Well, I don't buy into dbus hatred. D-Bus is just an IPC, it is much better to use a well-known, well-analyzed, standardized and introspectable IPC like D-Bus than to have each and every single service come up with its own homegrown IPC. Also, Upstart relies on D-Bus the same way systemd does.

And systemd is called "systemd". We want to manage the system, that's why we called it that way. And setting the hostname and mounting file systems is a core part of the system, and hence we integrate it into systemd.

I don't buy into Unix philosophy. Unix is broken. It might be one of the better system designs of all those existing, but that doesn't mean it wasn't broken too. We need to fix it and improve it where this is necessary. Strict Unix traditions or POSIX compliance hold us back, and are conservatism where progress is needed. Unix can inspire, but it is unsuitable as a dogma for system design 30 years after its inception.

Posted by Anonymous at Thu May 6 03:19:04 2010
Lennart: yes I meant d-bus and i don't use upstart. and yes, unix is broken, but the philosophy is right: make the tools simple and use plain text. whenever something adheres to this, it is a pleasure to work with. it is sometimes amazing how you can use these tools for things that no one has thought about, when they were created. and they still allow you to do something in emergency situations when everything else fails. and this is still true after 30 years and will still be true in the next 30 years to come. as soon as things like d-bus, or xml for that matter, come into play, it becomes a real PITA. I could give many example from my own experience, but then that post would become very long.

Posted by Lennart at Thu May 6 03:33:28 2010
Anonymous: Good for you that you don't use Upstart. However, all distros have now switched. All big distros now use D-Bus from the beginning of the boot process on. And introducing systemd does not change that fact in any way.

Anyway, I don't believe in the Unix philosophy. Sorry for that. The discussion about Unix philosophy is mostly off-topic however and hence should not be continued here.

Posted by Tim Waugh at Thu May 6 10:58:30 2010
Re: CUPS

What I'm trying to say is that there may not be a local client.  cupsd is a network service as well as serving local clients, and so its socket may never be connected to.  Network clients are other cupsd instances (which yes, systemd may start when the user sees the Print dialog), which will just wait to hear UDP browse packets from the cupsd running on the server.  These packets are only sent once every minute or so.

I really like the system, I am just struggling to see exactly how cupsd fits it and can benefit.

Here is the system I'm worrying about:

PrinterA }
PrinterB }- server (running cupsd)
PrinterC }

cupsd on this machine has been configured to know about these three network printers, and has been told to advertise them on the local network.  This is a common situation because network printers by themselves are not always easily or consistently discoverable across the whole group.  Some may support mDNS, some may only support SNMP, etc.

On this server machine, no-one ever logs in.  When someone wants to print, they do so on their own client machine:

clientA }
clientB }- server
clientC }

All of the machines above are running cupsd.  The client cupsd instances discover the queues advertised by the server cupsd instance by listening out for UDP browse packets, which it sends periodically, about once a minute. (Yes, ideally this would be mDNS, but right now it isn't.)

So now imagine they all switch to using systemd, with no other changes.  Someone on clientA is looking at the File->Print dialog, meaning GTK+ has just connected to the local cups UNIX domain socket and started the client cupsd instance.  That will sit there waiting to hear about any network CUPS queues that are being advertised.  But nothing will start the cupsd instance on the server.  CUPS queue discovery is passive.

Even if the user in charge configured systemd to always start cupsd on the server (can that be done?), the clients will still have to wait up to a minute the first time they ever use the print dialog.

Of course, CUPS caches information about network CUPS queues so that it doesn't have to wait at all after starting if it has seen those UDP browse packets before, so subsequent File->Print dialogs won't see the same delay.

So it comes down to:

1. Can systemd be configured to always start a particular service for which it cannot know whether there are clients, such as cupsd when used in this way?

2. Even better, can it be configured to automatically discover whether a particular service needs to be "force started" like this?  For example, I can imagine a small program to read the CUPS configuration file and see if it is configured this way, and tell systemd to act accordingly.

3. As things currently stand, there will be up to a minute's delay on each client the first time they use the Print dialog.  This will only be gone once CUPS switches over to using mDNS as its primary discovery/advertisement mechanism (which is planned).

Posted by codebeard at Thu May 6 16:37:27 2010
@ Lennart

Thanks for taking the time to reply to everyone! It looks like the correspondence generated by your blog post has been considerable.

Regarding patching the kernel to copy the socket on bind(), you say that it is really hairy to do all the copying and stuff, but perhaps I am missing something. Correct me if I'm wrong, but doesn't the kernel have this functionality already, in the form of dup2()?

Here's a small test:
parent.c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void) {
  int sock;
  struct sockaddr_un addr = {AF_UNIX, "./socket"};
  unlink("./socket");
  sock = socket(AF_UNIX, SOCK_STREAM, 0);
  bind(sock, (struct sockaddr *) &addr, sizeof(addr));
  /* pretend that we just did something like:
  * fcntl(sock, F_SETINHERIT, 1);
  */
  listen(sock, 10);
  execl("./child", "child", NULL);
}

child.c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void) {
  int sock;
  struct sockaddr_un addr;
  socklen_t i = sizeof(addr); /* accept() expects a socklen_t length */
  sleep(10); /* simulate startup time */
  sock = socket(AF_UNIX, SOCK_STREAM, 0);
  /* pretend that we just did:
  * bind(sock, &addr, sizeof(addr));
  * and that the kernel checked the addr structure for
  * matches with F_SETINHERIT fds, then basically just
  * returned the following ( but with F_SETINHERIT off,
  * so that future calls to bind() will know they don't
  * have to search for things ):
  */
  dup2(sock-1, sock);
  listen(sock, 20);
  accept(sock, (struct sockaddr *) &addr, &i);
}

When I compiled and tested the above code with a sample client, it worked perfectly. dup2() must already do all the necessary locking and stuff that needs to be done to copy the file descriptor, so it's really easy. The only side-effect is that we waste a file descriptor (two fds will end up referring to the same socket), but that's really a minor issue and could probably be fixed if anyone cared.

Posted by antrik at Thu May 6 19:15:32 2010
There are some good ideas in here, and it's definitely a step in the right direction. (Especially compared to upstart.) The most important ones in particular:

- Creating sockets before launching the daemons: sounds like a very nice and useful idea :-)

- Starting services on demand: very important property missing from most other init systems. Very much like passive translators in the Hurd :-)

- Using cgroups for managing resource allocations etc. in a hierarchical manner: again, a very good approach -- very similar to the one proposed by Neal Walfield (of Hurd fame) in his Viengoos papers, see http://walfield.org

- The observation that init and session management are closely related is good too -- though it's only mentioned as an afterthought... I believe the whole init system should be built on the idea that it's really just a case of hierarchical session management.

On the negative side, there is a major contradiction between "we don't want portability" and "we'd like all major distributions to adopt it": Debian also has Hurd and FreeBSD ports -- so without portability, it's pretty much out of the question there...

I'm not saying it's bad to use system-specific features, if they really help -- on the contrary, I believe this should be done more often. (The Hurd's unique features for example are pointless, if nobody ever uses them...) However, I don't see why you wouldn't want to accept alternative implementations of various functions upstream.

Of course other systems ideally would use solutions tailored to their specific functionality (I already mentioned passive translators in the Hurd) -- but often the resources for reinventing everything are simply not there; and thus adapting existing solutions can be important. (Also, reducing transition costs.)

The ability to mix and match various components is one of the major strengths of the free software world IMHO: it facilitates competition and innovation. It is what allows the best solutions in any particular area to rise to the top. Tying solutions to particular environments prevents this.

There are some other good ideas and caveats, which I will skip here, as it would be too much. (I should blog about this, but I don't think I'll get around to it any time soon... :-( )

The real showstopper however is, "I don't buy into Unix philosophy." Ouch. Just ouch.

(Well, obviously the showstopper not being the fact that you mentioned it in a followup comment -- but rather the fact that it shows in various places in the article, and your comment confirming that this is by design...)

It's a pity to see something implementing so many good ideas, disqualified in such a manner :-(

Posted by Lennart at Thu May 6 21:02:29 2010
Tim: yes, for cases like that it is possible to start CUPS regardless of whether a local client or local hardware actually is around. It's a matter of simply adding a symlink to the .service file to some directory (instead of or in addition to the symlink to the .socket file). We probably should decide later on whether CUPS really is a candidate for on-demand loading like originally pointed out, or whether we can leave it to the user to fix the link, or whether we can teach CUPS itself to create that symlink.

Posted by Lennart at Thu May 6 21:09:17 2010
Codebeard: Well, after the bind() we'd have to return to the application a socket that is the merged version of both our systemd socket AND the socket the daemon created itself. We need to have the queued connections from the systemd socket, but all the various sockopts/fd flags/SIGIO handling/yadda yadda and so on that might have been set between the socket() and the bind() in the daemon itself. That basically means we need some non-trivial code in the kernel that can merge the fd and copy all settings over; it's more than just a simple dup(). I do believe that having something like this in the kernel would be great, but it's nothing we can hack in a couple of hours, unfortunately.
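
The lost-settings problem is easy to reproduce in userspace. This C sketch uses two unbound AF_UNIX sockets as stand-ins (one for the socket systemd created, one for the socket the daemon created itself) and O_NONBLOCK as a representative setting; the function name is hypothetical:

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if a flag set on the daemon's own socket is gone after
 * dup2() replaces it with the pre-created socket, 0 if it survived,
 * -1 on error. */
int flag_lost_after_dup2(void) {
  int pre = socket(AF_UNIX, SOCK_STREAM, 0);  /* stands in for systemd's socket */
  int own = socket(AF_UNIX, SOCK_STREAM, 0);  /* the daemon's socket() result */
  int before, after, result = -1;

  if (pre < 0 || own < 0)
    goto out;

  /* The daemon configures its socket between socket() and bind(). */
  if (fcntl(own, F_SETFL, fcntl(own, F_GETFL) | O_NONBLOCK) < 0)
    goto out;
  before = fcntl(own, F_GETFL) & O_NONBLOCK;

  /* Replace it wholesale, as a plain dup2() inside bind() would. */
  if (dup2(pre, own) < 0)
    goto out;
  after = fcntl(own, F_GETFL) & O_NONBLOCK;

  result = (before && !after) ? 1 : 0;
out:
  if (pre >= 0) close(pre);
  if (own >= 0) close(own);
  return result;
}
```

After the dup2() the daemon's descriptor refers to the other socket's state, so whatever the daemon configured is silently dropped. Carrying those settings over is the merge logic that would have to live in the kernel.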

Posted by Lennart at Thu May 6 21:20:47 2010
antrik: if some distros care about portability to non-Linux systems then they can deal with the problems that creates, I see no reason to make that my problem. If we cannot make use of the unique features Linux provides we cannot do much of what we are doing now in systemd.  One example: cgroups is at the heart of what we do. If we want to provide compatibility with other systems we would not be able to use cgroups. And that would be a big loss. Also, if you try to keep compatibility with other systems, you need to abstract the system-specific behaviour. And that adds code you need to maintain. And before you can add support for some OS specific feature you always have to abstract it. It costs a lot of time. One can certainly do that for normal applications easily, since they use only very little OS-dependent functionality. However, that is different for something as low-level and fundamental as the init system.

And I guess we have to agree to disagree on our belief in the holy grail that is Unix philosophy. If you reject everything coming from folks who didn't drink the Unix Kool-Aid, then I guess I am sorry for you.

Posted by Claes at Thu May 6 21:55:43 2010
Regarding iCalendar semantics as I mentioned above, I think not so much of the file format and the various mostly human based "event types" it discusses. I think of the way it defines scheduling in time, especially recurrence.

If systemd "understood" recurrence the same way as calendar apps do, it would theoretically be possible to plan, schedule and visualize events with existing calendaring applications.

cron applies a different system for recurrence and I can't say which is better or worse, but recurrence rules can be confusing and difficult to define. There are probably more tools that use iCalendar principles regarding this. A good design could implement both.

Posted by Walther at Fri May 7 10:45:16 2010
You started out talking about socket/dbus-activation a lot but later you talk a lot about explicit dependencies in configuration files. Do all dependencies have to be defined explicitly? Or is the intent to use mainly socket/dbus-activation and config files for the rest?

It would be really cool if systemd would detect dependencies on the first boot and would use them to start services in parallel before they are needed on the consecutive boots. (Maybe this is exactly what you are doing but I didn't get that :)

For instance: systemd starts gdm. gdm starts X through socket activation. After X has started, gdm starts LDAP through socket-activation. Which means that LDAP is started after X has completed (which is not optimal). systemd logs the activations and so on the next boot systemd starts gdm, X and LDAP in parallel before they are activated.

Posted by Lennart at Fri May 7 20:21:37 2010
Walther: yes we thought about something like that, and it would be relatively easy to do that. We'll play around with that and add it if it really turns out to have a positive effect on boot time.

Posted by Luca Bruno at Sun May 9 12:23:29 2010
Not writing scripts in vala (would be overkill with a compiler), but what about systemd itself in Vala? It doesn't need, as you said, a lot of the gnome stack; you can use it with the Posix profile (i.e. no glib).
It has great support for dbus servers, except you need dbus-glib there.
Btw good work.

Posted by Lennart at Sun May 9 14:53:40 2010
Luca: Vala is not OOM-safe (because GLib isn't). However the init daemon is one of the few pieces of userspace code that should be able to deal with OOM.

Posted by Luca Bruno at Sun May 9 23:26:35 2010
@Lennart: as I said, you can use Vala without glib

Posted by Luca Bruno at Sun May 9 23:32:51 2010
@Lennart also, now that I remember, in Glib you can change the vtable of memory setting your own allocation functions, including malloc: http://library.gnome.org/devel/glib/unstable/glib-Memory-Allocation.html#GMemVTable

Posted by Lennart at Mon May 10 00:00:35 2010
Luca: No, the code Vala generates uses GLib and GObject heavily. In fact, the Vala object model is the GObject object model. Vala is unable to generate code without GLib and there is really no reason for supporting Glib-less binaries for them.

GLib code assumes that malloc() aborts on OOM. You cannot just sneak in a non-aborting malloc() and assume all the right OOM code paths magically appear, because they don't.

Posted by Luca Bruno at Mon May 10 00:13:21 2010
Lennart, Vala 0.8.1 (and since many other releases before that), is able to emit code without using glib with --profile posix.

What does it mean that glib code assumes that malloc() aborts? Glib code uses g_malloc, which calls a vtable.malloc() and does no assumptions on that. So if you create a malloc() function that does not abort, yes, it works.

Posted by alteclanding at Mon May 10 16:24:07 2010
Why do people in the open source world keep reinventing the wheel is something I'd never understand. There's fefe's minit, it works great and I have absolutely no idea why no one uses it.

Posted by Lennart at Tue May 11 01:05:19 2010
alteclanding: Why do commenters in the open source world keep posting comments even though they obviously haven't read the story or even understood it is something I'd never understand. There's alteclanding's comments, they are nonsense and I have absolutely no idea why he's posting them nonetheless.

Luca: interesting, didn't know that. What object model are they using when compiling without gobject?

Simply making malloc() non-abortive doesn't change the fact that nothing that internally calls g_malloc() in glib actually checks for it to return NULL. An example: http://git.gnome.org/browse/glib/tree/glib/glist.c#n283 -- That's one of the most basic data structure operations in GLib, and what you can see there is that memory is allocated and that is assumed to succeed. Right after allocating, the data structure is accessed. If a malloc() implementation sometimes returned NULL there, this access would immediately cause a segfault. And that is why glib is inherently not OOM safe, and it is completely irrelevant what allocator you plug in there: the OOM handling codepaths simply do not exist. And that is actually a good thing, as I have pointed out here: http://0pointer.de/blog/projects/on-oom.html
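
The difference can be shown without GLib at all. The following C sketch is not GLib code; it only mimics the pattern described above with a hypothetical list type, next to a variant that has the missing OOM code path:

```c
#include <stdlib.h>

struct node {
  void *data;
  struct node *next;
};

/* Prepend in the style described above: the allocation result is used
 * immediately, so a NULL return crashes on the very next line. */
struct node *prepend_unchecked(struct node *list, void *data) {
  struct node *n = malloc(sizeof *n);
  n->data = data;  /* dereferences NULL if malloc() failed */
  n->next = list;
  return n;
}

/* OOM-safe variant: reports failure and leaves the list untouched.
 * Every caller now needs an error path of its own. */
int prepend_checked(struct node **list, void *data,
                    void *(*alloc)(size_t)) {
  struct node *n = alloc(sizeof *n);

  if (!n)
    return -1;
  n->data = data;
  n->next = *list;
  *list = n;
  return 0;
}

/* Allocator that always fails, to simulate an out-of-memory condition. */
void *failing_alloc(size_t size) {
  (void) size;
  return NULL;
}
```

Swapping the allocator only helps the second function. The first one has no branch in which a failed allocation could be handled, and that is the situation for code written against an aborting allocator.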

Posted by Luca Bruno at Tue May 11 10:50:57 2010
Lennart: it simply creates structs without using gobject, of course there's no reference counting... it will free an owned object as soon as it's not used. Methods and variables are glibish, i.e. my_struct_method (MyStruct* s);
It doesn't support inheritance.

For the OOM thing I've got what you mean, I thought you could have done recovery inside the custom malloc itself, but still abort if recovery fails. Clearly it's not your case reading the code.

Posted by Tobu at Mon May 17 15:42:17 2010
Here is some interesting feedback:

http://etbe.coker.com.au/2010/05/16/systemd-init/
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=580814
http://groups.google.com/a/chromium.org/group/chromium-os-dev/browse_thread/thread/d146c73e42fc0e7b

Posted by Martin Sivak at Wed May 26 12:41:56 2010
Sorry Lennart but I can't disagree more with your stance towards unix philosophy.

Having multiple tools (bash is ineffective, but that doesn't mean we have to merge half of the base system into one daemon) where each does only one task and does it right has lots of benefits.

I like the reporting and control stuff in the idea of systemd when it comes to replacing init.

But I will stop right there. Having "xinetd like" configuration of system stuff.. why not. Pid files are piece of crap I agree with that. But why on earth are you trying to replace autofs, cron, mounting, xinetd?

Especially when you are reimplementing the "verified" functionality inetd and xinetd have had for ages now.. (the same applies for cron, autofs, ...)

What we should do is improve and enhance these tools, not write one new monolithic piece of code, which will be hard to maintain, hard to review, hard to verify and hard to analyze from a security standpoint.

I would agree that starting hundreds of shell scripts is not perfect, but your solution is the opposite extreme. Starting couple of main daemons instead of the shell scripts won't affect the performance and it will still conform to the unix philosophy.

What is wrong with the following process structure?

init process (starting the main daemon set, taking care of respawning them and setting the initial environment)
|- improved xinetd for daemons
|- autofs daemon for automounting
|- cron enhanced with the proposed reporting
|- udevd taking care of module loading and device dependent spawning

They can all even use some kind of common library to simplify the common tasks... if you want extreme solution, improve xinetd with dependency stuff and make it pid1 process.

You know, there are some of us who use linux on server machines. And we want to be sure that the machine is secure and that we can disable any particular piece.. kind of hard to do when we suddenly have only one piece which does everything. Especially when you find a security bug in that one monolithic piece of code..

Posted by Lennart at Mon May 31 21:35:28 2010
Martin: I am not trying to replace the existing automount daemon. It does a lot of stuff that systemd doesn't do and will never do (i.e. read automount maps from NIS or LDAP!).

And I tried to explain why I want some automount/mount/inetd functionality in systemd. If you cannot see that, then please read the blog story again.

And arguing against reimplementation of "verified" functionality means you eventually come to a complete standstill of development.

Also, I think you are overestimating the complexity of inetd and cron a little.

Also, by no means we want to get rid of udevd. Upstart has weird plans like that. Not us.

Posted by Eero Tamminen at Tue Jun 1 21:20:28 2010
@sjansen Even if the IO side would be handled by the kernel, one still has the problem that the processes generally spend quite a lot of CPU at their startup instead of idling, and this causes the kernel to do a lot of scheduling, which adds overhead.

I think it would be better to interleave the startup a bit instead of starting potentially hundred(s) of processes at once.  (after profiling the impact of course, preferably also on a single core netbook system)


cgroups: I have some doubts about putting every started process into a separate cgroup.  That's fine as a default, but it should be possible to put multiple processes into the same group so that their resources are handled as a unit.  Otherwise one can more easily run into issues on resource-restricted systems due to resource waste when one has set e.g. separate memory usage limits on the groups.


setsid: If one gets rid of setsid(), how can one then make sure that started processes can safely kill their whole process group (to get rid of all started children, engine etc. processes) without killing the parent (like systemd...)?

Posted by Eero Tamminen at Tue Jun 1 21:22:08 2010
D-BUS: The reason why I personally "hate" dbus isn't its API, but the dbus daemon implementation and usage.  Programs do all kinds of idiotic things through it: sending data on the whole session bus instead of just control information, sending data in XML, subscribing to too many device status messages so that you get client "herd" wakeups, etc.

And the daemon implementation is pretty awful.  D-BUS buffers messages without a limit instead of blocking message spammers until the sent messages are consumed.  What makes it worse is that D-BUS memory handling is at the same time incredibly inefficient at releasing the memory it has allocated (it fragments it) and too complicated to make much sense of it from Valgrind reports.

Posted by Lennart at Tue Jun 1 22:26:54 2010
Eero: you are overestimating the price of switching tasks today...

regarding cgroups: we now put each service into its own cgroup in a private systemd-specific hierarchy (/cgroup/systemd). With a very simple config option you can optionally add the process to arbitrary other groups in other hierarchies. So what you ask for is already covered.

regarding setsid: the point of what I wrote is that systemd calls setsid() for you anyway, so you don't have to anymore, and your call will fail with EPERM if you do call it nonetheless.
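
The EPERM behaviour can be reproduced without systemd. The following is a minimal Python sketch, not systemd code: a forked child stands in for a service that was already made a session leader by its supervisor, and the function name `second_setsid_result` is made up for this illustration.

```python
import errno
import os


def second_setsid_result() -> str:
    """Report what a second setsid() call returns in a session leader.

    Returns the errno name of the failure (expected: "EPERM" on Linux),
    "OK" if the second call unexpectedly succeeds, or "" if the child
    could not even become a session leader.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process.
        try:
            os.close(read_fd)
            # A freshly forked child is not a process group leader, so this
            # call succeeds. It plays the role of the supervisor's setsid().
            os.setsid()
            try:
                # This plays the role of the daemon's own, redundant call.
                os.setsid()
                outcome = "OK"
            except OSError as exc:
                outcome = errno.errorcode.get(exc.errno, "UNKNOWN")
            os.write(write_fd, outcome.encode())
        finally:
            os._exit(0)
    # Parent process: read the child's report and reap it.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        outcome = pipe.read()
    os.waitpid(pid, 0)
    return outcome


if __name__ == "__main__":
    print(second_setsid_result())
```

A daemon that still calls setsid() itself can therefore simply ignore this particular error.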

And uh, your dbus accusations are bogus.

Posted by Joe Nall at Wed Jun 2 22:11:42 2010
What is the plan for managing socket and process selinux contexts?

Posted by Lennart at Wed Jun 2 22:45:26 2010
Joe, systemd is not the first daemon managing sockets and processes. Which means we'll do it the same way as it has been done previously for xinetd and other babysitters...

Posted by Anon at Thu Jun 3 21:56:44 2010
OK, I've read bits and pieces from all over the place about systemd so I apologise if you've answered these questions over and over.

1. systemd replaces the need for portreserve simply by design in a rather more robust fashion.
2. systemd can support dependencies but where possible dependencies should be avoided.
3. systemd has a (sysv?) mode where it starts all jobs at sysv levels? This can be used on servers or very conservative environments. It is presumably not possible to mix "implicit" mode with sysv mode?
4. Virtual dependencies are to be strenuously avoided. There will not be support on waiting on ntpdate/forced system time (this is considered to be a non-problem). There will not be support for waiting on all normal jobs finished/GUI idle after boot/start cupsd now.
5. udev events can be turned into dependencies. bluetoothd depends on the kernel having sent a bluetooth udev event at some point in the past? What about when the dongle is removed?
6. The "screen killed on GUI logout" is an unrealistic problem or will be manually solved by modifying screen?
7. Circular dependencies (A waits on B waits on A) are a non-problem or would be a problem anyway.

I'm curious as to when things like Xorg are started - do things like gdm use enough sockets that this is basically handled implicitly?

Posted by Lennart at Fri Jun 4 02:21:29 2010
Anon: 1, 2, 7 are not really questions but yes, you are right on those.

Regarding 2: for normal daemons dependencies are not really necessary. For stuff involved in early boot or late shutdown they are more likely to be needed though. The result of that is that OS vendors are probably the only ones having to deal with deps in the systemd scheme, and packagers and 3rd party vendors won't.

On 3: There is no separate mode for SysV scripts. We simply consider the SysV dirs an additional configuration source. You can mix SysV services with native services as you wish, and distros are expected to do just that during their transition period from sysvinit/upstart to systemd.

On 4: you can use dependencies if you want. We don't suggest you use them for normal services though. But there's nothing that would stop you from ignoring us.

On 5: systemd won't ever shoot down daemons due to idleness, simply because it is very hard to figure out what "idleness" means from the outside of a daemon. We also believe that we should minimize the work done, and hence think that a correctly written daemon that is nominally running but effectively just swapped out and hanging in a poll() is nicer than constantly stopping and restarting services.

On 6: It's an option under the control of the administrator, whether he wants to allow stuff like screen to work, or not. In a university workstation environment he might choose to kill all the user's processes if the user logs out. On private systems he might want to allow that. We support both schemes, and leave it to the admin to choose. The default will be to allow screen however.

Posted by Lennart at Fri Jun 4 02:22:46 2010
Anon: X11 is actually difficult, since its port numbers are dynamic. That said, MacOS actually starts its X server as soon as a connection is made on port 6000. We could use the same scheme.

Posted by Will at Sat Jun 26 20:03:20 2010
This looks like it would obviate the need for in-house proprietary unix job management daemons like AOL's venerable "samon".  Also, I like the idea of having a uniform method for stopping and restarting services. PID 1 is the perfect place to put this effort.  Thank you.

Posted by pada at Sun Jun 27 03:59:30 2010
In order to calculate the dependencies of kernel modules, I'd suggest to make use of modprobe's intelligence by executing
modprobe --list
modprobe --show-depends <module_name>
and use the output as an additional configuration source, as systemd already does with LSB headers from init scripts.

That way, systemd won't need to know about any modprobe configuration files, but will be able to figure out the right moment to load a kernel module and whether a module needs to be loaded at all.

One problem I see here is the time required to execute modprobe. Module dependency information should be cached and not determined on every single boot, but only on "depmod -a" events.

A different approach would be to use /lib/modules/`uname -r`/modules.* directly as an additional configuration source, but then systemd would be required to parse these files. Is there some standard for the syntax of these files?
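
For what it's worth, here is a minimal Python sketch of the second approach. It assumes the commonly seen modules.dep layout written by depmod, one line per module of the form `path/to/module.ko: path/to/dep1.ko path/to/dep2.ko`; that layout is an assumption here and should be checked against the depmod documentation. The function names `module_name` and `parse_modules_dep` are made up for this illustration.

```python
from typing import Dict, List


def module_name(path: str) -> str:
    """Strip the directory and the .ko suffix (plus any compression suffix)."""
    base = path.rsplit("/", 1)[-1]
    return base.split(".ko", 1)[0]


def parse_modules_dep(text: str) -> Dict[str, List[str]]:
    """Parse modules.dep-style text into {module: [direct dependencies]}.

    Assumed line format: "<module path>: <dep path> <dep path> ...".
    Blank lines, comment lines and lines without a colon are skipped.
    """
    deps: Dict[str, List[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        module, sep, rest = line.partition(":")
        if not sep:
            continue
        deps[module_name(module)] = [module_name(p) for p in rest.split()]
    return deps


if __name__ == "__main__":
    sample = (
        "kernel/fs/ext4/ext4.ko: kernel/fs/jbd2/jbd2.ko kernel/fs/mbcache.ko\n"
        "kernel/fs/mbcache.ko:\n"
    )
    print(parse_modules_dep(sample))
```

Because the file only changes when depmod runs, parsing it directly would avoid forking modprobe during boot.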

Posted by Mark J at Wed Aug 4 08:07:42 2010
The majority of the details of this are a few college courses over my head.  But listening to your explanation of it on the Linux Outlaws podcast it was fairly easy to understand and generally sounded like an awesome idea.  So I just wanted to applaud your hard work!

Posted by bochecha at Mon Aug 23 11:20:20 2010
Thanks, this series of articles will no doubt be very interesting. :)

About this one, I don't really get the LOAD, ACTIVE and SUB columns.

As I understood it, the first one indicates whether a unit configuration was loaded or not into systemd. But if it wasn't loaded, then it would not appear in the output of systemctl, right?

You say that ACTIVE is a high-level generalization of SUB. In this case, why is that necessary? Isn't SUB already enough information?

Maybe if you could give the list of the possible values for each column then that would help me understand the differences. :)

Or maybe just point to the appropriate documentation if that is all already documented somewhere, I must admit I haven't had the time yet to look at Systemd as closely as I wanted.

Posted by Lennart at Mon Aug 23 11:35:34 2010
bochecha: well, there are many reasons why a service might show up as failed to load in the systemctl output: for example, it was referenced as a required dependency of another service, but we could find neither a native service definition file nor a SysV init script for it. Or, there was a parsing failure while reading it. Or, because the file was incomplete. And that might even happen while a service is active, for example, because the user requested a configuration file reload from systemd after changing a service file, and a service that is already running suddenly has an invalid configuration file. That effectively means that the LOAD and the ACTIVE state are mostly orthogonal: you may have a running service where configuration loaded fine, you may have a stopped service where it loaded fine, but you may also have a running service where configuration failed to load.

And yes, ACTIVE and SUB show you the same information, though ACTIVE in a more generalized form. While SUB has states that are specific to each unit type (e.g. "running", "exited", "dead" for services; "plugged" and "dead" for devices; or "mounted" and "dead" for mount points), ACTIVE exposes the same high-level states for all units.

We only distinguish 6 ACTIVE states (to list them: active, reloading, inactive, maintenance, activating, deactivating), which are mapped from the lower-level states, which might be many more. For example services have 15 low-level states: dead, start-pre, start, start-post, running, exited, reload, stop, stop-sigterm, stop-sigkill, stop-post, final-sigterm, final-sigkill, maintenance, auto-restart.
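
The relationship between the two columns can be pictured as a lookup table. The Python sketch below uses the state names listed above, but the specific pairing of each SUB state with an ACTIVE state is an assumption made for illustration (it is not spelled out in this thread), and `active_state` is a made-up helper name.

```python
# The six high-level ACTIVE states shared by all unit types.
ACTIVE_STATES = {
    "active", "reloading", "inactive",
    "maintenance", "activating", "deactivating",
}

# Assumed mapping of the 15 low-level service SUB states to ACTIVE states.
SERVICE_SUB_TO_ACTIVE = {
    "dead": "inactive",
    "start-pre": "activating",
    "start": "activating",
    "start-post": "activating",
    "running": "active",
    "exited": "active",
    "reload": "reloading",
    "stop": "deactivating",
    "stop-sigterm": "deactivating",
    "stop-sigkill": "deactivating",
    "stop-post": "deactivating",
    "final-sigterm": "deactivating",
    "final-sigkill": "deactivating",
    "maintenance": "maintenance",
    "auto-restart": "activating",
}


def active_state(sub_state: str) -> str:
    """Return the generalized ACTIVE state for a service SUB state."""
    return SERVICE_SUB_TO_ACTIVE[sub_state]


if __name__ == "__main__":
    for sub in ("running", "exited", "stop-sigterm", "dead"):
        print(f"{sub:14} -> {active_state(sub)}")
```

Tools that only care whether a unit is up can look at ACTIVE alone, whatever the unit type; SUB is there for the type-specific detail.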

Posted by John Drinkwater at Mon Aug 23 12:23:36 2010
Why ‘systemctl status ntpd.service’ and not ‘systemctl status ntpd’?
Why does systemctl display names like ‘getty@tty2.service’ and not as ‘getty@tty2’ ?

Do we really need to have .mount, .service, etc on all our config files now?
IMO, horrible to have file extensions, equally to have them as long as the file name.

Posted by Lennart at Mon Aug 23 13:36:52 2010
John, we support different kinds of units. We manage sockets, mount points, services, devices, automount points, timers, paths, targets, swap files/devices and snapshots with the same tools, with the same commands. For example "dbus.service" and "dbus.socket" are both used by the D-Bus system, but can be controlled and introspected independently. To distinguish them, we hence write their full name everywhere, so that you explicitly state that you mean the D-Bus socket instead of the D-Bus service, or vice versa.

Also, I actually find this one of the pretty things in this design: the unit names are actually identical to the file names they are configured in.
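
To make the naming scheme concrete, here is a small Python sketch that splits a full unit name into its parts. Only ".service", ".socket" and ".mount" appear in this thread; the other suffixes are assumed to follow the unit kinds listed above, and `parse_unit_name` is a made-up helper, not a systemd API.

```python
from typing import Dict, Optional

# Suffixes assumed to correspond to the unit kinds named in the comment above.
UNIT_TYPES = (
    "service", "socket", "mount", "automount", "device",
    "timer", "path", "target", "swap", "snapshot",
)


def parse_unit_name(unit: str) -> Dict[str, Optional[str]]:
    """Split e.g. 'getty@tty2.service' into name, instance and type.

    Raises ValueError when the type suffix is missing or unknown, which is
    why a bare 'ntpd' is ambiguous: it could be a service, a socket, ...
    """
    base, dot, suffix = unit.rpartition(".")
    if not dot or not base or suffix not in UNIT_TYPES:
        raise ValueError(f"not a full unit name: {unit!r}")
    name, at, instance = base.partition("@")
    return {"name": name, "instance": instance if at else None, "type": suffix}


if __name__ == "__main__":
    for unit in ("dbus.service", "dbus.socket", "getty@tty2.service"):
        print(unit, "->", parse_unit_name(unit))
```

With the suffix in place, "dbus.service" and "dbus.socket" share a base name but resolve to two distinct units.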

Posted by Shane Falco at Mon Aug 23 14:19:27 2010
I'm with Mr. Drinkwater on this.  Extensions (especially long extensions) are one symptom of a bad design.  All this feels very rushed and hacked together.

It looks like this core systemctl function won't display cleanly in a standard 80 character wide terminal?  Are we trying to change linux so much that we no longer care about those sorts of things?  It may be different for gnome developers, but unix admins I know have lots of windows open and usually they're 80 characters wide.

Finally, why choose a name so close to another common utility?  systemctl?  Seriously?  When another core system utility called sysctl already exists?

Posted by Lennart at Mon Aug 23 14:26:44 2010
Shane, I am sorry but I guess we just have to agree to disagree to this. The points you raise are in the category "matter of taste" or even "bike shedding", and so I guess we should leave it as that.

systemctl shortens the output depending on the terminal size. If you use a tiny terminal, the description string might even be suppressed entirely. The bigger your terminal/screen is, the more output we can stick on it. That should not surprise anybody. Or to put it in other words: we support 80ch terminals just fine, but if you use bigger terminals we'll make use of the extra space.
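
As a rough illustration of that behaviour (a sketch only, not systemctl's actual layout code), the following Python function always prints the unit name in full, shortens the description to whatever room is left, and drops it entirely when fewer than four columns remain. The function name `format_row` and the four-column threshold are assumptions.

```python
import shutil


def format_row(unit: str, description: str, width: int) -> str:
    """Fit '<unit> <description>' into at most `width` columns.

    The unit name is never shortened; the description is truncated with
    '...' or suppressed entirely when there is no room for it.
    """
    room = width - len(unit) - 1  # columns left after the name and a space
    if room < 4:
        return unit
    if len(description) > room:
        description = description[: room - 3] + "..."
    return f"{unit} {description}"


if __name__ == "__main__":
    # Falls back to 80 columns when the size cannot be determined.
    columns = shutil.get_terminal_size().columns
    print(format_row("ntpd.service", "Network Time Service", columns))
```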

Posted by Shane Falco at Mon Aug 23 14:49:26 2010
Sounds reasonable and I appreciate the response.  It looks like you are taking your own personal experience (which is all anyone can ask) and creating something that you think is appropriate.  But I fear that you don't really see the bigger picture of unix admins out there...there are a lot of guys I work with who are junior/middle guys who just work for a paycheck.  They're not linux geeks.  I dare say they're the majority.  They could be doing AIX or Solaris or linux for all they care.  I think they're going to have trouble with systemd.  It just does too much and it's too baroque.  Too confusing.

I finally, finally got them going with services/chkconfig and now this...

Posted by Michael at Mon Aug 23 15:00:08 2010
Just a quick question, can the description be translated ?
I assume that this is not planned, as they are config files, not software, but as we are able to translate .desktop files, it would be great to have some way of doing it cleanly.

Posted by Patryk "patrys" Zawadzki at Mon Aug 23 15:07:40 2010
Any idea on when the systemd dependencies get released? Currently it requires unreleased stuff such as dbus-1.3.2.

Posted by Lennart at Mon Aug 23 15:10:54 2010
Shane, well, what makes you think that we haven't looked around ourselves? Also, we managed to get systemd accepted by Fedora, in particular FESCO. We managed to convince this technical committee that systemd is a good thing. Do you really want to say that Fedora as a whole is incapable of "seeing the big picture", but you are the only one who is? Maybe things are the other way round? Ever thought about that?

Also, note that systemd actually brings Linux administration much closer to how many of these things are done on Solaris. Much of what we added is inspired by SMF, and other init systems. That means the administrators should enjoy how we make things on Linux work much more like the other big server operating systems.

Posted by Lennart at Mon Aug 23 15:13:46 2010
Michael: it currently isn't translated, but the plan is to copy very closely the mechanism how .desktop files are translated (our unit definition files also use an .ini inspired format), so that we can reuse existing tools for this. This hasn't been implemented yet however.

Posted by Lennart at Mon Aug 23 15:20:51 2010
Patryk: I plan to roll D-Bus 1.4.0 by the end of this week. However I also plan to add a dependency on very new kernels to systemd, to make sure we can move the cgroup fs mount point to /sys. This means you have to either run an unreleased kernel or backport one patch to your older kernels, as we did in Fedora. So, basically by the end of this week the dependency on one unreleased package will go away, but we'll add another one instead. Sorry for that, but I don't think it would be wise to support the old cgroupfs mount point for longer, to make sure users don't get confused by that unnecessarily.

Posted by Paul Wise at Mon Aug 23 15:40:41 2010
It's a shame you missed the LCA2011 CFP deadline; I would have liked to attend a talk on systemd:

http://lca2011.linux.org.au/

Perhaps the organisers would consider a late submission.

Posted by lirqa at Mon Aug 23 15:51:43 2010
How fast will it be? How fast is the boot on your system?

Posted by Simon at Mon Aug 23 16:07:24 2010
Shane Falco, you are being dishonest.

Your concern is that this change would require you to learn new things and have to teach new things.

The way you should rephrase your questions is:

“Sorry for being off-topic; I am posting this on the For Admins post while my concern is really about "Does systemd offer so many nice things that justifies the change?". I would like to see the question answered: "What are the advantages of systemd that justify this big change? I did not search your previous posts on this subject."”

Posted by Michal at Mon Aug 23 16:18:50 2010
"systemd has been accepted as Feature for Fedora 14"

Probably will also be in the new Ubuntu 11.04 ;)

Thanks for your work!

Posted by Diego at Mon Aug 23 16:21:50 2010
What about gettext support?

Posted by Lennart at Mon Aug 23 16:42:09 2010
Diego: it's unlikely we'll use the gettext APIs inside of PID 1, simply because i18n data tends to be stored in /usr, and we try to avoid accesses to that, since some folks still have it on a separate partition (even though it is crazy and misses the point). However, for the client tools this is different and we'll certainly reuse the frameworks currently used by other projects, be it gettext or intltool, or the hacks to make .desktop files translatable.

Posted by Lennart at Mon Aug 23 16:47:53 2010
Paul, I actually submitted something to LCA, but speaking from experience I won't get funding for the flight. But at least I will be able to say "I have tried"...

Posted by Lennart at Mon Aug 23 16:50:32 2010
lirqa: see my comments regarding "speed" on http://lwn.net/Articles/401441/.

Posted by Lennart at Mon Aug 23 16:51:48 2010
Michal, it is unlikely that Ubuntu will acknowledge that systemd is the future and Upstart is not any time soon. Note that Upstart is a Canonical-funded project.

Posted by Michal at Mon Aug 23 17:21:50 2010
Lennart, Upstart was announced four years ago. Even the main developer isn't satisfied with v0.6. I don't see any progress in their repo. I would not be surprised if they just switched to systemd in the next year. Canonical doesn't have enough people to develop anything other than a new gnome desktop theme.

Posted by Matthew Jones at Mon Aug 23 17:37:33 2010
Lennart, I just watched the Debconf video about Debian looking to adopt Upstart.

The main issue that was stated for Debian not adopting Systemd, was their BSD kernel support. Will Systemd work with the BSD kernel? How backwards compatible is it for other Unix-like systems that are stuck with init.d scripts?

Posted by Lennart at Mon Aug 23 17:42:45 2010
Michal, after having talked to Keybuk a couple of times in the last months and acknowledging the fact he very recently still did talks on Upstart at Debconf and LinuxCon I fear that's wishful thinking, even if I too hope I am wrong on that.

Posted by Lennart at Mon Aug 23 17:46:50 2010
Matthew, systemd is Linux-only. We have no plans to support niche kernels. That would severely limit our technical options and hold Linux back unnecessarily. If Debian cares about those kernels, it's on them to provide support for it. Note however, that Upstart doesn't work on those other kernels either and similar to us has little interest in supporting them. Note that nothing stops Debian from shipping systemd on Linux by default and providing SysV compatibility scripts for the other OSes.

Posted by Omer Akram at Mon Aug 23 17:53:30 2010
It's my personal thinking but Upstart-1.0 is coming so tighten your seat belts.

Posted by Michal at Mon Aug 23 17:57:03 2010
Lennart, wait until the Canonical bosses read the sites with positive reviews of new Fedora/SuSe/etc versions. Phoronix will probably soon begin to do some benchmarks.

When they see that people see systemd as a breath of fresh air and the upstart as a failure to meet promises - they will throw it away.

SJR can write his code for Debian for free ;)

Posted by Michal at Mon Aug 23 17:58:29 2010
Omer Akram, "It's my personal thinking but Upstart-1.0 is coming so tighten your seat belts.".

Where?

I don't see anything here
http://bazaar.launchpad.net/~scott/upstart/trunk/changes

Posted by Simon at Mon Aug 23 18:20:36 2010
Michal, could you please stop the trolling re: upstart?

Posted by Diego at Mon Aug 23 18:23:55 2010
Ouch...however, doesn't this help in some way? http://www.gnu.org/software/gettext/manual/gettext.html#Locating-Catalogs

Posted by Lennart at Mon Aug 23 18:30:11 2010
Diego, well I am pretty sure people would hate me if i'd start moving i18n data to /lib...

Posted by Omer Akram at Mon Aug 23 18:30:37 2010
>I don't see anything here
>http://bazaar.launchpad.net/~scott/upstart/trunk/changes

that's for a surprise

Posted by oiaohm at Mon Aug 23 18:38:48 2010
I think you overlooked something in the PAM module/possible future feature.

Session disconnects and reconnects support.  This would be a great step forwards particularly if text based vt can be moved to X11 terminals and reverse.

Also a great feature for X11 servers in future.

Question: currently I read systemd as starting system-wide services.  Could it not be extended in future to also start and manage per-user services like pulseaudio and jackaudio?

So splitting these services away from normal user processes and making it simpler for users to restart them and 100 percent clean them up on failure.  "A service is a service no matter where it is running" would be a good policy.  This would also allow sandboxing of these services in a far more controlled way.  cgroups do process tracking to sandbox very well.

I guess systemd is fairly modular.  Could hooks be added for the Smack LSM as well as SELinux?  Those are the two mainline LSMs that use labels.  The rest of the LSMs don't.  So really, for full support of any mainline kernel a user might load, supporting both is required.

I hope one day to see systemd with GTK and QT front ends.  The start of serious, real, distribution-independent graphical management.

Posted by Diego at Mon Aug 23 19:02:27 2010
Why would Ubuntu switch so suddenly? Remember that systemd hasn't been deployed in any mainstream distro. They'll probably do it in the future, but...right now? Why would they even be interested?

As for Debian...well...it's not like the rest of the Linux world is going to wait for them. If they want to continue pushing for GNU/kfreeBSD while ubuntu dominates the linux desktop and centos the free server market share, that's fine for them.

Posted by Nagilum at Mon Aug 23 20:45:09 2010
If ntpd.service would have emitted some error message while starting up, how would I display that using systemd?

Posted by Lennart at Mon Aug 23 20:49:05 2010
Nagilum: by checking the logs. The long term plan is to hook up "systemctl status" to the logs, so that you'll see the most recent log messages generated by a service next to the service. But until that happens we need to beef up syslog considerably, i.e. make it indexable and stuff like that.

Posted by Rainer Weikusat at Mon Aug 23 21:35:14 2010
The reason to separate /usr from / is that it
contains architecture dependent, shareable data.
And that's still relevant today because of
the possibility to have 'Linux containers' which
share everything shareable with the host
installation they run on. Of course, this also
needs the ability to easily customize system
startup, say, by deleting scripts which are not
needed for a container instance (root-fs of that
having started out and remove parts of existing
scripts which serve no purpose in a container
instance.

And no, I'm not 'crazy' because I happen to have
some experience with the servers I operate you
are quite obviously lacking.

Posted by Lennart at Mon Aug 23 21:38:11 2010
Rainer, I am sorry. But you are completely misunderstanding the /usr vs. / split. Also note that most commercial Unixes already got rid of the distinction and symlink one to the other. Please read up on things before calling me a noob. Thanks.

Posted by Simon at Mon Aug 23 23:06:05 2010
How does pam_systemd relate to ConsoleKit? There seems to be some overlap with regard to maintaining info about current user sessions...

Posted by Lennart at Mon Aug 23 23:11:36 2010
Simon, yes, there's a non-trivial amount of duplication between CK and systemd. Note that Jon passed on half of the maintainership of CK to me and there's something like a consensus of the people involved to fully merge CK (or something equivalent) into systemd, in the long run at least.

Posted by Rahul Sundaram at Mon Aug 23 23:11:54 2010
Simon,

My understanding is that ConsoleKit will be obsoleted by Systemd in the near future.  Lennart is a maintainer of ConsoleKit as well for the time being.  Other distros not using systemd can continue to use ConsoleKit I guess

Posted by William Lovaton at Mon Aug 23 23:30:46 2010
I'm really impressed Lennart!.  Congratulations for your hard work, I can't wait for Fedora 15.

Thanks.

Posted by Claes at Tue Aug 24 00:23:52 2010
I am excited to see so much progress. I don't have much to bring to the table, a few reflections only about the terminology.

Having a kind of status called ACTIVE, and one of its states called active as well feels weird. And to see a string like "Active: maintenance" feels confusing. Likewise would "Active: active". I think something like "Status: failed" would communicate the situation better.

Posted by Lennart at Tue Aug 24 00:42:21 2010
Claes, well, status is too generic, because we have the high-level and the low-level state, which we need to distinguish somehow in the interface. One we called "active" state, the other "sub" state.

Also note that the word "status" (in contrast to state) is already used in the output of the exit status of the program.

Posted by Denice at Tue Aug 24 00:43:45 2010
I'm a little worried that anyone thinks Solaris' SMF is something worthy of copying.  I find it horribly over-engineered.  These days it is common to run virtual servers which do really only one thing (web server, or a mysql slave, or an ldap server).  I have a number of xen guests that list perhaps 15 'chkconfig-ed on' services:
chkconfig --list|grep :on

So from a system administrator's point of view, speaking of managing targeted servers and not multimedia desktops, I don't need anything complicated to manage runtime services.

You might want to seriously think about writing a tutorial for a typical small server (apache only, for example - no graphics, no bluetooth, no atd, no iscsi, etc.), and then convince us that systemd provides any value.

cheers, etc.

Posted by Shane at Tue Aug 24 01:49:13 2010
Denice said it better than I ever could.  As someone stuck with over a hundred Solaris 10 servers, I agree completely with her assessment.

Here's a nice little commentary on Apple's launchd which I feel is just as appropriate for systemd:

http://lowendmac.com/ed/winston/10kw/launchd.html

It's monolithic, it's "over engineered", and it does too many things.  In a nutshell, it's anti-unix.

Posted by Cameron Hutchison at Tue Aug 24 02:35:25 2010
"thus ensuring that everything ever logged on the system will properly end up in the log files"

Does this include timestamps being properly captured? When trying to debug delays with suspend/resume, the logs weren't much help since all the suspend and resume log messages had the same timestamp in the system logs.

Posted by Stan at Tue Aug 24 06:49:31 2010
A new init system is a great opportunity for distros to eliminate the minor (yet damaging) differences, so that a service written for one distro will be 100% compatible in another distro. A single code base also has the advantage of heavy testing and extermination of bugs.

By including special code for non-standard stuff like "SUSE extensions", systemd is just putting a bandaid on the problem instead of fixing it.

Posted by Anonymous at Tue Aug 24 06:57:59 2010
Would you consider writing more about the C-based init scripts?  I've had the general feeling for a long time that all distributions need to do the same small amount of work to bootstrap the early boot process, and I'd love to hear more about the common core you distilled it down to.  Obviously I can (and will) go read the C source, but I'd love to hear the higher-level view you've obtained by reviewing distributions.

Thanks!

Posted by Tomasz at Tue Aug 24 08:42:17 2010
oiaohm: user session support is in current systemd. For graphical insight look at "systemadm" (in fedora: systemd-gtk package).

Posted by Alexandr Kara at Tue Aug 24 10:28:37 2010
I must say I am impressed by the progress on systemd so far, but I am a little worried about one thing. You say that systemd requires a very recent kernel. Does that mean that when booted with an older kernel, it will just refuse to start? Or will it have some "compatibility" mode when it starts services in parallel and without using cgroups? Or maybe drop to old init (if still installed)?

Posted by Tshepang Lekhonkhobe at Tue Aug 24 11:54:20 2010
Lennart, rock on!

Posted by Karellen at Tue Aug 24 14:02:45 2010
@Shane:
> [systemd] does too many things


It manages the startup and lifetime of system processes. That's it.

From the article you linked:

> Merging periodically run jobs into the main system process doesn't make sense.


Why not? "cron" and "at" manage the startup of periodic system processes. The only thing they do differently from "init" is that they start the processes at a time other than bootup. Everything else is common between them. So why not de-duplicate the effort involved in starting, tracking and logging, and just allow "init" to start other processes at times other than boot?

> Replacing a simple /etc/crontab text file with multiple, awkwardly named XML plist files scattered among no less than four different directories is taking two big steps toward complexity.


There's no reason that systemd would be implemented that badly. In fact, I'm pretty sure that systemd reads existing "crontab" files just fine. So systemd doesn't require any changes there.

> Starting infrequently used on-demand socket-based daemons from launchd seems like it could open the main system process to a potential denial of service attack. I have not explored this idea or researched to see if it has already been tried,


Well, I haven't researched it, but that looks like nothing more than FUD and making-shit-up to me.

> One of the core principles of Unix programming is do one thing and do it well.


Like having one and only one place to consistently manage the startup and monitoring of system processes? Oh yeah, that's totally anti-Unix-philosophy.

Posted by Lennart at Tue Aug 24 14:23:31 2010
Cameron: the kernel log buffer only includes timestamps when this is enabled on the kernel command line. A good syslog implementation could read those timestamps and handle them properly. However, I think the current implementations unfortunately don't do that.
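
To sketch what "reading those timestamps" could look like: assuming kernel lines carry the usual `[seconds.microseconds]` prefix when timestamping is enabled, a log daemon could split the prefix off as below. This is an illustrative Python sketch, not code from any syslog implementation, and `split_kernel_stamp` is a made-up name.

```python
import re
from typing import Optional, Tuple

# Assumed prefix format when kernel timestamps are enabled: "[   12.345678] "
_STAMP = re.compile(r"^\[\s*(\d+)\.(\d{6})\]\s?(.*)$")


def split_kernel_stamp(line: str) -> Tuple[Optional[float], str]:
    """Return (seconds since boot, message), or (None, line) if unstamped."""
    match = _STAMP.match(line)
    if match is None:
        return None, line
    seconds = int(match.group(1)) + int(match.group(2)) / 1_000_000
    return seconds, match.group(3)


if __name__ == "__main__":
    print(split_kernel_stamp("[   12.500000] PM: resume of devices complete"))
    print(split_kernel_stamp("PM: resume of devices complete"))
```

The value is relative to boot, so a logger would still have to add the boot time to get a wall-clock timestamp.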

Stan, we only support OpenSUSE extensions for the LSB/SysV stuff which in the long run is legacy anyway.

Anonymous: there's no such thing as a C-based init script. That's a misconception.

Alexandr: yes, we require a very new kernel. Which is a safe requirement to make for something that needs to be integrated by the distributor anyway.

Posted by Anonymous at Tue Aug 24 15:08:39 2010
Lennart: You said in your post that "We reimplemented almost all boot-up and shutdown scripts of the standard Fedora install in much smaller, simpler and faster C utilities, or in systemd itself."  "C-based init scripts" seemed like a fair paraphrase of that sentence; would you prefer "C replacements for init scripts"?  Either way, I think my original question still applies; I'd love to hear more about them in the future, if you'd consider writing more about them.

Posted by Aleks at Tue Aug 24 15:58:55 2010
Great work Lennart! I'm very impressed by the progress of systemd and excited about trying it out.

Posted by Marius Gedminas at Tue Aug 24 16:59:14 2010
Could you post an example of a pretty process tree produced by systemd-cgls?

How does the systemd distinguish user processes that should be killed on logout from processes that should be left running (e.g. screen, nohup, wget -b)?

(Why does this form keep rejecting my comments?  Try #3.)

Posted by Lennart at Tue Aug 24 19:21:04 2010
Anonymous: well, what happens with the boot scripts depends on the case. One example: part of what the boot and shutdown scripts do is restore and save the random seed of /dev/random. This was previously done via some shell hackery. In systemd, we replaced that with a simple C program, i.e. this one: http://cgit.freedesktop.org/systemd/tree/src/random-seed.c -- which can easily be called from a simple .service unit in systemd, i.e. this one: http://cgit.freedesktop.org/systemd/tree/units/systemd-random-seed-load.service.in -- and that's all there is to it.
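The save-and-restore idea itself is small. The following Python sketch is not the linked C program, only an illustration of the general technique under stated assumptions: the seed size, the function names `load_seed` and `save_seed`, and the use of /dev/urandom as the default device are all choices made for this example.

```python
import os

SEED_SIZE = 512  # assumed seed size in bytes; the real tool may use another value

def load_seed(seed_path, random_dev="/dev/urandom"):
    """Write a previously saved seed into the random device, if one exists."""
    try:
        with open(seed_path, "rb") as f:
            seed = f.read(SEED_SIZE)
    except FileNotFoundError:
        return 0  # first boot: nothing to restore
    with open(random_dev, "wb") as dev:
        dev.write(seed)
    return len(seed)

def save_seed(seed_path, random_dev="/dev/urandom"):
    """Store fresh random bytes so the next boot has something to load."""
    with open(random_dev, "rb") as dev:
        seed = dev.read(SEED_SIZE)
    # Create the seed file readable by its owner only.
    fd = os.open(seed_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(seed)
    return len(seed)
```

In this sketch `load_seed` would run early at boot and `save_seed` at shutdown, each from its own small service unit.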

Marius, check http://www.freedesktop.org/wiki/Software/systemd/TipsAndTricks at the end. systemd doesn't distinguish between user processes that should be killed and those that should not. This is about security, and it's the administrator's decision whether to allow the user to keep processes around after logout or not, regardless of whether that process is called "screen" or "foobar" or whatever. However, privileged processes can escape this by making themselves a member of an arbitrary cgroup of the system, and thus avoid being killed when the user logs out. This could even be done via PAM: invoking the PAM session hooks will create a new session cgroup and move the calling process into it. For example, if it is desirable that the user may keep processes around after logout via screen and only screen, then screen should be patched to call into PAM (which I think it might actually already do in some cases). But again, just calling a process "screen" should never be something magic that allows you to keep a process around. This must be possible only via privileged code and not otherwise.
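To make the point concrete that membership in a cgroup, not the process name, decides what belongs to a session, here is a minimal Python sketch that inspects the text of /proc/<pid>/cgroup. It assumes the `hierarchy-ID:controller-list:cgroup-path` line format. The helper names, the `name=systemd` hierarchy label and the example paths are assumptions made for illustration.

```python
def parse_proc_cgroup(text):
    """Map each controller list to its cgroup path, given /proc/<pid>/cgroup text."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice, so a path containing ':' stays intact.
        _hier_id, controllers, path = line.split(":", 2)
        result[controllers] = path
    return result

def belongs_to(text, hierarchy, cgroup):
    """True if the process sits in `cgroup` or below it in the given hierarchy."""
    path = parse_proc_cgroup(text).get(hierarchy)
    if path is None:
        return False
    return path == cgroup or path.startswith(cgroup.rstrip("/") + "/")
```

With a hypothetical session cgroup such as `/user/alice/3`, a process that has been moved elsewhere by privileged code would no longer satisfy `belongs_to(...)` for that session, whatever its name is.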

Posted by Lennart at Tue Aug 24 19:37:50 2010
Denice, Linux is a scalable operating system. It is used on everything from big iron to the tiniest devices. With systemd we try to cover the whole range, and please understand that your specific use case is not the only one we need to cover.

Shane, you are right, systemd is nothing like traditional Unix. And that is a good thing. Unix was designed 41 years ago. Do you honestly believe that its design is perfect and flawless, and that 41 years after it was designed it should still be followed in every detail? No, computers changed, and Unix never was perfect. It probably was a better design than most other operating systems, but this does not mean it is perfect and we should never depart from it. systemd is inspired by Unix, but also by what has been done on MacOS, on Solaris, and even in the Windows world. We didn't copy any of the existing services 1:1; we just let ourselves be inspired by their best features, translated them to Linux, and added quite a bit of new stuff on top. And that's how it should be done. Unix is an inspiration, it is not the holy grail. Not 41 years after it was designed.

The fact that on traditional Unix the init system was separate from cron, from at, from inetd, from the dbus service activator and from everything else meant that all of them reimplemented a big chunk of their code, i.e. what was involved with spawning processes. It was a useless code duplication, and all implementations sucked at it in one way or another. Also, you could not run the same thing from more than one of these systems without manually ensuring that things would happen race-freely and properly ordered. In systemd we unified all of this. We use the same codepaths for spawning processes, regardless of whether they are started via timers, via sockets, via buses, at boot-up, via devices and so on. This allows us to reduce the amount of code duplication, and provide the same awesome process babysitting to all triggers. And that is a big big advantage. If you look at the systemd source code you will notice that the remaining amount of code, for example for doing timer-based spawning, is actually very very short, less than 500 lines (including comments and whitespace!). So overall, we simplify things drastically, we get rid of immense code duplication, and we still are a lot more powerful than what came before.
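The design idea of a single spawning path can be sketched in a few lines of Python. This is only an illustration of the principle, not how systemd is structured internally: the class `Spawner` and its method names are hypothetical, and the sketch waits for each process to exit instead of supervising it in the background.

```python
import subprocess

class Spawner:
    """One place that starts processes, whatever the trigger was."""

    def __init__(self):
        self.log = []  # (trigger, argv, exit status) for every spawn

    def spawn(self, trigger, argv):
        # Every trigger funnels into this single code path, so logging,
        # environment setup and supervision only need to be written once.
        status = subprocess.run(argv, stdout=subprocess.DEVNULL).returncode
        self.log.append((trigger, tuple(argv), status))
        return status

    def on_timer(self, argv):
        return self.spawn("timer", argv)

    def on_socket(self, argv):
        return self.spawn("socket", argv)

    def on_boot(self, argv):
        return self.spawn("boot", argv)
```

Each trigger-specific method stays a one-liner because everything of substance lives in `spawn`. That is the code-size argument made above, in miniature.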

So, in summary: just because we do things differently doesn't mean we do it worse.

And if you tell me that systemd is not Unixy, then I can only agree, and I don't feel ashamed at all of that. Because my horizon is much further than just Unix.

Posted by Denice at Wed Aug 25 02:12:31 2010
Lennart, my 'specific use case', as you put it, is pretty standard actually.  I'm managing 300+ Linux servers (and a few handfuls of Solaris boxen), and we simply don't run lots of services on any of them.  Linux system administrators don't let the plethora of services run that you have in your example above.  What I am looking at above seems to be a desktop.  How about an example like I mention in one of your posts - just a typical targeted server...

Posted by Riku at Wed Aug 25 13:03:57 2010
That's quite a bit of progress. I salute your "Get Things Done" attitude :)

Stupid question: What does systemd taking care of D-Bus activation mean? E.g., why is the current D-Bus activation insufficient, and how does systemd change that?

The timer part is exciting. But it doesn't replace atd and crond yet ;) According to the manpage, you seemingly can't set a timer to fire at a specific time or day, or daily.

Posted by Giovanni at Thu Aug 26 02:14:32 2010
I find Solaris SMF one of the most amazing features that we as sysadmins have to aid us in managing hundreds of servers and it's great that something similar is making its way into Linux. Way to go!

Posted by Bryan Horstmann-Allen at Thu Aug 26 09:45:29 2010
Denice: What happens when the Linux OOM killer freaks out and kills a bunch of your services? What ensures they get restarted? Or that they're even running at all? (I guess if you aren't running "a lot" of services, you aren't doing much at all anyway.)

If you aren't using some form of daemon management (runit, daemontools, etc), in addition to your monitoring, you have failed.

Lennart: Nice to see the trend to more mature service management in the Linux space, but further fragmentation is annoying... Is Upstart horribly broken, or simply not extensive enough?

The addition of an API to manage services (and everything else systemd appears to manage) is completely awesome. Can't wait to see a Puppet/Chef provider. :)

Posted by Bryan Horstmann-Allen at Thu Aug 26 09:48:19 2010
Ah, I see your post on Upstart. Nevermind. :-)

Posted by Karel at Thu Aug 26 13:25:34 2010
I really love basic Unix principles and I think that good software should be based on KISS rules. And from my point of view systemd is not a bad thing. (frankly, it looks better than PA:-)

It would be really nice to have one place where we manage system processes in userspace. The management should be integrated with Linux -- Linux means cgroups, udev, shared mount subtrees (namespaces), selinux, inotify, etc. It does not make any sense to ignore the modern technologies that are implemented in the kernel, or to use those technologies separately.

Posted by dissent at Thu Aug 26 16:14:06 2010
you must love to reimplement perfectly working stuff in a very "futuristic" way... and the talk about not caring for compatibility with "irrelevant" systems/distros makes you look so adventurous and sexy...

Posted by hreidmarr at Thu Aug 26 18:36:07 2010
I smell problems. Tons of them. And, as always, Fedora will be the catalyst.

Anyway, let the world burn!

Posted by fran at Fri Aug 27 16:18:46 2010
Hey dissent, yes we still love our commodore 64s too.

Stick to CentOS if you can't stand change.


It should be obvious but in case it isn't: the opinions reflected here are my own. They are not the views of my employer, or Ronald McDonald, or anyone else.

Please note that I take the liberty to delete any comments posted here that I deem inappropriate, off-topic, or insulting. And I exercise this liberty quite aggressively. So yes, if you comment here, I might censor you. If you don't want to be censored you are welcome to comment on your own blog instead.


Lennart Poettering <mzoybt (at) 0pointer (dot) net>