レナート   PID EINS!   ﻟﻴﻨﺎﺭﺕ

Fri, 04 Jul 2014

FUDCON + GNOME.Asia Beijing 2014

Thanks to the funding from FUDCON I had the chance to attend and keynote at the combined FUDCON Beijing 2014 and GNOME.Asia 2014 conference in Beijing, China.

My talk was about systemd's present and future: what we have achieved and where we are going. I tried to explain a bit where we are coming from, and how we changed focus from being purely an init system to being a set of basic building blocks to build an OS from. For most of the talk I discussed where we still intend to take systemd, which areas we believe should be covered by systemd, and of course the always difficult question of where to draw the line and what is clearly outside of the focus of systemd. You can find the slides of my talk online. (There is no video recording that I am aware of, sorry.)

The combined conferences were a lot of fun, and as usual, I had the best discussions in the hallway track, talking about Linux and systemd.

A number of pictures of the conference are now online. Enjoy!

After the conference I stayed for a few more days in Beijing, doing a bit of sightseeing. What a fantastic city! The food was amazing, we tried all kinds of fantastic stuff, from Peking duck to Sichuan-style bullfrog. Yummy. And one of these days I am sure I will find the time to actually sort my photos and put them online, too.

I am really looking forward to the next FUDCON/GNOME.Asia!

posted at: 18:43 | path: /projects | permanent link to this entry | comments

Tue, 17 Jun 2014

Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems

(Just a small heads-up: I don't blog as much as I used to, I nowadays update my Google+ page a lot more frequently. You might want to subscribe to that if you are interested in more frequent technical updates on what we are working on.)

In the past weeks we have been working on a couple of features for systemd that enable a number of new use cases I'd like to shed some light on. Taking advantage of the /usr merge that a number of distributions have completed, we want to bring the runtime behaviour of Linux systems to the next level. With the /usr merge completed, most static vendor-supplied OS data is found exclusively in /usr, and only a few additional bits in /var and /etc are necessary to make a system boot. We can build on this to enable a couple of new features:

  1. A mechanism we call Factory Reset shall flush out /etc and /var, but keep the vendor-supplied /usr, bringing the system back into a well-defined, pristine vendor state with no local state or configuration. This functionality is useful across the board from servers, to desktops, to embedded devices.
  2. A Stateless System goes one step further: a system like this never stores /etc or /var on persistent storage, but always comes up with pristine vendor state. On systems like this every reboot acts as a factory reset. This functionality is particularly useful for simple containers or systems that boot off the network or read-only media, and receive all configuration they need during runtime from vendor packages or protocols like DHCP or are capable of discovering their parameters automatically from the available hardware or periphery.
  3. Reproducible Systems multiply a vendor image into many containers or systems. Only local configuration or state is stored per-system, while the vendor operating system is pulled in from the same, immutable, shared snapshot. Each system hence has its private /etc and /var for receiving local configuration, however the OS tree in /usr is pulled in via bind mounts (in case of containers) or technologies like NFS (in case of physical systems), or btrfs snapshots from a golden master image. This is particularly interesting for containers where the goal is to run thousands of container images from the same OS tree. However, it also has a number of other use cases, for example thin client systems, which can boot the same NFS share a number of times. Furthermore this mechanism is useful to implement very simple OS installers, that simply deserialize a /usr snapshot into a file system, install a boot loader, and reboot.
  4. Verifiable Systems are closely related to stateless systems: if the underlying storage technology can cryptographically ensure that the vendor-supplied OS is trusted and in a consistent state, then it must be made sure that /etc or /var are either included in the OS image, or simply unnecessary for booting.
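
The Factory Reset idea from the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how systemd implements anything: the names `factory_reset` and `demo` are made up for this example, and the code works on a throwaway directory tree instead of a real root file system.

```python
import shutil
import tempfile
from pathlib import Path


def factory_reset(root: Path) -> None:
    """Empty <root>/etc and <root>/var while leaving <root>/usr untouched."""
    resolved = root.resolve()
    # Safety net for this sketch: never operate on the real file system root.
    if resolved == Path(resolved.anchor):
        raise ValueError("refusing to reset the real root directory")
    for name in ("etc", "var"):
        target = resolved / name
        if target.exists():
            shutil.rmtree(target)
        # Recreate the empty directory so a later boot could repopulate it.
        target.mkdir()


def demo() -> list:
    """Build a scratch tree, reset it, and list what is left afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "usr" / "lib").mkdir(parents=True)
        (root / "usr" / "lib" / "os-release").write_text("ID=example\n")
        (root / "etc").mkdir()
        (root / "etc" / "hostname").write_text("box\n")
        (root / "var" / "log").mkdir(parents=True)
        factory_reset(root)
        return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
```

Calling `demo()` shows that the vendor files below usr survive, while etc and var come back as empty directories.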


A number of Linux-based operating systems have tried to implement some of the schemes described above in one way or another. Particularly interesting are GNOME's OSTree, CoreOS and Google's Android and ChromeOS. They generally found different solutions for the specific problems you have when implementing schemes like this, sometimes taking shortcuts that keep only the specific case in mind and cannot cover the general purpose. With systemd now being at the core of so many distributions and deeply involved in bringing up and maintaining the system, we came to the conclusion that we should attempt to add generic support for setups like this to systemd itself, to open this up for the general purpose distributions to build on. We decided to focus on three kinds of systems:

  1. The stateful system, the traditional system as we know it with machine-specific /etc, /usr and /var, all properly populated.
  2. Startup without a populated /var, but with configured /etc. (We will call these volatile systems.)
  3. Startup without either /etc or /var. (We will call these stateless systems.)

A factory reset is just a special case of the latter two modes, where the system boots up without /var and /etc but the next boot is a normal stateful boot like the first described mode. Note that a mode where /etc is flushed but /var is not is nothing we intend to cover (why? well, the user ID question becomes much harder, see below, and we simply saw no use case for it worth the trouble).
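
The three kinds of systems can be told apart by checking which of the two directories are populated. The following Python sketch is only meant to make the distinction concrete; the function names and the "unsupported" label for the combination we do not intend to cover are assumptions of this example, not anything systemd exposes.

```python
import tempfile
from pathlib import Path


def is_populated(directory: Path) -> bool:
    """True if the directory exists and contains at least one entry."""
    return directory.is_dir() and any(directory.iterdir())


def boot_mode(root: Path) -> str:
    """Classify a tree by which of /etc and /var are populated."""
    etc = is_populated(root / "etc")
    var = is_populated(root / "var")
    if etc and var:
        return "stateful"
    if etc:
        return "volatile"      # configured /etc, empty /var
    if not var:
        return "stateless"     # neither /etc nor /var
    return "unsupported"       # /etc flushed but /var kept


def demo() -> list:
    """Walk a scratch tree through the three supported modes."""
    modes = []
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        modes.append(boot_mode(root))
        (root / "etc").mkdir()
        (root / "etc" / "hostname").write_text("box\n")
        modes.append(boot_mode(root))
        (root / "var" / "lib").mkdir(parents=True)
        modes.append(boot_mode(root))
    return modes
```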


Booting up a system without a populated /var is relatively straight-forward. With a few lines of tmpfiles configuration it is possible to populate /var with its basic structure in a way that is sufficient to make a system boot cleanly. systemd version 214 and newer ship with support for this. Of course, support for this scheme in systemd is only a small part of the solution. While a lot of software reconstructs the directory hierarchy it needs in /var automatically, much software does not. In cases like this it is necessary to ship a couple of additional tmpfiles lines that set up at boot-time the necessary files or directories in /var to make the software operate, similar to what RPM or DEB packages would set up at installation time.
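
To give an idea of what such tmpfiles lines do, here is a small Python sketch that reads directory lines in the tmpfiles.d column layout (type, path, mode, user, group, age) and creates the directories below a scratch root. It is a toy stand-in and not the real systemd-tmpfiles tool: it only understands the "d" type, ignores ownership and age, and the "exampled" paths are hypothetical.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical lines for an imaginary daemon called "exampled".
EXAMPLE_LINES = """\
# Type Path              Mode User Group Age
d      /var/lib/exampled 0755 root root  -
d      /var/log/exampled 0750 root root  -
"""


def create_directories(config: str, root: Path) -> list:
    """Create every 'd' entry of the configuration text below root."""
    created = []
    for raw in config.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2 or fields[0] != "d":
            continue  # this sketch handles plain directory lines only
        path = fields[1]
        has_mode = len(fields) > 2 and fields[2] != "-"
        mode = int(fields[2], 8) if has_mode else 0o755
        target = root / path.lstrip("/")
        target.mkdir(parents=True, exist_ok=True)
        os.chmod(target, mode)  # ownership is deliberately not changed here
        created.append(path)
    return created


def demo() -> list:
    """Apply the example lines to an empty scratch root and list the result."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        create_directories(EXAMPLE_LINES, root)
        return sorted(
            p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()
        )
```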

Booting up a system without a populated /etc is a more difficult task. In /etc we have a lot of configuration bits that are essential for the system to operate, for example and most importantly system user and group information in /etc/passwd and /etc/group. If the system boots up without /etc there must be a way to replicate the minimal information necessary in it, so that the system manages to boot up fully.

To make this even more complex, in order to support "offline" updates of /usr that are replicated into a number of systems possessing private /etc and /var, there needs to be a way for these directories to be upgraded transparently when necessary, for example by recreating caches like /etc/ld.so.cache or adding missing system users to /etc/passwd on next reboot.
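
The "adding missing system users" step can be pictured roughly like this. The Python sketch below works on the text of a passwd file (seven colon-separated fields per line) and appends entries for users that are not present yet. The UID range, the choice of GID equal to UID, the home directory and the shell path are all assumptions picked for the example, and the user names are hypothetical; this is not the logic systemd uses.

```python
def add_missing_system_users(passwd_text, wanted, first_uid=999, last_uid=100):
    """Return passwd text with an entry appended for each missing user.

    UIDs are handed out downwards from first_uid to last_uid, skipping
    any UID that is already taken (an assumption of this sketch).
    """
    lines = [line for line in passwd_text.splitlines() if line.strip()]
    names = {line.split(":")[0] for line in lines}
    used = {int(line.split(":")[2]) for line in lines}
    uid = first_uid
    for name in wanted:
        if name in names:
            continue  # already present, nothing to do
        while uid in used and uid >= last_uid:
            uid -= 1
        if uid < last_uid:
            raise RuntimeError("no free system UID left in the assumed range")
        # name:password:UID:GID:GECOS:home:shell
        lines.append(f"{name}:x:{uid}:{uid}::/:/sbin/nologin")
        used.add(uid)
        names.add(name)
    return "\n".join(lines) + "\n"


EXISTING = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "exampled:x:999:999::/:/sbin/nologin\n"
)
UPDATED = add_missing_system_users(EXISTING, ["exampled", "otherd"])
```

Running the same function again on its own output changes nothing, which is the property you want for something that runs on every boot.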

Starting with systemd 215 (yet unreleased, as I type this) we will ship with a number of features in systemd that make /etc-less boots functional:

What's Next

Pulling this all together we are very close to making boots with empty /etc and /var on general purpose Linux operating systems a reality. Of course, while doing the groundwork in systemd gets us some distance, there's a lot of work left. Most importantly: the majority of Linux packages are simply incompatible with this scheme the way they are currently set up. They do not work without configuration in /etc or state directories in /var; they do not drop system user information in /usr/lib/sysusers.d. However, we believe it's our job to do the groundwork, and to start somewhere.

So what does this mean for the next steps? Of course, currently very little of this is available in any distribution (if only because 215 isn't even released yet). However, this will hopefully change quickly. As soon as that is accomplished we can start working on making the other components of the OS work nicely in this scheme. If you are an upstream developer, please consider making your software work correctly if /etc and/or /var are not populated. This means:

If you are a packager, you can also help on making this all work:

Of course, before moving to declarative system user definitions you should consult with your distribution whether their packaging policy even allows that. Currently, most distributions will not, so we have to work to get this changed first.

Anyway, so much about what we have been working on and where we want to take this.


Before we finish, let me stress again why we are doing all this:

  1. For end-user machines like desktops, tablets or mobile phones, we want a generic way to implement factory reset, which the user can make use of when the system is broken (saves you support costs), or when he wants to sell it and get rid of his private data, and renew that "fresh car smell".
  2. For embedded machines we want a generic way to reset devices. We also want a way for every single boot to be identical to a factory reset, in a stateless system design.
  3. For all kinds of systems we want to centralize vendor data in /usr so that it can be strictly read-only, and fully cryptographically verified as one unit.
  4. We want to enable new kinds of OS installers that simply deserialize a vendor OS /usr snapshot into a new file system, install a boot loader and reboot, leaving all first-time configuration to the next boot.
  5. We want to enable new kinds of OS updaters that build on this, and manage a number of vendor OS /usr snapshots in verified states, and which can then update /etc and /var simply by rebooting into a newer version.
  6. We want to scale container setups naturally, by sharing a single golden master /usr tree with a large number of instances that simply maintain their own private /etc and /var for their private configuration and state, while still allowing clean updates of /usr.
  7. We want to make thin clients that share /usr across the network work by allowing stateless bootups. During all discussions on how /usr was to be organized this was frequently mentioned. A setup like this so far only worked in very specific cases; with this scheme we want to make this work in the general case.

Of course, we have no illusions, just doing the groundwork for all of this in systemd doesn't make this all a real-life solution yet. Also, it's very unlikely that all of Fedora (or any other general purpose distribution) will support this scheme for all its packages soon, however, we are quite confident that the idea is convincing, that we need to start somewhere, and that getting the most core packages adapted to this shouldn't be out of reach.

Oh, and of course, the concepts behind this are really not new, we know that. However, what's new here is that we try to make them available in a general purpose OS core, instead of special purpose systems.

Anyway, let's get the ball rolling! Let's make stateless systems a reality!

And that's all I have for now. I am sure this leaves a lot of questions open. If you have any, join us on IRC on #systemd on freenode or comment on Google+.

posted at: 18:13 | path: /projects | permanent link to this entry | comments

Mon, 01 Jul 2013

Upcoming Events

You are invited to three events:

Christoph Wickert set up a Fedora 19 Release Party here in Berlin! Please join us on Tuesday, July 2nd.

We'll have another Berlin Open Source Meetup on Sunday, July 14th.

And finally, there's going to be another systemd Hackfest, this time colocated with GUADEC, on Tuesday/Wednesday, August 6th/7th.

See you soon!

posted at: 01:04 | path: /projects | permanent link to this entry | comments

Sun, 09 Jun 2013

GNOME.Asia and LinuxCon Japan

Two weeks ago I attended GNOME.Asia/Seoul and LinuxCon Japan/Tokyo, thanks to sponsoring by the GNOME Foundation and the Linux Foundation. At GNOME.Asia I spoke about Sandboxed Applications for GNOME, and at LinuxCon Japan about the first three years of systemd. (I think at least the latter one was videotaped, and recordings might show up on the net eventually.) I like to believe both talks went pretty well, and helped to get the message across to the community about what we are working on, what our roadmap is, and what we expect from the various projects, especially GNOME. However, for me personally the hallway track was the most interesting part. The personal Q&A sessions regarding our work on kdbus, cgroups, systemd and related projects were highly interesting. In fact, at both conferences we had something like impromptu hackfests on the topics of kdbus and cgroups with some conference attendees. I also enjoyed the opportunity to be on Karen's upcoming GNOME podcast, recorded in a session at Gyeongbokgung Palace in Seoul (what better place could there be for a podcast recording?).

I'd like to thank the GNOME and Linux foundations for sponsoring my attendance to these conferences. I'd especially like to thank the organizers of GNOME.Asia for their perfectly organized conference!

posted at: 16:30 | path: /projects | permanent link to this entry | comments

Mon, 08 Apr 2013

It's Time Again!

My fellow Berliners! There's another Berlin Open Source Meetup scheduled for this Sunday! You are invited!

See you on Sunday!

posted at: 10:58 | path: /projects | permanent link to this entry | comments

Thu, 14 Mar 2013

What Are We Breaking Now?

At the end of February devconf.cz took place in Brno, Czech Republic. At the conference Kay Sievers, Harald Hoyer and I did two presentations about our work on systemd and about the systemd Journal. These talks were taped and the recordings are now available online.

First, here's our talk about What Are We Breaking Now?, in which we try to give an overview on what we are working on currently in the systemd context, and what we expect to do in the next few months. We cover Predictable Network Interface Names, the Boot Loader Spec, kdbus, the Apps framework, and more.

And then, I did my second talk about The systemd Journal, with a focus on how to practically make use of journalctl, as a day-to-day tool for administrators (these practical bits start around 28:40). The commands demoed here are all explained in an earlier blog story of mine.

Unfortunately, the audience questions are sometimes hard or impossible to understand from the videos, and sometimes the text on the slides is hard to read, but I still believe that the two talks are quite interesting.

posted at: 16:58 | path: /projects | permanent link to this entry | comments

Mon, 18 Feb 2013

systemd Hackfest!

Hey, you, systemd hacker, Fedora hacker! Listen up! This Thu/Fri is the systemd Hackfest in Brno/Czech Rep, right before devconf.cz! On Thursday we'll talk about (and hack on) all things systemd. And the hackfest Friday is going to be a Fedora Activity Day, so we'll have a focus on systemd integration into Fedora.

You are invited!

See you in Brno!

posted at: 18:59 | path: /projects | permanent link to this entry | comments

Sat, 26 Jan 2013

The Biggest Myths

Since we first proposed systemd for inclusion in the distributions it has been frequently discussed in many forums, mailing lists and conferences. In these discussions one can often hear certain myths about systemd, that are repeated over and over again, but certainly don't gain any truth by constant repetition. Let's take the time to debunk a few of them:

  1. Myth: systemd is monolithic.

    If you build systemd with all configuration options enabled you will build 69 individual binaries. These binaries all serve different tasks, and are neatly separated for a number of reasons. For example, we designed systemd with security in mind, hence most daemons run at minimal privileges (using kernel capabilities, for example) and are responsible for very specific tasks only, to minimize their security surface and impact. Also, systemd parallelizes the boot more than any prior solution. This parallelization happens by running more processes in parallel. Thus it is essential that systemd is nicely split up into many binaries and thus processes. In fact, many of these binaries[1] are separated out so nicely, that they are very useful outside of systemd, too.

    A package involving 69 individual binaries can hardly be called monolithic. What is different from prior solutions however, is that we ship more components in a single tarball, and maintain them upstream in a single repository with a unified release cycle.

  2. Myth: systemd is about speed.

    Yes, systemd is fast (A pretty complete userspace boot-up in ~900ms, anyone?), but that's primarily just a side-effect of doing things right. In fact, we never really sat down and optimized the last tiny bit of performance out of systemd. Instead, we actually frequently knowingly picked the slightly slower code paths in order to keep the code more readable. This doesn't mean being fast was irrelevant for us, but reducing systemd to its speed is certainly quite a misconception, since that is certainly not anywhere near the top of our list of goals.

  3. Myth: systemd's fast boot-up is irrelevant for servers.

    That is just completely not true. Many administrators actually are keen on reduced downtimes during maintenance windows. In High Availability setups it's kinda nice if the failed machine comes back up really fast. In cloud setups with a large number of VMs or containers the price of slow boots multiplies with the number of instances. Spending minutes of CPU and IO on really slow boots of hundreds of VMs or containers reduces your system's density drastically, heck, it even costs you more energy. Slow boots can be quite financially expensive. Then, fast booting of containers allows you to implement a logic such as socket activated containers, allowing you to drastically increase the density of your cloud system.

    Of course, in many server setups boot-up is indeed irrelevant, but systemd is supposed to cover the whole range. And yes, I am aware that often it is the server firmware that costs the most time at boot-up, and the OS is fast anyway compared to that, but well, systemd is still supposed to cover the whole range (see above...), and no, not all servers have such bad firmware, and certainly not VMs and containers, which are servers of a kind, too.[2]

  4. Myth: systemd is incompatible with shell scripts.

    This is entirely bogus. We just don't use them for the boot process, because we believe they aren't the best tool for that specific purpose, but that doesn't mean systemd is incompatible with them. You can easily run shell scripts as systemd services, heck, you can run scripts written in any language as systemd services, systemd doesn't care the slightest bit what's inside your executable. Moreover, we heavily use shell scripts for our own purposes, for installing, building, testing systemd. And you can stick your scripts in the early boot process, use them for normal services, you can run them during late shutdown, there are practically no limits.

  5. Myth: systemd is difficult.

    This also is entirely nonsense. A systemd platform is actually much simpler than traditional Linuxes because it unifies system objects and their dependencies as systemd units. The configuration file language is very simple, and we got rid of redundant configuration files. We provide uniform tools for much of the configuration of the system. The system is much less of a conglomerate than traditional Linuxes are. We also have pretty comprehensive documentation (all linked from the homepage) about pretty much every detail of systemd, and this not only covers admin/user-facing interfaces, but also developer APIs.

    systemd certainly comes with a learning curve. Everything does. However, we like to believe that it is actually simpler to understand systemd than a Shell-based boot for most people. Surprised we say that? Well, as it turns out, Shell is not a pretty language to learn, its syntax is arcane and complex. systemd unit files are substantially easier to understand, they do not expose a programming language, but are simple and declarative by nature. That all said, if you are experienced in shell, then yes, adopting systemd will take a bit of learning.

    To make learning easy we tried hard to provide the maximum compatibility to previous solutions. But not only that, on many distributions you'll find that some of the traditional tools will now even tell you -- while executing what you are asking for -- how you could do it with the newer tools instead, in a possibly nicer way.

    Anyway, the take-away is that systemd is probably as simple as such a system can be, and that we try hard to make it easy to learn. But yes, if you know sysvinit then adopting systemd will require a bit of learning, but quite frankly if you mastered sysvinit, then systemd should be easy for you.

  6. Myth: systemd is not modular.

    Not true at all. At compile time you have a number of configure switches to select what you want to build, and what not. And we document how you can select in even more detail what you need, going beyond our configure switches.

    This modularity is not totally unlike the one of the Linux kernel, where you can select many features individually at compile time. If the kernel is modular enough for you then systemd should be pretty close, too.

  7. Myth: systemd is only for desktops.

    That is certainly not true. With systemd we try to cover pretty much the same range as Linux itself does. While we care for desktop uses, we also care pretty much the same way for server uses, and embedded uses as well. You can bet that Red Hat wouldn't make it a core piece of RHEL7 if it wasn't the best option for managing services on servers.

    People from numerous companies work on systemd. Car manufacturers build it into cars, Red Hat uses it for a server operating system, and GNOME uses many of its interfaces for improving the desktop. You find it in toys, in space telescopes, and in wind turbines.

    Most features I most recently worked on are probably relevant primarily on servers, such as container support, resource management or the security features. We cover desktop systems pretty well already, and there are a number of companies doing systemd development for embedded, some even offer consulting services in it.

  8. Myth: systemd was created as result of the NIH syndrome.

    This is not true. Before we began working on systemd we were pushing for Canonical's Upstart to be widely adopted (and Fedora/RHEL used it too for a while). However, we eventually came to the conclusion that its design was inherently flawed at its core (at least in our eyes: most fundamentally, it leaves dependency management to the admin/developer, instead of solving this hard problem in code), and if something's wrong in the core you better replace it, rather than fix it. This was hardly the only reason though; other things came into play, such as the licensing/contribution agreement mess around it. NIH wasn't one of the reasons, though...[3]

  9. Myth: systemd is a freedesktop.org project.

    Well, systemd is certainly hosted at fdo, but freedesktop.org is little else but a repository for code and documentation. Pretty much any coder can request a repository there and dump his stuff there (as long as it's somewhat relevant for the infrastructure of free systems). There's no cabal involved, no "standardization" scheme, no project vetting, nothing. It's just a nice, free, reliable place to have your repository. In that regard it's a bit like SourceForge, github, kernel.org, just not commercial and without over-the-top requirements, and hence a good place to keep our stuff.

    So yes, we host our stuff at fdo, but the implied assumption of this myth, that there is a group of people who meet and then agree on what future free systems will look like, is entirely bogus.

  10. Myth: systemd is not UNIX.

    There's certainly some truth in that. systemd's sources do not contain a single line of code originating from original UNIX. However, we derive inspiration from UNIX, and thus there's a ton of UNIX in systemd. For example, the UNIX idea of "everything is a file" finds reflection in that in systemd all services are exposed at runtime in a kernel file system, the cgroupfs. Then, one of the original features of UNIX was multi-seat support, based on built-in terminal support. Text terminals are hardly the state of the art how you interface with your computer these days however. With systemd we brought native multi-seat support back, but this time with full support for today's hardware, covering graphics, mice, audio, webcams and more, and all that fully automatic, hotplug-capable and without configuration. In fact the design of systemd as a suite of integrated tools that each have their individual purposes but when used together are more than just the sum of the parts, that's pretty much at the core of UNIX philosophy. Then, the way our project is handled (i.e. maintaining much of the core OS in a single git repository) is much closer to the BSD model (which is a true UNIX, unlike Linux) of doing things (where most of the core OS is kept in a single CVS/SVN repository) than things on Linux ever were.

    Ultimately, UNIX is something different for everybody. For us systemd maintainers it is something we derive inspiration from. For others it is a religion, and much like the other world religions there are different readings and understandings of it. Some define UNIX based on specific pieces of code heritage, others see it just as a set of ideas, others as a set of commands or APIs, and even others as a definition of behaviours. Of course, it is impossible to ever make all these people happy.

    Ultimately the question whether something is UNIX or not matters very little. Being technically excellent is hardly exclusive to UNIX. For us, UNIX is a major influence (heck, the biggest one), but we also have other influences. Hence in some areas systemd will be very UNIXy, and in others a little bit less.

  11. Myth: systemd is complex.

    There's certainly some truth in that. Modern computers are complex beasts, and the OS running on them will hence have to be complex too. However, systemd is certainly not more complex than prior implementations of the same components. Much rather, it's simpler, and has less redundancy (see above). Moreover, building a simple OS based on systemd will involve far fewer packages than a traditional Linux did. Fewer packages makes it easier to build your system, gets rid of interdependencies and of much of the different behaviour of every component involved.

  12. Myth: systemd is bloated.

    Well, bloated certainly has many different definitions. But in most definitions systemd is probably the opposite of bloat. Since systemd components share a common code base, they tend to share much more code for common code paths. Here's an example: in a traditional Linux setup, sysvinit, start-stop-daemon, inetd, cron, dbus, all implemented a scheme to execute processes with various configuration options in a certain, hopefully clean environment. On systemd the code paths for all of this, for the configuration parsing, as well as the actual execution is shared. This means less code, less place for mistakes, less memory and cache pressure, and is thus a very good thing. And as a side-effect you actually get a ton more functionality for it...

    As mentioned above, systemd is also pretty modular. You can choose at build time which components you need, and which you don't need. People can hence specifically choose the level of "bloat" they want.

    When you build systemd, it only requires three dependencies: glibc, libcap and dbus. That's it. It can make use of more dependencies, but these are entirely optional.

    So, yeah, whichever way you look at it, it's really not bloated.

  13. Myth: systemd being Linux-only is not nice to the BSDs.

    Completely wrong. The BSD folks are pretty much uninterested in systemd. If systemd was portable, this would change nothing, they still wouldn't adopt it. And the same is true for the other Unixes in the world. Solaris has SMF, BSD has their own "rc" system, and they always maintained it separately from Linux. The init system is very close to the core of the entire OS. And these other operating systems hence define themselves among other things by their core userspace. The assumption that they'd adopt our core userspace if we just made it portable, is completely without any foundation.

  14. Myth: systemd being Linux-only makes it impossible for Debian to adopt it as default.

    Debian supports non-Linux kernels in their distribution. systemd won't run on those. Is that a problem though, and should that hinder them from adopting systemd as default? Not really. The folks who ported Debian to these other kernels were willing to invest time in a massive porting effort, they set up test and build systems, and patched and built numerous packages for their goal. The maintenance of both a systemd unit file and a classic init script for the packaged services is a negligible amount of work compared to that, especially since those scripts more often than not exist already.

  15. Myth: systemd could be ported to other kernels if its maintainers just wanted to.

    That is simply not true. Porting systemd to other kernels is not feasible. We just use too many Linux-specific interfaces. For a few one might find replacements on other kernels, some features one might want to turn off, but for most this is not really possible. Here's a small, very incomplete list: cgroups, fanotify, umount2(), /proc/self/mountinfo (including notification), /proc/swaps (same), udev, netlink, the structure of /sys, /proc/$PID/comm, /proc/$PID/cmdline, /proc/$PID/loginuid, /proc/$PID/stat, /proc/$PID/sessionid, /proc/$PID/exe, /proc/$PID/fd, tmpfs, devtmpfs, capabilities, namespaces of all kinds, various prctl()s, numerous ioctls, the mount() system call and its semantics, selinux, audit, inotify, statfs, O_DIRECTORY, O_NOATIME, /proc/$PID/root, waitid(), SCM_CREDENTIALS, SCM_RIGHTS, mkostemp(), /dev/input, ...

    And no, if you look at this list and pick out the few where you can think of obvious counterparts on other kernels, then think again, and look at the others you didn't pick, and the complexity of replacing them.

  16. Myth: systemd is not portable for no reason.

    Nonsense! We use the Linux-specific functionality because we need it to implement what we want. Linux has so many features that UNIX/POSIX didn't have, and we want to empower the user with them. These features are incredibly useful, but only if they are actually exposed in a friendly way to the user, and that's what we do with systemd.

  17. Myth: systemd uses binary configuration files.

    No idea who came up with this crazy myth, but it's absolutely not true. systemd is configured pretty much exclusively via simple text files. A few settings you can also alter with the kernel command line and via environment variables. There's nothing binary in its configuration (not even XML). Just plain, simple, easy-to-read text files.

  18. Myth: systemd is a feature creep.

    Well, systemd certainly covers more ground than it used to. It's not just an init system anymore, but the basic userspace building block to build an OS from, but we carefully make sure to keep most of the features optional. You can turn a lot off at compile time, and even more at runtime. Thus you can choose freely how much feature creeping you want.

  19. Myth: systemd forces you to do something.

    systemd is not the mafia. It's Free Software, you can do with it whatever you want, and that includes not using it. That's pretty much the opposite of "forcing".

  20. Myth: systemd makes it impossible to run syslog.

    Not true, we carefully made sure when we introduced the journal that all data is also passed on to any syslog daemon running. In fact, if something changed, then only that syslog gets more complete data now than it got before, since we now cover early boot stuff as well as STDOUT/STDERR of any system service.

  21. Myth: systemd is incompatible.

    We try very hard to provide the best possible compatibility with sysvinit. In fact, the vast majority of init scripts should work just fine on systemd, unmodified. However, there actually are indeed a few incompatibilities, but we try to document these and explain what to do about them. Ultimately every system that is not actually sysvinit itself will have a certain amount of incompatibilities with it since it will not share the exact same code paths.

    It is our goal to ensure that differences between the various distributions are kept at a minimum. That means unit files usually work just fine on a different distribution than the one you wrote them on, which is a big improvement over classic init scripts, which are very hard to write in a way that they run on multiple Linux distributions, due to numerous incompatibilities between them.

  22. Myth: systemd is not scriptable, because of its D-Bus use.

    Not true. Pretty much every single D-Bus interface systemd provides is also available in a command line tool, for example in systemctl, loginctl, timedatectl, hostnamectl, localectl and suchlike. You can easily call these tools from shell scripts, they open up pretty much the entire API from the command line with easy-to-use commands.

    That said, D-Bus actually has bindings for almost any scripting language this world knows. Even from the shell you can invoke arbitrary D-Bus methods with dbus-send or gdbus. If anything, this improves scriptability due to the good support of D-Bus in the various scripting languages.

  23. Myth: systemd requires you to use some arcane configuration tools instead of allowing you to edit your configuration files directly.

    Not true at all. We offer some configuration tools, and using them gets you a bit of additional functionality (for example, command line completion for all settings!), but there's no need at all to use them. You can always edit the files in question directly if you wish, and that's fully supported. Of course sometimes you need to explicitly reload configuration of some daemon after editing the configuration, but that's pretty much true for most UNIX services.

  24. Myth: systemd is unstable and buggy.

    Certainly not according to our data. We have been monitoring the Fedora bug tracker (and some others) closely for a long long time. The number of bugs is very low for such a central component of the OS, especially if you discount the numerous RFE bugs we track for the project. We are pretty good at keeping systemd out of the list of blocker bugs of the distribution. We have a relatively fast development cycle with mostly incremental changes to keep quality and stability high.

  25. Myth: systemd is not debuggable.

    False. Some people try to imply that the shell was a good debugger. Well, it isn't really. In systemd we provide you with actual debugging features instead. For example: interactive debugging, verbose tracing, the ability to mask any component during boot, and more. Also, we provide documentation for it.

    It's certainly well debuggable, we needed that for our own development work, after all. But we'll grant you one thing: it uses different debugging tools, we believe more appropriate ones for the purpose, though.

  26. Myth: systemd makes changes for the changes' sake.

    Very much untrue. We pretty much exclusively have technical reasons for the changes we make, and we explain them in the various pieces of documentation, wiki pages, blog articles, mailing list announcements. We try hard to avoid making incompatible changes, and if we do we try to document the why and how in detail. And if you wonder about something, just ask us!

  27. Myth: systemd is a Red-Hat-only project, is private property of some smart-ass developers, who use it to push their views to the world.

    Not true. Currently, there are 16 hackers with commit powers to the systemd git tree. Of these 16 only six are employed by Red Hat. The 10 others are folks from ArchLinux, from Debian, from Intel, even from Canonical, Mandriva, Pantheon and a number of community folks with full commit rights. And they frequently commit big stuff, major changes. Then, there are 374 individuals with patches in our tree, and they too came from a number of different companies and backgrounds, and many of those have way more than one patch in the tree. The discussions about where we want to take systemd are done in the open, on our IRC channel (#systemd on freenode, you are always welcome), on our mailing list, and on public hackfests (such as our next one in Brno, you are invited). We regularly attend various conferences, to collect feedback, to explain what we are doing and why, like few others do. We maintain blogs, engage in social networks (we actually have some pretty interesting content on Google+, and our Google+ Community is pretty alive, too), and try really hard to explain the why and the how of what we do, and to listen to feedback and figure out where the current issues are (for example, from that feedback we compiled this list of often heard myths about systemd...).

    What most systemd contributors probably share is a rough idea of what a good OS should look like, and the desire to make it happen. However, by the very nature of the project being Open Source and rooted in the community, systemd is just what people want it to be. If it's not what they want, they can drive the direction with patches and code, and if that's not feasible, there are numerous other options to use, too. systemd is never exclusive.

    One goal of systemd is to unify the dispersed Linux landscape a bit. We try to get rid of many of the more pointless differences of the various distributions in various areas of the core OS. As part of that we sometimes adopt schemes that were previously used by only one of the distributions and push it to a level where it's the default of systemd, trying to gently push everybody towards the same set of basic configuration. This is never exclusive though, distributions can continue to deviate from that if they wish, however, if they end up using the well-supported default their work becomes much easier and they might gain a feature or two. Now, as it turns out, more frequently than not we actually adopted schemes that were Debianisms, rather than Fedoraisms/Redhatisms, as the best supported scheme by systemd. For example, systems running systemd now generally store their hostname in /etc/hostname, something that used to be specific to Debian and now is used across distributions.

    One thing we'll grant you though, we sometimes can be smart-asses. We try to be prepared whenever we open our mouth, in order to be able to back up with facts what we claim. That might make us appear as smart-asses.

    But in general, yes, some of the more influential contributors of systemd work for Red Hat, but they are in the minority, and systemd is a healthy, open community with different interests, different backgrounds, just unified by a few rough ideas where the trip should go, a community where code and its design counts, and certainly not company affiliation.

  28. Myth: systemd doesn't support /usr split from the root directory.

    Nonsense. Since its beginnings systemd supports the --with-rootprefix= option to its configure script which allows you to tell systemd to neatly split up the stuff needed for early boot and the stuff needed for later on. All this logic is fully present and we keep it up-to-date right there in systemd's build system.

    Of course, we still don't think that actually booting with /usr unavailable is a good idea, but we support this just fine in our build system. This won't fix the inherent problems of the scheme that you'll encounter all across the board, but you can't blame that on systemd, because in systemd we support this just fine.

  29. Myth: systemd doesn't allow you to replace its components.

    Not true, you can turn off and replace pretty much any part of systemd, with very few exceptions. And those exceptions (such as journald) generally allow you to run an alternative side by side to it, while cooperating nicely with it.

  30. Myth: systemd's use of D-Bus instead of sockets makes it intransparent.

    This claim is already contradictory in itself: D-Bus uses sockets as transport, too. Hence whenever D-Bus is used to send something around, a socket is used for that too. D-Bus is mostly a standardized serialization of messages to send over these sockets. If anything this makes it more transparent, since this serialization is well documented, understood and there are numerous tracing tools and language bindings for it. This is very much unlike the usual homegrown protocols the various classic UNIX daemons use to communicate locally.

Hmm, did I write I just wanted to debunk a "few" myths? Maybe these were more than just a few... Anyway, I hope I managed to clear up a couple of misconceptions. Thanks for your time.


[1] For example, systemd-detect-virt, systemd-tmpfiles, systemd-udevd are.

[2] Also, we are trying to do our little part on maybe making this better. By exposing boot-time performance of the firmware more prominently in systemd's boot output we hope to shame the firmware writers to clean up their stuff.

[3] And anyways, guess which project includes a library "libnih" -- Upstart or systemd?[4]

[4] Hint: it's not systemd!

posted at: 02:43 | path: /projects | permanent link to this entry | comments

Wed, 09 Jan 2013

systemd for Administrators, Part XX

This is no time for procrastination, here is already the twentieth installment of my ongoing series on systemd for Administrators:

Socket Activated Internet Services and OS Containers

Socket Activation is an important feature of systemd. When we first announced systemd we already tried to make the point how great socket activation is for increasing parallelization and robustness of socket services, but also for simplifying the dependency logic of the boot. In this episode I'd like to explain why socket activation is an important tool for drastically improving how many services and even containers you can run on a single system with the same resource usage. Or in other words, how you can drive up the density of customer sites on a system while spending less on new hardware.

Socket Activated Internet Services

First, let's take a step back. What was socket activation again? -- Basically, socket activation simply means that systemd sets up listening sockets (IP or otherwise) on behalf of your services (without these running yet), and then starts (activates) the services as soon as the first connection comes in. Depending on the technology the services might idle for a while after having processed the connection and possible follow-up connections before they exit on their own, so that systemd will again listen on the sockets and activate the services again the next time they are connected to. For the client it is not visible whether the service it is interested in is currently running or not. The service's IP socket stays continuously connectable, no connection attempt ever fails, and all connects will be processed promptly.

A setup like this lowers resource usage: as services are only running when needed they only consume resources when required. Many internet sites and services can benefit from that. For example, web site hosters will have noticed that of the multitude of web sites that are on the Internet only a tiny fraction gets a continuous stream of requests: the huge majority of web sites still needs to be available all the time but gets requests only very infrequently. With a scheme like socket activation you take benefit of this. Hosting many of these sites on a single system like this and only activating their services as necessary allows a large degree of over-commit: you can run more sites on your system than the available resources actually allow. Of course, one shouldn't over-commit too much to avoid contention during peak times.

Socket activation like this is easy to use in systemd. Many modern Internet daemons already support socket activation out of the box (and for those which don't yet it's not hard to add). Together with systemd's instantiated units support it is easy to write a pair of service and socket templates that then may be instantiated multiple times, once for each site. Then, (optionally) make use of some of the security features of systemd to nicely isolate the customer's site's services from each other (think: each customer's service should only see the home directory of the customer, everybody else's directories should be invisible), and there you go: you now have a highly scalable and reliable server system, that serves a maximum of securely sandboxed services at a minimum of resources, and all nicely done with built-in technology of your OS.
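
For daemons that do not support socket activation yet, the receiving side is small. The following is a minimal sketch in C, not systemd's own code, of how an activated service can find out how many sockets it was handed: the service manager sets the environment variables LISTEN_PID and LISTEN_FDS, and the passed file descriptors start at 3. The function name passed_socket_count() and the sanity limit of 1024 are made up for this illustration; real code would normally just call sd_listen_fds(3), which ships with systemd.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* First file descriptor handed to a socket activated service. */
#define LISTEN_FDS_START 3

/*
 * Hypothetical helper, simplified from what sd_listen_fds(3) does.
 * Returns the number of sockets passed in by the service manager,
 * 0 if this process was not socket activated, or -1 if the
 * environment variables are malformed. The passed sockets are
 * LISTEN_FDS_START .. LISTEN_FDS_START + n - 1.
 */
int passed_socket_count(pid_t self) {
        const char *pid_str = getenv("LISTEN_PID");
        const char *fds_str = getenv("LISTEN_FDS");
        char *end;
        long v;

        if (!pid_str || !fds_str)
                return 0;

        /* The variables only apply to the process named in LISTEN_PID. */
        v = strtol(pid_str, &end, 10);
        if (*pid_str == '\0' || *end != '\0')
                return -1;
        if ((pid_t) v != self)
                return 0;

        /* 1024 is an arbitrary sanity limit chosen for this sketch. */
        v = strtol(fds_str, &end, 10);
        if (*fds_str == '\0' || *end != '\0' || v < 0 || v > 1024)
                return -1;

        return (int) v;
}
```

A daemon would call passed_socket_count(getpid()) early in main() and, if the result is positive, accept connections on file descriptor 3 and up instead of creating its own listening socket.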

This kind of setup is already in production use in a number of companies. For example, the great folks at Pantheon are running their scalable instant Drupal system on a setup that is similar to this. (In fact, Pantheon's David Strauss pioneered this scheme. David, you rock!)

Socket Activated OS Containers

All of the above can already be done with older versions of systemd. If you use a distribution that is based on systemd, you can right away set up a system like the one explained above. But let's take this one step further. With systemd 197 (to be included in Fedora 19), we added support for socket activating not only individual services, but entire OS containers. And I really have to say it at this point: this is stuff I am really excited about. ;-)

Basically, with socket activated OS containers, the host's systemd instance will listen on a number of ports on behalf of a container, for example one for SSH, one for web and one for the database, and as soon as the first connection comes in, it will spawn the container this is intended for, and pass to it all three sockets. Inside of the container, another systemd is running and will accept the sockets and then distribute them further, to the services running inside the container using normal socket activation. The SSH, web and database services will only see the inside of the container, even though they have been activated by sockets that were originally created on the host! Again, to the clients this all is not visible. That an entire OS container is spawned, triggered by a simple network connection, is entirely transparent to the client side.[1]

The OS containers may contain (as the name suggests) a full operating system, that might even be a different distribution than is running on the host. For example, you could run your host on Fedora, but run a number of Debian containers inside of it. The OS containers will have their own systemd init system, their own SSH instances, their own process tree, and so on, but will share a number of other facilities (such as memory management) with the host.

For now, only systemd's own trivial container manager, systemd-nspawn has been updated to support this kind of socket activation. We hope that libvirt-lxc will soon gain similar functionality. At this point, let's see in more detail how such a setup is configured in systemd using nspawn:

First, please use a tool such as debootstrap or yum's --installroot to set up a container OS tree[2]. The details of that are a bit out-of-focus for this story, there's plenty of documentation around how to do this. Of course, make sure you have systemd v197 installed inside the container. For accessing the container from the command line, consider using systemd-nspawn itself. After you configured everything properly, try to boot it up from the command line with systemd-nspawn's -b switch.

Assuming you now have a working container that boots up fine, let's write a service file for it, to turn the container into a systemd service on the host you can start and stop. Let's create /etc/systemd/system/mycontainer.service on the host:

[Unit]
Description=My little container

[Service]
ExecStart=/usr/bin/systemd-nspawn -jbD /srv/mycontainer 3

This service can already be started and stopped via systemctl start and systemctl stop. However, there's no nice way to actually get a shell prompt inside the container. So let's add SSH to it, and even more: let's configure SSH so that a connection to the container's SSH port will socket-activate the entire container. First, let's begin with telling the host that it shall now listen on the SSH port of the container. Let's create /etc/systemd/system/mycontainer.socket on the host:

[Unit]
Description=The SSH socket of my little container

[Socket]
ListenStream=23

If we start this unit with systemctl start on the host then it will listen on port 23, and as soon as a connection comes in it will activate our container service we defined above. We pick port 23 here, instead of the usual 22, as our host's SSH is already listening on that. nspawn virtualizes the process list and the file system tree, but does not actually virtualize the network stack, hence we just pick different ports for the host and the various containers here.

Of course, the system inside the container doesn't yet know what to do with the socket it gets passed due to socket activation. If you'd now try to connect to the port, the container would start up but the incoming connection would be immediately closed since the container can't handle it yet. Let's fix that!

All that's necessary for that is to teach SSH inside the container socket activation. For that let's simply write a pair of socket and service units for SSH. Let's create /etc/systemd/system/sshd.socket in the container:

[Unit]
Description=SSH Socket for Per-Connection Servers

[Socket]
ListenStream=23
Accept=yes

Then, let's add the matching SSH service file /etc/systemd/system/sshd@.service in the container:

[Unit]
Description=SSH Per-Connection Server for %I

[Service]
ExecStart=-/usr/sbin/sshd -i
StandardInput=socket

Then, make sure to hook sshd.socket into the sockets.target so that unit is started automatically when the container boots up:

ln -s /etc/systemd/system/sshd.socket /etc/systemd/system/sockets.target.wants/

And that's it. If we now activate mycontainer.socket on the host, the host's systemd will bind the socket and we can connect to it. If we do this, the host's systemd will activate the container, and pass the socket in to it. The container's systemd will then take the socket, match it up with sshd.socket inside the container. As there's still our incoming connection queued on it, it will then immediately trigger an instance of sshd@.service, and we'll have our login.

And that's already everything there is to it. You can easily add additional sockets to listen on to mycontainer.socket. Everything listed therein will be passed to the container on activation, and will be matched up as well as possible with all socket units configured inside the container. Sockets that cannot be matched up will be closed, and sockets that aren't passed in but are configured for listening will be bound by the container's systemd instance.

So, let's take a step back again. What did we gain through all of this? Well, basically, we can now offer a number of full OS containers on a single host, and the containers can offer their services without running continuously. The density of OS containers on the host can hence be increased drastically.

Of course, this only works for kernel-based virtualization, not for hardware virtualization. i.e. something like this can only be implemented on systems such as libvirt-lxc or nspawn, but not in qemu/kvm.

If you have a number of containers set up like this, here's one cool thing the journal allows you to do. If you pass -m to journalctl on the host, it will automatically discover the journals of all local containers and interleave them on display. Nifty, eh?

With systemd 197 you have everything to set up your own socket activated OS containers on-board. However, there are a couple of improvements we're likely to add soon: for example, right now even if all services inside the container exit on idle, the container still will stay around, and we really should make it exit on idle too, if all its services exited and no logins are around. As it turns out we already have much of the infrastructure for this around: we can reuse the auto-suspend functionality we added for laptops: detecting when a laptop is idle and suspending it then is a very similar problem to detecting when a container is idle and shutting it down then.

Anyway, this blog story is already way too long. I hope I haven't lost you half-way already with all this talk of virtualization, sockets, services, different OSes and stuff. I hope this blog story is a good starting point for setting up powerful highly scalable server systems. If you want to know more, consult the documentation and drop by our IRC channel. Thank you!


[1] And BTW, this is another reason why fast boot times the way systemd offers them are actually a really good thing on servers, too.

[2] To make it easy: you need a command line such as yum --releasever=19 --nogpg --installroot=/srv/mycontainer/ --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal to install Fedora, and debootstrap --arch=amd64 unstable /srv/mycontainer/ to install Debian. Also see the bottom of systemd-nspawn(1). Also note that auditing is currently broken for containers, and if enabled in the kernel will cause all kinds of errors in the container. Use audit=0 on the host's kernel command line to turn it off.

posted at: 18:58 | path: /projects | permanent link to this entry | comments

Tue, 08 Jan 2013

systemd for Administrators, Part XIX

Happy new year 2013! Here is now the nineteenth installment of my ongoing series on systemd for Administrators:

Detecting Virtualization

When we started working on systemd we had a closer look at what the various existing init scripts used on Linux were actually doing. Among other things we noticed that a number of them were checking explicitly whether they were running in a virtualized environment (i.e. in a kvm, VMWare, LXC guest or suchlike) or not. Some init scripts disabled themselves in such cases[1], others enabled themselves only in such cases[2]. Frequently, it would probably have been a better idea to check for other conditions rather than explicitly checking for virtualization, but after looking at this from all sides we came to the conclusion that in many cases explicitly conditionalizing services based on detected virtualization is a valid thing to do. As a result we added a new configuration option to systemd that can be used to conditionalize services this way: ConditionVirtualization; we also added a small tool that can be used in shell scripts to detect virtualization: systemd-detect-virt(1); and finally, we added a minimal bus interface to query this from other applications.

Detecting whether your code is run inside a virtualized environment is actually not that hard. Depending on what precisely you want to detect it's little more than running the CPUID instruction and maybe checking a few files in /sys and /proc. The complexity is mostly about knowing the strings to look for, and keeping this list up-to-date. Currently, the virtualization detection code in systemd can detect the following virtualization systems:

Let's have a look at how one may make use of this functionality.

Conditionalizing Units

Adding ConditionVirtualization to the [Unit] section of a unit file is enough to conditionalize it depending on which virtualization is used or whether one is used at all. Here's an example:

[Unit]
Description=My Foobar Service (runs only on guests)
ConditionVirtualization=yes

Instead of specifying "yes" or "no" it is possible to specify the ID of a specific virtualization solution (Example: "kvm", "vmware", ...), or either "container" or "vm" to check whether the kernel is virtualized or the hardware. Also, checks can be prefixed with an exclamation mark ("!") to invert a check. For further details see the manual page.

In Shell Scripts

In shell scripts it is easy to check for virtualized systems with the systemd-detect-virt(1) tool. Here's an example:

if systemd-detect-virt -q ; then
        echo "Virtualization is used: $(systemd-detect-virt)"
else
        echo "No virtualization is used."
fi

If this tool is run it will return with an exit code of zero (success) if a virtualization solution has been found, non-zero otherwise. It will also print a short identifier of the used virtualization solution, which can be suppressed with -q. Also, with the -c and -v parameters it is possible to detect only kernel or only hardware virtualization environments. For further details see the manual page.

In Programs

Whether virtualization is available is also exported on the system bus:

$ gdbus call --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop/systemd1 --method org.freedesktop.DBus.Properties.Get org.freedesktop.systemd1.Manager Virtualization

This property contains the empty string if no virtualization is detected. Note that some container environments cannot be detected directly from unprivileged code. That's why we expose this property on the bus rather than providing a library -- the bus implicitly solves the privilege problem quite nicely.

Note that all of this will only ever detect and return information about the "inner-most" virtualization solution. If you stack virtualization ("We must go deeper!") then these interfaces will expose the one the code is most directly interfacing with. Specifically that means that if a container solution is used inside of a VM, then only the container is generally detected and returned.


[1] For example: running certain device management service in a container environment that has no access to any physical hardware makes little sense.

[2] For example: some VM solutions work best if certain vendor-specific userspace components are running that connect the guest with the host in some way.

posted at: 21:19 | path: /projects | permanent link to this entry | comments

Thu, 03 Jan 2013

Third Berlin Open Source Meetup

The Third Berlin Open Source Meetup is going to take place on Sunday, January 20th. You are invited!

It's a public event, so everybody is welcome, and please feel free to invite others!

posted at: 23:20 | path: /projects | permanent link to this entry | comments

Thu, 15 Nov 2012

foss.in Needs Your Funding!

One of the most exciting conferences in the Free Software world, foss.in in Bangalore, India has trouble finding enough sponsorship for this year's edition. Many speakers from all around the Free Software world (including yours truly) have signed up to present at the event, and the conference would appreciate any corporate funding they can get!

Please check if your company can help and contact the organizers for details!

See you in Bangalore!


posted at: 13:05 | path: /projects | permanent link to this entry | comments

Thu, 25 Oct 2012

systemd for Developers III

Here's the third episode of my systemd for Developers series.

Logging to the Journal

In a recent blog story intended for administrators I shed some light on how to use the journalctl(1) tool to browse and search the systemd journal. In this blog story for developers I want to explain a little how to get log data into the systemd Journal in the first place.

The good thing is that getting log data into the Journal is not particularly hard, since there's a good chance the Journal already collects it anyway and writes it to disk. The journal collects:

  1. All data logged via libc syslog()
  2. The data from the kernel logged with printk()
  3. Everything written to STDOUT/STDERR of any system service

This covers pretty much all of the traditional log output of a Linux system, including messages from the kernel initialization phase, the initial RAM disk, the early boot logic, and the main system runtime.


Let's have a quick look how syslog() is used again. Let's write a journal message using this call:

#include <syslog.h>

int main(int argc, char *argv[]) {
        syslog(LOG_NOTICE, "Hello World");
        return 0;
}

This is C code, of course. Many higher level languages provide APIs that allow writing local syslog messages. Regardless of which language you choose, all data written like this ends up in the Journal.

Let's have a look how this looks after it has been written into the journal (this is the JSON output journalctl -o json-pretty generates):

{
        "_BOOT_ID" : "5335e9cf5d954633bb99aefc0ec38c25",
        "_TRANSPORT" : "syslog",
        "PRIORITY" : "5",
        "_UID" : "500",
        "_GID" : "500",
        "_AUDIT_SESSION" : "2",
        "_AUDIT_LOGINUID" : "500",
        "_SYSTEMD_CGROUP" : "/user/lennart/2",
        "_SYSTEMD_SESSION" : "2",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "_MACHINE_ID" : "a91663387a90b89f185d4e860000001a",
        "_HOSTNAME" : "epsilon",
        "_COMM" : "test-journal-su",
        "_CMDLINE" : "./test-journal-submit",
        "SYSLOG_FACILITY" : "1",
        "_EXE" : "/home/lennart/projects/systemd/test-journal-submit",
        "_PID" : "3068",
        "SYSLOG_IDENTIFIER" : "test-journal-submit",
        "MESSAGE" : "Hello World!",
        "_SOURCE_REALTIME_TIMESTAMP" : "1351126905014938"
}

This nicely shows how the Journal implicitly augmented our little log message with various meta data fields which describe in more detail the context our message was generated from. For an explanation of the various fields, please refer to systemd.journal-fields(7).


If you are writing code that is run as a systemd service, generating journal messages is even easier:

#include <stdio.h>

int main(int argc, char *argv[]) {
        printf("Hello World\n");
        return 0;
}

Yupp, that's easy, indeed.

The printed string in this example is logged at a default log priority of LOG_INFO[1]. Sometimes it is useful to change the log priority for such a printed string. When systemd parses STDOUT/STDERR of a service it will look for priority values enclosed in < > at the beginning of each line[2], following the scheme used by the kernel's printk() which in turn took inspiration from the BSD syslog network serialization of messages. We can make use of this systemd feature like this:

#include <stdio.h>

#define PREFIX_NOTICE "<5>"

int main(int argc, char *argv[]) {
        printf(PREFIX_NOTICE "Hello World\n");
        return 0;
}

Nice! Logging with nothing but printf() but we still get log priorities!

This scheme works with any programming language, including, of course, shell:


echo "<5>Hello World"
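
To make the convention concrete, here is a small sketch in C of what a reader of such output has to do with each line. It is not systemd's parser, just an illustration under two assumptions: only single-digit priorities 0 to 7 are recognized, and lines without a prefix get the LOG_INFO default mentioned above. The name split_priority() is hypothetical.

```c
#include <stddef.h>
#include <syslog.h>

/*
 * Hypothetical helper: splits an optional "<N>" prefix (N = 0..7)
 * off a log line. Returns the priority and stores the start of the
 * message text in *text. Lines without a valid prefix are returned
 * unchanged with the default priority LOG_INFO.
 */
int split_priority(const char *line, const char **text) {
        if (line[0] == '<' &&
            line[1] >= '0' && line[1] <= '7' &&
            line[2] == '>') {
                *text = line + 3;
                return line[1] - '0';
        }

        *text = line;
        return LOG_INFO;
}
```

Fed the line "<5>Hello World", this yields LOG_NOTICE and the text "Hello World"; a plain "Hello World" comes back untouched at LOG_INFO.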

Native Messages

Now, what I explained above is not particularly exciting: the take-away is pretty much only that things end up in the journal if they are output using the traditional message printing APIs. Yaaawn!

Let's make this more interesting, let's look at what the Journal provides as native APIs for logging, and let's see what its benefits are. Let's translate our little example into the 1:1 counterpart using the Journal's logging API sd_journal_print(3):

#include <systemd/sd-journal.h>

int main(int argc, char *argv[]) {
        sd_journal_print(LOG_NOTICE, "Hello World");
        return 0;
}

This doesn't look much more interesting than the two examples above, right? After compiling this with `pkg-config --cflags --libs libsystemd-journal` appended to the compiler parameters, let's have a closer look at the JSON representation of the journal entry this generates:

{
        "_BOOT_ID" : "5335e9cf5d954633bb99aefc0ec38c25",
        "PRIORITY" : "5",
        "_UID" : "500",
        "_GID" : "500",
        "_AUDIT_SESSION" : "2",
        "_AUDIT_LOGINUID" : "500",
        "_SYSTEMD_CGROUP" : "/user/lennart/2",
        "_SYSTEMD_SESSION" : "2",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "_MACHINE_ID" : "a91663387a90b89f185d4e860000001a",
        "_HOSTNAME" : "epsilon",
        "CODE_FUNC" : "main",
        "_TRANSPORT" : "journal",
        "_COMM" : "test-journal-su",
        "_CMDLINE" : "./test-journal-submit",
        "CODE_FILE" : "src/journal/test-journal-submit.c",
        "_EXE" : "/home/lennart/projects/systemd/test-journal-submit",
        "MESSAGE" : "Hello World",
        "CODE_LINE" : "4",
        "_PID" : "3516",
        "_SOURCE_REALTIME_TIMESTAMP" : "1351128226954170"
}

This looks pretty much the same, right? Almost! Three lines are new compared to the earlier output: CODE_FUNC, CODE_FILE and CODE_LINE. Yes, you guessed it, by using sd_journal_print() meta information about the generating source code location is implicitly appended to each message[3], which is helpful for a developer to identify the source of a problem.
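
Since the JSON form is machine-readable, the source location can be picked out of an entry with a few lines of code. The following is only a sketch: it parses a shortened copy of the entry above that is embedded as sample data, not live journal output, and SAMPLE and source_location are names invented for this example.

```python
import json

# Shortened copy of the entry shown above, used as inline sample data.
SAMPLE = """{
    "PRIORITY" : "5",
    "CODE_FUNC" : "main",
    "CODE_FILE" : "src/journal/test-journal-submit.c",
    "MESSAGE" : "Hello World",
    "CODE_LINE" : "4",
    "_PID" : "3516"
}"""


def source_location(entry):
    """Return 'file:line (function)', or None if CODE_* fields are missing."""
    try:
        return "{}:{} ({})".format(
            entry["CODE_FILE"], entry["CODE_LINE"], entry["CODE_FUNC"])
    except KeyError:
        # Entries logged via syslog() or printf() carry no location fields.
        return None


entry = json.loads(SAMPLE)
print(entry["MESSAGE"], "from", source_location(entry))
```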

The primary reason for using the Journal's native logging APIs is not just the source code location however: it is to allow passing additional structured log messages from the program into the journal. This additional log data may then be used to search the journal, is available for consumption by other programs, and might help the administrator to track down issues beyond what is expressed in the human readable message text. Here's an example how to do that with sd_journal_send():

#include <systemd/sd-journal.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
        sd_journal_send("MESSAGE=Hello World!",
                        "MESSAGE_ID=52fb62f99e2c49d89cfbf9d6de5e3555",
                        "PRIORITY=5",
                        "HOME=%s", getenv("HOME"),
                        "TERM=%s", getenv("TERM"),
                        "PAGE_SIZE=%li", sysconf(_SC_PAGESIZE),
                        "N_CPUS=%li", sysconf(_SC_NPROCESSORS_ONLN),
                        NULL);
        return 0;
}

This will write a log message to the journal much like the earlier examples. However, this time a few additional, structured fields are attached:

        "__CURSOR" : "s=ac9e9c423355411d87bf0ba1a9b424e8;i=5930;b=5335e9cf5d954633bb99aefc0ec38c25;m=16544f875b;t=4ccd863cdc4f0;x=896defe53cc1a96a",
        "__REALTIME_TIMESTAMP" : "1351129666274544",
        "__MONOTONIC_TIMESTAMP" : "95903778651",
        "_BOOT_ID" : "5335e9cf5d954633bb99aefc0ec38c25",
        "PRIORITY" : "5",
        "_UID" : "500",
        "_GID" : "500",
        "_AUDIT_SESSION" : "2",
        "_AUDIT_LOGINUID" : "500",
        "_SYSTEMD_CGROUP" : "/user/lennart/2",
        "_SYSTEMD_SESSION" : "2",
        "_SELINUX_CONTEXT" : "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
        "_MACHINE_ID" : "a91663387a90b89f185d4e860000001a",
        "_HOSTNAME" : "epsilon",
        "CODE_FUNC" : "main",
        "_TRANSPORT" : "journal",
        "_COMM" : "test-journal-su",
        "_CMDLINE" : "./test-journal-submit",
        "CODE_FILE" : "src/journal/test-journal-submit.c",
        "_EXE" : "/home/lennart/projects/systemd/test-journal-submit",
        "MESSAGE" : "Hello World!",
        "_PID" : "4049",
        "CODE_LINE" : "6",
        "MESSAGE_ID" : "52fb62f99e2c49d89cfbf9d6de5e3555",
        "HOME" : "/home/lennart",
        "TERM" : "xterm-256color",
        "PAGE_SIZE" : "4096",
        "N_CPUS" : "4",
        "_SOURCE_REALTIME_TIMESTAMP" : "1351129666241467"

Awesome! Our simple example worked! The five meta data fields we attached to our message appeared in the journal. We used sd_journal_send() for this which works much like sd_journal_print() but takes a NULL terminated list of format strings each followed by its arguments. The format strings must include the field name and a '=' before the values.

Our little structured message included seven fields. The first three we passed are well-known fields:

  1. MESSAGE= is the actual human readable message part of the structured message.
  2. PRIORITY= is the numeric message priority value as known from BSD syslog formatted as an integer string.
  3. MESSAGE_ID= is a 128-bit ID that identifies our specific message call, formatted as hexadecimal string. We randomly generated this string with journalctl --new-id128. This can be used by applications to track down all occurrences of this specific message. The 128-bit ID can be a UUID, but this is not a requirement or enforced.

Applications may relatively freely define additional fields as they see fit (we defined four pretty arbitrary ones in our example). A complete list of the currently well-known fields is available in systemd.journal-fields(7).

Let's see how the message ID helps us find this message and all its occurrences in the journal:

$ journalctl MESSAGE_ID=52fb62f99e2c49d89cfbf9d6de5e3555
-- Logs begin at Thu, 2012-10-18 04:07:03 CEST, end at Thu, 2012-10-25 04:48:21 CEST. --
Oct 25 03:47:46 epsilon test-journal-se[4049]: Hello World!
Oct 25 04:40:36 epsilon test-journal-se[4480]: Hello World!

Seems I already invoked this example tool twice!

Many messages systemd itself generates have message IDs. This is useful for example, to find all occurrences where a program dumped core (journalctl MESSAGE_ID=fc2e22bc6ee647b6b90729ab34a250b1), or when a user logged in (journalctl MESSAGE_ID=8d45620c1a4348dbb17410da57c60c66). If your application generates a message that might be interesting to recognize in the journal stream later on, we recommend attaching such a message ID to it. You can easily allocate a new one for your message with journalctl --new-id128.
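
If journalctl is not at hand, an ID of the same shape (128 random bits, written as 32 hexadecimal characters) can be produced in other ways. The sketch below uses Python's uuid module as a stand-in, which is an assumption of this example and not something the journal requires; new_message_id is a made-up helper name.

```python
import uuid


def new_message_id():
    """Return a random 128-bit ID as 32 lowercase hexadecimal characters."""
    return uuid.uuid4().hex


# Allocate the ID once and paste it into the source as a constant,
# so that every occurrence of the message carries the same ID.
if __name__ == "__main__":
    print(new_message_id())
```

The important part is that the ID is generated once at development time and then hard-coded; generating a fresh one on every call would defeat the purpose of matching on it.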

This example shows how we can use the Journal's native APIs to generate structured, recognizable messages. You can do much more than this with the C API. For example, you may store binary data in journal fields as well, which is useful to attach coredumps or hard disk SMART states to events where this applies. In order to make this blog story not longer than it already is we'll not go into detail about how to do this, and I ask you to check out sd_journal_send(3) for further information on this.


The examples above focus on C. Structured logging to the Journal is also available from other languages. Along with systemd itself we ship bindings for Python. Here's an example how to use this:

from systemd import journal
journal.send('Hello world')
journal.send('Hello, again, world', FIELD2='Greetings!', FIELD3='Guten tag')

Other bindings exist for Node.js, PHP and Lua.


Generating structured data is a very useful feature for services to make their logs more accessible both for administrators and other programs. In addition to the implicit structure the Journal adds to all logged messages it is highly beneficial if the various components of our stack also provide explicit structure in their messages, coming from within the processes themselves.

Porting an existing program to the Journal's logging APIs comes with one pitfall though: the Journal is Linux-only. If non-Linux portability matters for your project it's a good idea to provide an alternative log output, and make it selectable at compile-time.

Regardless which way to log you choose, in all cases we'll forward the message to a classic syslog daemon running side-by-side with the Journal, if there is one. However, much of the structured meta data of the message is not forwarded since the classic syslog protocol simply has no generally accepted way to encode this and we shouldn't attempt to serialize meta data into classic syslog messages which might turn /var/log/messages into an unreadable dump of machine data. Anyway, to summarize this: regardless if you log with syslog(), printf(), sd_journal_print() or sd_journal_send(), the message will be stored and indexed by the journal and it will also be forwarded to classic syslog.

And that's it for today. In a follow-up episode we'll focus on retrieving messages from the Journal using the C API, possibly filtering for a specific subset of messages. Later on, I hope to give a real-life example how to port an existing service to the Journal's logging APIs. Stay tuned!


[1] This can be changed with the SyslogLevel= service setting. See systemd.exec(5) for details.

[2] Interpretation of the < > prefixes of logged lines may be disabled with the SyslogLevelPrefix= service setting. See systemd.exec(5) for details.

[3] Appending the code location to the log messages can be turned off at compile time by defining -DSD_JOURNAL_SUPPRESS_LOCATION.

posted at: 04:29 | path: /projects | permanent link to this entry | comments

Wed, 24 Oct 2012

systemd for Administrators, Part XVIII

Hot on the heels of the previous story, here's now the eighteenth installment of my ongoing series on systemd for Administrators:

Managing Resources

An important facet of modern computing is resource management: if you run more than one program on a single machine you want to assign the available resources to them enforcing particular policies. This is particularly crucial on smaller, embedded or mobile systems where the scarce resources are the main constraint, but equally for large installations such as cloud setups, where resources are plenty, but the number of programs/services/containers on a single node is drastically higher.

Traditionally, on Linux only one policy was really available: all processes got about the same CPU time, or IO bandwidth, modulated a bit via the process nice value. This approach is very simple and covered the various uses for Linux quite well for a long time. However, it has drawbacks: not all processes deserve to be even, and services involving lots of processes (think: Apache with a lot of CGI workers) this way would get more resources than services with very few (think: syslog).

When thinking about service management for systemd, we quickly realized that resource management must be core functionality of it. In a modern world -- regardless if server or embedded -- controlling CPU, Memory, and IO resources of the various services cannot be an afterthought, but must be built-in as first-class service settings. And it must be per-service and not per-process as the traditional nice values or POSIX Resource Limits were.

In this story I want to shed some light on what you can do to enforce resource policies on systemd services. Resource Management in one way or another has been available in systemd for a while already, so it's really time we introduce this to the broader audience.

In an earlier blog post I highlighted the difference between Linux Control Groups (cgroups) as a labelled, hierarchical grouping mechanism, and Linux cgroups as a resource controlling subsystem. While systemd requires the former, the latter is optional. And this optional latter part is now what we can make use of to manage per-service resources. (At this point, it's probably a good idea to read up on cgroups before reading on, to get at least a basic idea what they are and what they accomplish. Even though the explanations below will be pretty high-level, it all makes a lot more sense if you grok the background a bit.)

The main Linux cgroup controllers for resource management are cpu, memory and blkio. To make use of these, they need to be enabled in the kernel, which many distributions (including Fedora) do. systemd exposes a couple of high-level service settings to make use of these controllers without requiring too much knowledge of the gory kernel details.

Managing CPU

As a nice default, if the cpu controller is enabled in the kernel, systemd will create a cgroup for each service when starting it. Without any further configuration this already has one nice effect: on a systemd system every system service will get an even amount of CPU, regardless how many processes it consists of. Or in other words: on your web server MySQL will get roughly the same amount of CPU as Apache, even if the latter consists of 1000 CGI script processes, but the former only of a few worker tasks. (This behavior can be turned off, see DefaultControllers= in /etc/systemd/system.conf.)

On top of this default, it is possible to explicitly configure the CPU shares a service gets with the CPUShares= setting. The default value is 1024. If you increase this number you'll assign more CPU to the service than to an unaltered one at 1024; if you decrease it, less.
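
Shares are relative weights, not absolute percentages: they only matter when services actually compete for the CPU. As a rough worked example, assuming a hypothetical machine where just two services are busy at the same time, the split can be computed like this (cpu_fractions and the service names are invented for this sketch):

```python
def cpu_fractions(shares):
    """Map each service to its fraction of CPU time under full contention."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


# Hypothetical machine with only two busy services.
fractions = cpu_fractions({"httpd.service": 1500, "mysqld.service": 1024})
for name, fraction in sorted(fractions.items()):
    print("{}: {:.1%}".format(name, fraction))
```

With 1500 shares against the default of 1024, the first service ends up with roughly 59% of the contended CPU time and the second with roughly 41%. If one of them is idle, the other may use the whole CPU.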

Let's see in more detail, how we can make use of this. Let's say we want to assign Apache 1500 CPU shares instead of the default of 1024. For that, let's create a new administrator service file for Apache in /etc/systemd/system/httpd.service, overriding the vendor supplied one in /usr/lib/systemd/system/httpd.service, but let's change the CPUShares= parameter:

.include /usr/lib/systemd/system/httpd.service

[Service]
CPUShares=1500

The first line will pull in the vendor service file. Now, let's reload systemd's configuration and restart Apache so that the new service file is taken into account:

systemctl daemon-reload
systemctl restart httpd.service

And yeah, that's already it, you are done!

(Note that setting CPUShares= in a unit file will cause the specific service to get its own cgroup in the cpu hierarchy, even if cpu is not included in DefaultControllers=.)

Analyzing Resource usage

Of course, changing resource assignments without actually understanding the resource usage of the services in question is like blind flying. To help you understand the resource usage of all services, we created the tool systemd-cgtop, that will enumerate all cgroups of the system, determine their resource usage (CPU, Memory, and IO) and present them in a top-like fashion. Building on the fact that systemd services are managed in cgroups this tool hence can present to you for services what top shows you for processes.

Unfortunately, by default cgtop will only be able to chart CPU usage per-service for you, IO and Memory are only tracked as total for the entire machine. The reason for this is simply that by default there are no per-service cgroups in the blkio and memory controller hierarchies but that's what we need to determine the resource usage. The best way to get this data for all services is to simply add the memory and blkio controllers to the aforementioned DefaultControllers= setting in system.conf.

Managing Memory

To enforce limits on memory systemd provides the MemoryLimit= and MemorySoftLimit= settings for services, which apply to the summed-up memory of all of a service's processes. These settings take memory sizes in bytes that are the total memory limit for the service. They understand the usual K, M, G, T suffixes for Kilobyte, Megabyte, Gigabyte, Terabyte (to the base of 1024).

.include /usr/lib/systemd/system/httpd.service


(Analogous to CPUShares= above, setting this option will cause the service to get its own cgroup in the memory cgroup hierarchy.)
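
To make the base-1024 suffixes concrete, here is a small sketch that converts such a size string into bytes. It is not systemd's own parser, merely an illustration of the arithmetic; parse_size is a made-up helper and only the upper-case suffixes named above are handled.

```python
# Multipliers for the suffixes, to the base of 1024.
_SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_size(text):
    """Convert a string such as '512M' or '1G' into a number of bytes."""
    text = text.strip()
    if text and text[-1] in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[text[-1]]
    # No suffix: the value is already given in bytes.
    return int(text)


for example in ("4096", "512M", "1G"):
    print(example, "=", parse_size(example), "bytes")
```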

Managing Block IO

To control block IO multiple settings are available. First of all BlockIOWeight= may be used which assigns an IO weight to a specific service. In behaviour the weight concept is not unlike the shares concept of CPU resource control (see above). However, the default weight is 1000, and the valid range is from 10 to 1000:

.include /usr/lib/systemd/system/httpd.service


Optionally, per-device weights can be specified:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=/dev/disk/by-id/ata-SAMSUNG_MMCRE28G8MXP-0VBL1_DC06K01009SE009B5252 750

Instead of specifying an actual device node you may also specify any path in the file system:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOWeight=/home/lennart 750

If the specified path does not refer to a device node systemd will determine the block device /home/lennart is on, and assign the bandwidth weight to it.

You can even add per-device and normal lines at the same time, which will set the per-device weight for the device, and the other value as default for everything else.

Alternatively one may control explicit bandwidth limits with the BlockIOReadBandwidth= and BlockIOWriteBandwidth= settings. These settings take a pair of device node and bandwidth rate (in bytes per second) or of a file path and bandwidth rate:

.include /usr/lib/systemd/system/httpd.service

[Service]
BlockIOReadBandwidth=/var/log 5M

This sets the maximum read bandwidth on the block device backing /var/log to 5MB/s.

(Analogous to CPUShares= and MemoryLimit=, using any of these three settings will result in the service getting its own cgroup in the blkio hierarchy.)

Managing Other Resource Parameters

The options described above cover only a small subset of the available controls the various Linux control group controllers expose. We picked these and added high-level options for them since we assumed that these are the most relevant for most folks, and that they really needed a nice interface that can handle units properly and resolve block device names.

In many cases the options explained above might not be sufficient for your usecase, but a low-level kernel cgroup setting might help. It is easy to make use of these options from systemd unit files, without having them covered with a high-level setting. For example, sometimes it might be useful to set the swappiness of a service. The kernel makes this controllable via the memory.swappiness cgroup attribute, but systemd does not expose it as a high-level option. Here's how you use it nonetheless, using the low-level ControlGroupAttribute= setting:

.include /usr/lib/systemd/system/httpd.service

[Service]
ControlGroupAttribute=memory.swappiness 70

(Analogous to the other cases, this too causes the service to be added to the memory hierarchy.)

Later on we might add more high-level controls for the various cgroup attributes. In fact, please ping us if you frequently use one and believe it deserves more focus. We'll consider adding a high-level option for it then. (Even better: send us a patch!)

Disclaimer: note that making use of the various resource controllers does have a runtime impact on the system. Enforcing resource limits comes at a price. If you do use them, certain operations do get slower. Especially the memory controller has (used to have?) a bad reputation to come at a performance cost.

For more details on all of this, please have a look at the documentation of the mentioned unit settings, and of the cpu, memory and blkio controllers.

And that's it for now. Of course, this blog story only focussed on the per-service resource settings. On top of this, you can also set the more traditional, well-known per-process resource settings, which will then be inherited by the various subprocesses, but always only be enforced per-process. More specifically that's IOSchedulingClass=, IOSchedulingPriority=, CPUSchedulingPolicy=, CPUSchedulingPriority=, CPUAffinity=, LimitCPU= and related. These do not make use of cgroup controllers and have a much lower performance cost. We might cover those in a later article in more detail.

posted at: 04:11 | path: /projects | permanent link to this entry | comments

systemd for Administrators, Part XVII

It's that time again, here's now the seventeenth installment of my ongoing series on systemd for Administrators:

Using the Journal

A while back I already posted a blog story introducing some functionality of the journal, and how it is exposed in systemctl. In this episode I want to explain a few more uses of the journal, and how you can make it work for you.

If you are wondering what the journal is, here's an explanation in a few words to get you up to speed: the journal is a component of systemd, that captures Syslog messages, Kernel log messages, initial RAM disk and early boot messages as well as messages written to STDOUT/STDERR of all services, indexes them and makes this available to the user. It can be used in parallel, or in place of a traditional syslog daemon, such as rsyslog or syslog-ng. For more information, see the initial announcement.

The journal has been part of Fedora since F17. With Fedora 18 it now has grown into a reliable, powerful tool to handle your logs. Note however, that on F17 and F18 the journal is configured by default to store logs only in a small ring-buffer in /run/log/journal, i.e. not persistent. This of course limits its usefulness quite drastically but is sufficient to show a bit of recent log history in systemctl status. For Fedora 19, we plan to change this, and enable persistent logging by default. Then, journal files will be stored in /var/log/journal and can grow much larger, thus making the journal a lot more useful.

Enabling Persistency

In the meantime, on F17 or F18, you can enable journald's persistent storage manually:

# mkdir -p /var/log/journal

After that, it's a good idea to reboot, to get some useful structured data into your journal to play with. Oh, and since you have the journal now, you don't need syslog anymore (unless having /var/log/messages as text file is a necessity for you), so you can choose to deinstall rsyslog:

# yum remove rsyslog


Now we are ready to go. The following text shows a lot of features of systemd 195 as it will be included in Fedora 18[1], so if your F17 can't do the tricks you see, please wait for F18. First, let's start with some basics. To access the logs of the journal use the journalctl(1) tool. To have a first look at the logs, just type in:

# journalctl

If you run this as root you will see all logs generated on the system, from system components the same way as for logged in users. The output you will get looks like a pixel-perfect copy of the traditional /var/log/messages format, but actually has a couple of improvements over it:

Note that in this blog story I will not actually show you any of the output this generates, I cut that out for brevity -- and to give you a reason to try it out yourself with a current image for F18's development version with systemd 195. But I do hope you get the idea anyway.

Access Control

Browsing logs this way is already pretty nice. But requiring to be root sucks of course, even administrators tend to do most of their work as unprivileged users these days. By default, Journal users can only watch their own logs, unless they are root or in the adm group. To make watching system logs more fun, let's add ourselves to adm:

# usermod -a -G adm lennart

After logging out and back in as lennart I now have access to the full journal of the system and all users:

$ journalctl

Live View

If invoked without parameters journalctl will show me the current log database. Sometimes one needs to watch logs as they grow, where one previously used tail -f /var/log/messages:

$ journalctl -f

Yes, this does exactly what you expect it to do: it will show you the last ten log lines and then wait for changes and show them as they take place.

Basic Filtering

When invoking journalctl without parameters you'll see the whole set of logs, beginning with the oldest message stored. That of course, can be a lot of data. Much more useful is just viewing the logs of the current boot:

$ journalctl -b

This will show you only the logs of the current boot, with all the aforementioned gimmicks. But sometimes even this is way too much data to process. So what about just listing all the real issues to care about: all messages of priority levels ERROR and worse, from the current boot:

$ journalctl -b -p err

If you reboot only seldom the -b makes little sense, filtering based on time is much more useful:

$ journalctl --since=yesterday

And there you go, all log messages from the day before at 00:00 in the morning until right now. Awesome! Of course, we can combine this with -p err or a similar match. But humm, we are looking for something that happened on the 15th of October, or was it the 16th?

$ journalctl --since=2012-10-15 --until="2012-10-16 23:59:59"

Yupp, there we go, we found what we were looking for. But humm, I noticed that some CGI script in Apache was acting up earlier today, let's see what Apache logged at that time:

$ journalctl -u httpd --since=00:00 --until=9:30

Oh, yeah, there we found it. But hey, wasn't there an issue with that disk /dev/sdc? Let's figure out what was going on there:

$ journalctl /dev/sdc

OMG, a disk error![2] Hmm, let's quickly replace the disk before we lose data. Done! Next! -- Hmm, didn't I see that the vpnc binary made a booboo? Let's check for that:

$ journalctl /usr/sbin/vpnc

Hmm, I don't get this, this seems to be some weird interaction with dhclient, let's see both outputs, interleaved:

$ journalctl /usr/sbin/vpnc /usr/sbin/dhclient

That did it! Found it!

Advanced Filtering

Whew! That was awesome already, but let's turn this up a notch. Internally systemd stores each log entry with a set of implicit meta data. This meta data looks a lot like an environment block, but actually is a bit more powerful: values can take binary, large values (though this is the exception, and usually they just contain UTF-8), and fields can have multiple values assigned (an exception too, usually they only have one value). This implicit meta data is collected for each and every log message, without user intervention. The data will be there, and wait to be used by you. Let's see how this looks:

$ journalctl -o verbose -n
Tue, 2012-10-23 23:51:38 CEST [s=ac9e9c423355411d87bf0ba1a9b424e8;i=4301;b=5335e9cf5d954633bb99aefc0ec38c25;m=882ee28d2;t=4ccc0f98326e6;x=f21e8b1b0994d7ee]
        _CMDLINE=avahi-daemon: registering [epsilon.local]
        MESSAGE=Joining mDNS multicast group on interface wlan0.IPv4 with address

(I cut out a lot of noise here, I don't want to make this story overly long. -n without parameter shows you the last 10 log entries, but I cut out all but the last.)

With the -o verbose switch we enabled verbose output. Instead of showing a pixel-perfect copy of classic /var/log/messages that only includes a minimal subset of what is available we now see all the gory details the journal has about each entry. But it's highly interesting: there is user credential information, SELinux bits, machine information and more. For a full list of common, well-known fields, see the man page.

Now, as it turns out the journal database is indexed by all of these fields, out-of-the-box! Let's try this out:

$ journalctl _UID=70

And there you go, this will show all log messages logged from Linux user ID 70. As it turns out one can easily combine these matches:

$ journalctl _UID=70 _UID=71

Specifying two matches for the same field will result in a logical OR combination of the matches. All entries matching either will be shown, i.e. all messages from either UID 70 or 71.

$ journalctl _HOSTNAME=epsilon _COMM=avahi-daemon

You guessed it, if you specify two matches for different field names, they will be combined with a logical AND. All entries matching both will be shown now, meaning all messages from processes named avahi-daemon on host epsilon.

But of course, that's not fancy enough for us. We are computer nerds after all, we live off logical expressions. We must go deeper!

$ journalctl _HOSTNAME=theta _UID=70 + _HOSTNAME=epsilon _COMM=avahi-daemon

The + is an explicit OR you can use in addition to the implied OR when you match the same field twice. The line above hence means: show me everything from host theta with UID 70, or of host epsilon with a process name of avahi-daemon.
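
The combination rules can be summed up in a few lines of code. The following is merely a model of the semantics described above, operating on plain dictionaries instead of a real journal, and it assumes for simplicity that each field of an entry has exactly one value; parse_matches, entry_matches and the sample entries are invented for this sketch.

```python
def parse_matches(args):
    """Split arguments on '+' into groups of {field: set(values)}."""
    groups, current = [], {}
    for arg in args:
        if arg == "+":
            # Explicit OR: close the current group and start a new one.
            groups.append(current)
            current = {}
        else:
            field, value = arg.split("=", 1)
            # Same field given twice: the values are alternatives (OR).
            current.setdefault(field, set()).add(value)
    groups.append(current)
    return groups


def entry_matches(entry, groups):
    """True if the entry satisfies at least one group."""
    return any(
        # Different fields within one group must all match (AND).
        all(entry.get(field) in values for field, values in group.items())
        for group in groups
    )


# Made-up sample entries standing in for journal records.
ENTRIES = [
    {"_HOSTNAME": "theta", "_UID": "70", "_COMM": "httpd"},
    {"_HOSTNAME": "epsilon", "_UID": "70", "_COMM": "avahi-daemon"},
    {"_HOSTNAME": "epsilon", "_UID": "71", "_COMM": "sshd"},
]

groups = parse_matches(
    ["_HOSTNAME=theta", "_UID=70", "+",
     "_HOSTNAME=epsilon", "_COMM=avahi-daemon"])
selected = [entry for entry in ENTRIES if entry_matches(entry, groups)]
print(selected)
```

Here the first two sample entries are selected: one through the group before the +, the other through the group after it.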

And now, it becomes magic!

That was already pretty cool, right? Right! But heck, who can remember all those values a field can take in the journal, I mean, seriously, who has thaaaat kind of photographic memory? Well, the journal has:

$ journalctl -F _SYSTEMD_UNIT

This will show us all values the field _SYSTEMD_UNIT takes in the database, or in other words: the names of all systemd services which ever logged into the journal. This makes it super-easy to build nice matches. But wait, turns out this all is actually hooked up with shell completion on bash! This gets even more awesome: as you type your match expression you will get a list of well-known field names, and of the values they can take! Let's figure out how to filter for SELinux labels again. We remember the field name was something with SELINUX in it, let's try that:

$ journalctl _SE<TAB>

And yupp, it's immediately completed:

$ journalctl _SELINUX_CONTEXT=

Cool, but what's the label again we wanted to match for?

$ journalctl _SELINUX_CONTEXT=<TAB><TAB>
kernel                                                       system_u:system_r:local_login_t:s0-s0:c0.c1023               system_u:system_r:udev_t:s0-s0:c0.c1023
system_u:system_r:accountsd_t:s0                             system_u:system_r:lvm_t:s0                                   system_u:system_r:virtd_t:s0-s0:c0.c1023
system_u:system_r:avahi_t:s0                                 system_u:system_r:modemmanager_t:s0-s0:c0.c1023              system_u:system_r:vpnc_t:s0
system_u:system_r:bluetooth_t:s0                             system_u:system_r:NetworkManager_t:s0                        system_u:system_r:xdm_t:s0-s0:c0.c1023
system_u:system_r:chkpwd_t:s0-s0:c0.c1023                    system_u:system_r:policykit_t:s0                             unconfined_u:system_r:rpm_t:s0-s0:c0.c1023
system_u:system_r:chronyd_t:s0                               system_u:system_r:rtkit_daemon_t:s0                          unconfined_u:system_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:crond_t:s0-s0:c0.c1023                     system_u:system_r:syslogd_t:s0                               unconfined_u:system_r:useradd_t:s0-s0:c0.c1023
system_u:system_r:devicekit_disk_t:s0                        system_u:system_r:system_cronjob_t:s0-s0:c0.c1023            unconfined_u:unconfined_r:unconfined_dbusd_t:s0-s0:c0.c1023
system_u:system_r:dhcpc_t:s0                                 system_u:system_r:system_dbusd_t:s0-s0:c0.c1023              unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
system_u:system_r:dnsmasq_t:s0-s0:c0.c1023                   system_u:system_r:systemd_logind_t:s0
system_u:system_r:init_t:s0                                  system_u:system_r:systemd_tmpfiles_t:s0

Ah! Right! We wanted to see everything logged under PolicyKit's security label:

$ journalctl _SELINUX_CONTEXT=system_u:system_r:policykit_t:s0

Wow! That was easy! I didn't know anything related to SELinux could be thaaat easy! ;-) Of course this kind of completion works with any field, not just SELinux labels.

So much for now. There's a lot more cool stuff in journalctl(1) than this. For example, it generates JSON output for you! You can match against kernel fields! You can get simple /var/log/messages-like output but with relative timestamps! And so much more!

Anyway, in the next weeks I hope to post more stories about all the cool things the journal can do for you. This is just the beginning, stay tuned.


[1] systemd 195 is currently still in Bodhi but hopefully will get into F18 proper soon, and definitely before the release of Fedora 18.

[2] OK, I cheated here, indexing by block device is not in the kernel yet, but on its way due to Hannes' fantastic work, and I hope it will make an appearance in F18.

posted at: 00:16 | path: /projects | permanent link to this entry | comments

Sat, 13 Oct 2012

systemd for Administrators, Part XVI

And, yes, here's now the sixteenth installment of my ongoing series on systemd for Administrators:

Gettys on Serial Consoles (and Elsewhere)

TL;DR: To make use of a serial console, just use console=ttyS0 on the kernel command line, and systemd will automatically start a getty on it for you.

While physical RS232 serial ports have become exotic in today's PCs they play an important role in modern servers and embedded hardware. They provide a relatively robust and minimalistic way to access the console of your device, that works even when the network is hosed, or the primary UI is unresponsive. VMs frequently emulate a serial port as well.

Of course, Linux has always had good support for serial consoles, but with systemd we tried to make serial console support even simpler to use. In the following text I'll try to give an overview of how serial console gettys on systemd work, and how TTYs of any kind are handled.

Let's start with the key take-away: in most cases, to get a login prompt on your serial console you don't need to do anything. systemd checks the kernel configuration for the selected kernel console and will simply spawn a serial getty on it. That way it is entirely sufficient to configure your kernel console properly (for example, by adding console=ttyS0 to the kernel command line) and that's it. But let's have a look at the details:

In systemd, two template units are responsible for bringing up a login prompt on text consoles:

  1. getty@.service is responsible for virtual terminal (VT) login prompts, i.e. those on your VGA screen as exposed in /dev/tty1 and similar devices.
  2. serial-getty@.service is responsible for all other terminals, including serial ports such as /dev/ttyS0. It differs in a couple of ways from getty@.service: among other things the $TERM environment variable is set to vt102 (hopefully a good default for most serial terminals) rather than linux (which is the right choice for VTs only), and a special logic that clears the VT scrollback buffer (and only works on VTs) is skipped.

Virtual Terminals

Let's have a closer look at how getty@.service is started, i.e. how login prompts on the virtual terminals (i.e. non-serial TTYs) work. Traditionally, the init system on Linux machines was configured to spawn a fixed number of login prompts at boot. In most cases six instances of the getty program were spawned, on the first six VTs, tty1 to tty6.

In a systemd world we made this more dynamic: in order to make things more efficient login prompts are now started on demand only. As you switch to the VTs the getty service is instantiated to getty@tty2.service, getty@tty5.service and so on. Since we don't have to unconditionally start the getty processes anymore this allows us to save a bit of resources, and makes start-up a bit faster. This behaviour is mostly transparent to the user: if the user activates a VT the getty is started right away, so that the user will hardly notice that it wasn't running all the time. If he then logs in and types ps he'll notice however that getty instances are only running for the VTs he has switched to so far.

By default this automatic spawning is done for the VTs up to VT6 only (in order to be close to the traditional default configuration of Linux systems)[1]. Note that the auto-spawning of gettys is only attempted if no other subsystem took possession of the VTs yet. More specifically, if a user makes frequent use of fast user switching via GNOME he'll get his X sessions on the first six VTs, too, since the lowest available VT is allocated for each session.

Two VTs are handled specially by the auto-spawning logic. Firstly, tty1 gets special treatment: if we boot into graphical mode the display manager takes possession of this VT. If we boot into multi-user (text) mode a getty is started on it -- unconditionally, without any on-demand logic[2].

Secondly, tty6 is especially reserved for auto-spawned gettys and unavailable to other subsystems such as X[3]. This is done in order to ensure that there's always a way to get a text login, even if due to fast user switching X took possession of more than 5 VTs.
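To make the rules above a bit more concrete, here is a minimal Python sketch of the decision they describe. It is only a model of the prose, not logind's actual code: the function name, its parameters and the idea of passing in a set of busy VTs are assumptions made for illustration, and the unconditional getty on tty1 in text mode is not modelled.

```python
def should_autospawn_getty(vt, busy_vts, n_auto_vts=6, reserve_vt=6):
    """Decide whether switching to `vt` should start getty@tty<vt>.service.

    vt         -- number of the VT the user just switched to (1-based)
    busy_vts   -- set of VT numbers already taken by something else (e.g. X)
    n_auto_vts -- highest VT that gets auto-spawned gettys (NAutoVTs=)
    reserve_vt -- VT kept free for a text login (ReserveVT=)
    """
    if vt < 1:
        return False
    if vt == reserve_vt:
        # Reserved for gettys; other subsystems are not supposed to take it.
        return True
    if vt > n_auto_vts:
        # Beyond the auto-spawn range nothing is started on demand.
        return False
    # Inside the range: only spawn if nobody else owns the VT yet.
    return vt not in busy_vts


# X sessions occupy VT1 to VT5 after a lot of fast user switching:
busy = {1, 2, 3, 4, 5}
print(should_autospawn_getty(3, busy))  # VT3 is taken by X
print(should_autospawn_getty(6, busy))  # the reserved VT still gets a getty
```

The two defaults mirror the values mentioned in the text; changing them corresponds to editing NAutoVTs= and ReserveVT= as described in the footnotes.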

Serial Terminals

Handling of login prompts on serial terminals (and all other kinds of non-VT terminals) is different from that of VTs. By default systemd will instantiate one serial-getty@.service on the main kernel[4] console, if it is not a virtual terminal. The kernel console is where the kernel outputs its own log messages and is usually configured on the kernel command line in the boot loader via an argument such as console=ttyS0[5]. This logic ensures that when the user asks the kernel to redirect its output onto a certain serial terminal, he will automatically also get a login prompt on it as the boot completes[6]. systemd will also spawn a login prompt on the first special VM console (that's /dev/hvc0, /dev/xvc0, /dev/hvsi0), if the system is run in a VM that provides these devices. This logic is implemented in a generator called systemd-getty-generator that is run early at boot and pulls in the necessary services depending on the execution environment.
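As a rough illustration of that selection logic, the following Python sketch works out which serial getty unit a given kernel command line would lead to. It is an assumption-laden model, not the generator's code: it looks at the command line string instead of /sys/class/tty/console/active, it takes the last console= argument as the main console (as footnote [4] describes), it treats "tty" followed by digits as a VT, and it ignores the VM consoles. The function names are made up for the example.

```python
def main_console(cmdline):
    """Return (device, options) of the last console= argument, or None."""
    result = None
    for arg in cmdline.split():
        if arg.startswith("console="):
            device, _, options = arg[len("console="):].partition(",")
            result = (device, options)
    return result


def is_virtual_terminal(device):
    """VTs are named tty0, tty1, ...; ttyS0 and friends are not VTs."""
    return device.startswith("tty") and device[3:].isdigit()


def serial_getty_unit(cmdline):
    """Name the serial-getty@ instance for the main console, if any."""
    console = main_console(cmdline)
    if console is None:
        return None
    device, _options = console
    if is_virtual_terminal(device):
        # VT login prompts are the business of getty@.service.
        return None
    return "serial-getty@%s.service" % device


print(serial_getty_unit("root=/dev/sda1 console=tty0 console=ttyS0,115200n8"))
```

With that command line the main console is ttyS0, so the sketch names serial-getty@ttyS0.service. Swap the two console= arguments and it returns None, since the main console is then a VT.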

In many cases, this automatic logic should already suffice to get you a login prompt when you need one, without any specific configuration of systemd. However, sometimes there's the need to manually configure a serial getty, for example, if more than one serial login prompt is needed or the kernel console should be redirected to a different terminal than the login prompt. To facilitate this it is sufficient to instantiate serial-getty@.service once for each serial port you want it to run on[7]:

# systemctl enable serial-getty@ttyS2.service
# systemctl start serial-getty@ttyS2.service

And that's it. This will make sure you get the login prompt on the chosen port on all subsequent boots, and starts it right away, too.

Sometimes, there's the need to configure the login prompt in even more detail. For example, if the default baud rate configured by the kernel is not correct or other agetty parameters need to be changed. In such a case simply copy the default unit template to /etc/systemd/system and edit it there:

# cp /usr/lib/systemd/system/serial-getty@.service /etc/systemd/system/serial-getty@ttyS2.service
# vi /etc/systemd/system/serial-getty@ttyS2.service
 .... now make your changes to the agetty command line ...
# ln -s /etc/systemd/system/serial-getty@ttyS2.service /etc/systemd/system/getty.target.wants/
# systemctl daemon-reload
# systemctl start serial-getty@ttyS2.service

This creates a unit file that is specific to serial port ttyS2, so that you can make specific changes to this port and this port only.

And this is pretty much all there is to say about serial ports, VTs and login prompts on them. I hope this was interesting, and please come back soon for the next installment of this series!


[1] You can easily modify this by changing NAutoVTs= in logind.conf.

[2] Note that whether the getty on VT1 is started on-demand or not hardly makes a difference, since VT1 is the default active VT anyway, so the demand is there anyway at boot.

[3] You can easily change this special reserved VT by modifying ReserveVT= in logind.conf.

[4] If multiple kernel consoles are used simultaneously, the main console is the one listed first in /sys/class/tty/console/active, which is the last one listed on the kernel command line.

[5] See kernel-parameters.txt for more information on this kernel command line option.

[6] Note that agetty -s is used here so that the baud rate configured at the kernel command line is not altered and continues to be used by the login prompt.

[7] Note that this systemctl enable syntax only works with systemd 188 and newer (i.e. F18). On older versions use ln -s /usr/lib/systemd/system/serial-getty@.service /etc/systemd/system/getty.target.wants/serial-getty@ttyS2.service ; systemctl daemon-reload instead.

posted at: 02:56 | path: /projects | permanent link to this entry | comments

Mon, 06 Aug 2012

Berlin Open Source Meetup


Chris Kühl and I are organizing a Berlin Open Source Meetup on Aug 19th at the Prater Biergarten in Prenzlauer Berg. If you live in Berlin (or are passing by) and are involved in or interested in Open Source then you are invited!

There's also a Google+ event for the meetup.

It's a public event, so everybody is welcome, and please feel free to invite others!

See you at the Prater!

posted at: 14:59 | path: /projects | permanent link to this entry | comments

Sun, 08 Jul 2012

foss.in 2012 CFP Ends in a Few Hours

foss.in 2012 in Bangalore takes place again after a hiatus of some years. It has always been a fantastic conference, and a great opportunity to visit Bangalore and India. I just submitted my talk proposals, so, hurry up, and submit yours!

posted at: 15:47 | path: /projects | permanent link to this entry | comments

Thu, 28 Jun 2012

systemd for Administrators, Part XV

Quickly following the previous iteration, here's now the fifteenth installment of my ongoing series on systemd for Administrators:


There are three big target audiences we try to cover with systemd: the embedded/mobile folks, the desktop people and the server folks. While the systems used by embedded/mobile tend to be underpowered and have few resources available, desktops tend to be much more powerful machines -- but still much less resourceful than servers. Nonetheless there are surprisingly many features that matter to both extremes of this axis (embedded and servers), but not the center (desktops). One of them is support for watchdogs in hardware and software.

Embedded devices frequently rely on watchdog hardware that resets the device automatically if software stops responding (more specifically, stops signalling the hardware in fixed intervals that it is still alive). This is required to increase reliability and make sure that regardless of what happens the best is attempted to get the system working again. Functionality like this makes little sense on the desktop[1]. However, on high-availability servers watchdogs are frequently used, again.

Starting with version 183 systemd provides full support for hardware watchdogs (as exposed in /dev/watchdog to userspace), as well as supervisor (software) watchdog support for individual system services. The basic idea is the following: if enabled, systemd will regularly ping the watchdog hardware. If systemd or the kernel hang this ping will not happen anymore and the hardware will automatically reset the system. This way systemd and the kernel are protected from boundless hangs -- by the hardware. To make the chain complete, systemd then exposes a software watchdog interface for individual services so that they can also be restarted (or some other action taken) if they begin to hang. This software watchdog logic can be configured individually for each service in the ping frequency and the action to take. Putting both parts together (i.e. hardware watchdogs supervising systemd and the kernel, as well as systemd supervising all other services) we have a reliable way to watchdog every single component of the system.

To make use of the hardware watchdog it is sufficient to set the RuntimeWatchdogSec= option in /etc/systemd/system.conf. It defaults to 0 (i.e. no hardware watchdog use). Set it to a value like 20s and the watchdog is enabled. After 20s of no keep-alive pings the hardware will reset itself. Note that systemd will send a ping to the hardware at half the specified interval, i.e. every 10s. And that's already all there is to it. By enabling this single, simple option you have turned on supervision by the hardware of systemd and the kernel beneath it.[2]
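The half-interval rule is easy to picture with a few lines of Python. This is a sketch only: the function name and its parameters are made up for illustration, it assumes the classic Linux watchdog device behaviour where any write to the device counts as a keep-alive, and it says nothing about how systemd itself talks to the device. The demo deliberately runs against an in-memory buffer; pointing something like this at the real /dev/watchdog arms the hardware, and the machine will reset once the pings stop.

```python
import io
import time


def ping_watchdog(device, timeout_sec, pings, sleep=time.sleep):
    """Send `pings` keep-alives to `device`, one every timeout_sec / 2.

    device      -- writable binary file object standing in for the watchdog
    timeout_sec -- the configured watchdog timeout (e.g. 20 for "20s")
    pings       -- how many keep-alives to send before returning
    sleep       -- injectable sleep function, so the demo does not wait
    """
    interval = timeout_sec / 2
    for _ in range(pings):
        device.write(b"\0")  # assumed: any write resets the hardware timer
        device.flush()
        sleep(interval)
    return interval


# Demo against an in-memory stand-in rather than the real device.
fake_device = io.BytesIO()
interval = ping_watchdog(fake_device, timeout_sec=20, pings=3,
                         sleep=lambda _seconds: None)
print("ping interval: %.0fs" % interval)
```

With a 20s timeout the loop pings every 10s, which leaves room for one late ping before the hardware gives up on us.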

Note that the hardware watchdog device (/dev/watchdog) is single-user only. That means that you can either enable this functionality in systemd, or use a separate external watchdog daemon, such as the aptly named watchdog.

ShutdownWatchdogSec= is another option that can be configured in /etc/systemd/system.conf. It controls the watchdog interval to use during reboots. It defaults to 10min, and adds extra reliability to the system reboot logic: if a clean reboot is not possible and shutdown hangs, we rely on the watchdog hardware to reset the system abruptly, as extra safety net.

So much about the hardware watchdog logic. These two options are really everything that is necessary to make use of the hardware watchdogs. Now, let's have a look at how to add watchdog logic to individual services.

First of all, to make software watchdog-supervisable it needs to be patched to send out "I am alive" signals in regular intervals in its event loop. Patching this is relatively easy. First, a daemon needs to read the WATCHDOG_USEC= environment variable. If it is set, it will contain the watchdog interval in usec formatted as an ASCII text string, as it is configured for the service. The daemon should then issue sd_notify(0, "WATCHDOG=1") calls every half of that interval. A daemon patched this way should transparently support watchdog functionality by checking whether the environment variable is set and honouring the value it is set to.
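For daemons not written in C, the same steps can be sketched in Python without linking against libsystemd. The sketch below assumes the usual notification mechanism behind sd_notify(), i.e. a datagram sent to the AF_UNIX socket named in $NOTIFY_SOCKET; the helper names are made up for this example, and error handling is kept to the bare minimum.

```python
import os
import socket


def watchdog_ping_interval(environ=os.environ):
    """Return the seconds to wait between pings, or None if unsupervised."""
    raw = environ.get("WATCHDOG_USEC")
    if not raw:
        return None
    try:
        usec = int(raw)
    except ValueError:
        return None
    if usec <= 0:
        return None
    # Ping at half the configured interval, converted from usec to seconds.
    return usec / 2 / 1_000_000


def notify(state, environ=os.environ):
    """Send one notification string; return False if nobody is listening."""
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a socket in the abstract namespace.
        address = b"\0" + path[1:].encode()
    else:
        address = path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), address)
    return True


# In the daemon's event loop (sketch):
#     interval = watchdog_ping_interval()
#     if interval is not None: call notify("WATCHDOG=1") every `interval` s
print(watchdog_ping_interval({"WATCHDOG_USEC": "30000000"}))
```

Both helpers do nothing when the respective variable is unset, so a daemon patched this way keeps working unchanged when it is started outside of systemd or without WatchdogSec=.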

To enable the software watchdog logic for a service (which has been patched to support the logic pointed out above) it is sufficient to set WatchdogSec= to the desired failure latency. See systemd.service(5) for details on this setting. This causes WATCHDOG_USEC= to be set for the service's processes and will cause the service to enter a failure state as soon as no keep-alive ping is received within the configured interval.

If a service enters a failure state as soon as the watchdog logic detects a hang, then this is hardly sufficient to build a reliable system. The next step is to configure whether the service shall be restarted and how often, and what to do if it then still fails. To enable automatic service restarts on failure set Restart=on-failure for the service. To configure how many times a service shall be attempted to be restarted use the combination of StartLimitBurst= and StartLimitInterval= which allow you to configure how often a service may restart within a time interval. If that limit is reached, a special action can be taken. This action is configured with StartLimitAction=. The default is none, i.e. that no further action is taken and the service simply remains in the failure state without any further attempted restarts. The other three possible values are reboot, reboot-force and reboot-immediate. reboot attempts a clean reboot, going through the usual, clean shutdown logic. reboot-force is more abrupt: it will not actually try to cleanly shut down any services, but immediately kills all remaining services and unmounts all file systems and then forcibly reboots (this way all file systems will be clean but reboot will still be very fast). Finally, reboot-immediate does not attempt to kill any process or unmount any file systems. Instead it just hard reboots the machine without delay. reboot-immediate hence comes closest to a reboot triggered by a hardware watchdog. All these settings are documented in systemd.service(5).
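The interplay of StartLimitBurst= and StartLimitInterval= is perhaps easiest to grasp as a small counting exercise. The Python sketch below is a simplified model of such a rate limit, assuming a window that opens with the first start and allows at most "burst" starts until "interval" seconds have passed. It is not taken from systemd's sources, the function name is hypothetical, and details such as whether the very first start counts are glossed over.

```python
def first_start_over_limit(start_times, burst, interval):
    """Return the index of the first start that hits the limit, else None.

    start_times -- ascending timestamps (in seconds) of start attempts
    burst       -- starts allowed per window (think StartLimitBurst=)
    interval    -- window length in seconds (think StartLimitInterval=)
    """
    window_begin = None
    count = 0
    for index, now in enumerate(start_times):
        if window_begin is None or now > window_begin + interval:
            # The old window has expired (or never existed): open a new one.
            window_begin = now
            count = 1
        elif count < burst:
            count += 1
        else:
            # Too many starts within one window: StartLimitAction= applies.
            return index
    return None


# Five starts within 40s, with 4 allowed per 5min: the fifth one trips.
print(first_start_over_limit([0, 10, 20, 30, 40], burst=4, interval=300))
# The same number of starts spread out over a longer time is fine.
print(first_start_over_limit([0, 100, 200, 301, 400], burst=4, interval=300))
```

In the first call the fifth start (index 4) exceeds the limit; in the second the window has expired by the fourth start, so counting begins anew and nothing trips.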

Putting this all together we now have pretty flexible options to watchdog-supervise a specific service and configure automatic restarts of the service if it hangs, plus take ultimate action if that doesn't help.

Here's an example unit file:

Description=My Little Daemon


This service will automatically be restarted if it hasn't pinged the system manager for longer than 30s or if it fails otherwise. If it is restarted this way more often than 4 times in 5min action is taken and the system quickly rebooted, with all file systems being clean when it comes up again.
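The listing above seems to have lost everything but its Description= line, so here is a reconstruction sketch of what a unit with the behaviour just described could look like, rendered with Python's configparser. The setting names and values follow the description in the text (30s watchdog, restart on failure, 4 starts in 5min, then a forced reboot); the ExecStart= path is a hypothetical placeholder, as is the function name.

```python
import configparser
import io


def render_unit():
    """Render a service unit matching the behaviour described in the text."""
    unit = configparser.ConfigParser(interpolation=None)
    unit.optionxform = str  # keep the CamelCase setting names as written
    unit["Unit"] = {"Description": "My Little Daemon"}
    unit["Service"] = {
        "ExecStart": "/usr/bin/mylittled",  # hypothetical daemon binary
        "WatchdogSec": "30s",               # fail after 30s without a ping
        "Restart": "on-failure",            # restart on watchdog or other failure
        "StartLimitInterval": "5min",       # rate limit window ...
        "StartLimitBurst": "4",             # ... and starts allowed within it
        "StartLimitAction": "reboot-force",  # fast reboot, file systems clean
    }
    buffer = io.StringIO()
    unit.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


print(render_unit())
```

The printed text is in unit file syntax and could be saved under /etc/systemd/system with a name of your choosing once ExecStart= points at a real, watchdog-aware daemon.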

And that's already all I wanted to tell you about! With hardware watchdog support right in PID 1, as well as supervisor watchdog support for individual services we should provide everything you need for most watchdog usecases. Regardless if you are building an embedded or mobile appliance, or if you are working with high-availability servers, please give this a try!

(Oh, and if you wonder why in heaven PID 1 needs to deal with /dev/watchdog, and why this shouldn't be kept in a separate daemon, then please read this again and try to understand that this is all about the supervisor chain we are building here, where the hardware watchdog supervises systemd, and systemd supervises the individual services. Also, we believe that a service not responding should be treated in a similar way as any other service error. Finally, pinging /dev/watchdog is one of the most trivial operations in the OS (basically little more than an ioctl() call), so the support for this is not more than a handful of lines of code. Maintaining this externally with complex IPC between PID 1 (and the daemons) and this watchdog daemon would be drastically more complex, error-prone and resource intensive.)

Note that the built-in hardware watchdog support of systemd does not conflict with other watchdog software by default. systemd does not make use of /dev/watchdog by default, and you are welcome to use external watchdog daemons in conjunction with systemd, if this better suits your needs.

And one last thing: if you wonder whether your hardware has a watchdog, then the answer is: almost definitely yes -- if it is anything more recent than a few years. If you want to verify this, try the wdctl tool from recent util-linux, which shows you everything you need to know about your watchdog hardware.

I'd like to thank the great folks from Pengutronix for contributing most of the watchdog logic. Thank you!


[1] Though actually most desktops tend to include watchdog hardware these days too, as this is cheap to build and available in most modern PC chipsets.

[2] So, here's a free tip for you if you hack on the core OS: don't enable this feature while you hack. Otherwise your system might suddenly reboot if you are in the middle of tracing through PID 1 with gdb and cause it to be stopped for a moment, so that no hardware ping can be done...

posted at: 00:07 | path: /projects | permanent link to this entry | comments

Wed, 27 Jun 2012

systemd for Administrators, Part XIV

And here's the fourteenth installment of my ongoing series on systemd for Administrators:

The Self-Explanatory Boot

One complaint we often hear about systemd is that its boot process is hard to understand, even incomprehensible. In general I can only disagree with this sentiment, I even believe in quite the opposite: in comparison to what we had before -- where to even remotely understand what was going on you had to have a decent comprehension of the programming language that is Bourne Shell[1] -- understanding systemd's boot process is substantially easier. However, as with many complaints there is some truth in this frequently heard discomfort: for a seasoned Unix administrator there indeed is a bit of learning to do when the switch to systemd is made. And as systemd developers it is our duty to make the learning curve shallow, introduce as few surprises as we can, and provide good documentation where that is not possible.

systemd has always had a huge body of documentation as manual pages (nearly 100 individual pages now!), in the Wiki and the various blog stories I posted. However, any amount of documentation alone is not enough to make software easily understood. In fact, thick manuals sometimes appear intimidating and make the reader wonder where to start reading, if all he was interested in was this one simple concept of the whole system.

Acknowledging all this we have now added a new, neat, little feature to systemd: the self-explanatory boot process. What do we mean by that? Simply that each and every single component of our boot comes with documentation and that this documentation is closely linked to its component, so that it is easy to find.

More specifically, all units in systemd (which are what encapsulate the components of the boot) now include references to their documentation, the documentation of their configuration files and further applicable manuals. A user who is trying to understand the purpose of a unit, how it fits into the boot process and how to configure it can now easily look up this documentation with the well-known systemctl status command. Here's an example of how this looks for systemd-logind.service:

$ systemctl status systemd-logind.service
systemd-logind.service - Login Service
	  Loaded: loaded (/usr/lib/systemd/system/systemd-logind.service; static)
	  Active: active (running) since Mon, 25 Jun 2012 22:39:24 +0200; 1 day and 18h ago
	    Docs: man:systemd-logind.service(7)
	Main PID: 562 (systemd-logind)
	  CGroup: name=systemd:/system/systemd-logind.service
		  └ 562 /usr/lib/systemd/systemd-logind

Jun 25 22:39:24 epsilon systemd-logind[562]: Watching system buttons on /dev/input/event2 (Power Button)
Jun 25 22:39:24 epsilon systemd-logind[562]: Watching system buttons on /dev/input/event6 (Video Bus)
Jun 25 22:39:24 epsilon systemd-logind[562]: Watching system buttons on /dev/input/event0 (Lid Switch)
Jun 25 22:39:24 epsilon systemd-logind[562]: Watching system buttons on /dev/input/event1 (Sleep Button)
Jun 25 22:39:24 epsilon systemd-logind[562]: Watching system buttons on /dev/input/event7 (ThinkPad Extra Buttons)
Jun 25 22:39:25 epsilon systemd-logind[562]: New session 1 of user gdm.
Jun 25 22:39:25 epsilon systemd-logind[562]: Linked /tmp/.X11-unix/X0 to /run/user/42/X11-display.
Jun 25 22:39:32 epsilon systemd-logind[562]: New session 2 of user lennart.
Jun 25 22:39:32 epsilon systemd-logind[562]: Linked /tmp/.X11-unix/X0 to /run/user/500/X11-display.
Jun 25 22:39:54 epsilon systemd-logind[562]: Removed session 1.

At first glance this output changed very little. If you look closer however you will find that it now includes one new field: Docs lists references to the documentation of this service. In this case there are two man page URIs and one web URL specified. The man pages describe the purpose and configuration of this service, the web URL includes an introduction to the basic concepts of this service.

If the user uses a recent graphical terminal implementation it is sufficient to click on the URIs shown to get the respective documentation[2]. In other words: it has never been this easy to figure out what a specific component of our boot is about: just use systemctl status to get more information about it and click on the links shown to find the documentation.

Over the past days I have written man pages and added these references for every single unit we ship with systemd. This means, with systemctl status you now have a very easy way to find out more about every single service of the core OS.

If you are not using a graphical terminal (where you can just click on URIs), a man page URI in the middle of the output of systemctl status is not the most useful thing to have. To make reading the referenced man pages easier we have also added a new command:

systemctl help systemd-logind.service

Which will open the listed man pages right away, without the need to click anything or copy/paste a URI.

The URIs are in the formats documented by the uri(7) man page. Units may reference http and https URLs, as well as man and info pages.
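For tools that want to consume these references themselves, telling the URI kinds apart takes only a few lines. The following Python sketch classifies the four kinds named above; the function name, the returned tuples and the exact pattern accepted for man page sections are assumptions of this example, not something systemd prescribes.

```python
import re

# man:<page>(<section>), e.g. man:systemd-logind.service(7)
_MAN_URI = re.compile(r"^man:([^()\s]+)\(([0-9][A-Za-z0-9]*)\)$")


def parse_doc_uri(uri):
    """Classify a documentation URI; return None if it is not recognized."""
    match = _MAN_URI.match(uri)
    if match:
        return ("man", match.group(1), match.group(2))
    if uri.startswith("info:") and len(uri) > len("info:"):
        return ("info", uri[len("info:"):])
    if uri.startswith(("http://", "https://")):
        return ("web", uri)
    return None


for uri in ("man:systemd-logind.service(7)", "https://example.org/intro"):
    print(parse_doc_uri(uri))
```

For the man page URI from the status output above this yields the page name and section separately, which is what one would hand to man(1) as "man 7 systemd-logind.service".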

Of course all this doesn't make everything self-explanatory, simply because the user still has to find out about systemctl status (and even systemctl in the first place so that he even knows what units there are); however with this basic knowledge further help on specific units is in very easy reach.

We hope that this kind of interlinking of runtime behaviour and the matching documentation is a big step forward to make our boot easier to understand.

This functionality is partially already available in Fedora 17, and will show up in complete form in Fedora 18.

That all said, credit where credit is due: this kind of reference to documentation within the service descriptions is not new, Solaris' SMF has had similar functionality for quite some time. However, we believe this new systemd feature is certainly a novelty on Linux, and with systemd we now offer you the best documented and best self-explaining init system.

Of course, if you are writing unit files for your own packages, please consider also including references to the documentation of your services and their configuration. This is really easy to do, just list the URIs in the new Documentation= field in the [Unit] section of your unit files. For details see systemd.unit(5). The more comprehensively we include links to documentation in our OS services the easier the work of administrators becomes. (To make sure Fedora makes comprehensive use of this functionality I filed a bug on FPC).

Oh, and BTW: if you are looking for a rough overview of systemd's boot process here's another new man page we recently added, which includes a pretty ASCII flow chart of the boot process and the units involved.


[1] Which TBH is a pretty crufty, strange one on top.

[2] Well, a terminal where this bug is fixed (used together with a help browser where this one is fixed).

posted at: 17:45 | path: /projects | permanent link to this entry | comments

Thu, 24 May 2012

Presentation in Warsaw

I recently had the chance to speak about systemd and other projects, as well as the politics behind them at a Bar Camp in Warsaw, organized by the fine people of OSEC. The presentation has been recorded, and has now been posted online. It's a very long recording (1:43h), but it's quite interesting (as I'd like to believe) and contains a bit of background on where we are coming from and where we are going. Anyway, please have a look. Enjoy!

I'd like to thank the organizers for this great event and for publishing the recording online.

posted at: 22:06 | path: /projects | permanent link to this entry | comments

Fri, 18 May 2012

systemd for Administrators, Part XIII

Here's the thirteenth installment of my ongoing series on systemd for Administrators:

Log and Service Status

This one is a short episode. One of the most commonly used commands on a systemd system is systemctl status which may be used to determine the status of a service (or other unit). It always has been a valuable tool to figure out the processes, runtime information and other meta data of a daemon running on the system.

With Fedora 17 we introduced the journal, our new logging scheme that provides structured, indexed and reliable logging on systemd systems, while providing a certain degree of compatibility with classic syslog implementations. The original reason we started to work on the journal was one specific feature idea, that to the outsider might appear simple but without the journal is difficult and inefficient to implement: along with the output of systemctl status we wanted to show the last 10 log messages of the daemon. Log data is some of the most essential bits of information we have on the status of a service. Hence it is an obvious choice to show next to the general status of the service.

And now to make it short: at the same time as we integrated the journal into systemd and Fedora we also hooked up systemctl with it. Here's an example output:

$ systemctl status avahi-daemon.service
avahi-daemon.service - Avahi mDNS/DNS-SD Stack
	  Loaded: loaded (/usr/lib/systemd/system/avahi-daemon.service; enabled)
	  Active: active (running) since Fri, 18 May 2012 12:27:37 +0200; 14s ago
	Main PID: 8216 (avahi-daemon)
	  Status: "avahi-daemon 0.6.30 starting up."
	  CGroup: name=systemd:/system/avahi-daemon.service
		  ├ 8216 avahi-daemon: running [omega.local]
		  └ 8217 avahi-daemon: chroot helper

May 18 12:27:37 omega avahi-daemon[8216]: Joining mDNS multicast group on interface eth1.IPv4 with address
May 18 12:27:37 omega avahi-daemon[8216]: New relevant interface eth1.IPv4 for mDNS.
May 18 12:27:37 omega avahi-daemon[8216]: Network interface enumeration completed.
May 18 12:27:37 omega avahi-daemon[8216]: Registering new address record for on virbr0.IPv4.
May 18 12:27:37 omega avahi-daemon[8216]: Registering new address record for fd00::e269:95ff:fe87:e282 on eth1.*.
May 18 12:27:37 omega avahi-daemon[8216]: Registering new address record for on eth1.IPv4.
May 18 12:27:37 omega avahi-daemon[8216]: Registering HINFO record with values 'X86_64'/'LINUX'.
May 18 12:27:38 omega avahi-daemon[8216]: Server startup complete. Host name is omega.local. Local service cookie is 3555095952.
May 18 12:27:38 omega avahi-daemon[8216]: Service "omega" (/services/ssh.service) successfully established.
May 18 12:27:38 omega avahi-daemon[8216]: Service "omega" (/services/sftp-ssh.service) successfully established.

This, of course, shows the status of everybody's favourite mDNS/DNS-SD daemon with a list of its processes, along with -- as promised -- the 10 most recent log lines. Mission accomplished!

There are a couple of switches available to alter the output slightly and adjust it to your needs. The two most interesting switches are -f to enable follow mode (as in tail -f) and -n to change the number of lines to show (you guessed it, as in tail -n).

The log data shown comes from three sources: everything any of the daemon's processes logged with libc's syslog() call, everything submitted using the native Journal API, plus everything any of the daemon's processes logged to STDOUT or STDERR. In short: everything the daemon generates as log data is collected, properly interleaved and shown in the same format.

And that's it already for today. It's a very simple feature, but an immensely useful one for every administrator. One of the kind "Why didn't we already do this 15 years ago?".

Stay tuned for the next installment!

posted at: 12:37 | path: /projects | permanent link to this entry | comments

Thu, 03 May 2012

Boot & Base OS Miniconf at Linux Plumbers Conference 2012, San Diego

We are working on putting together a miniconf on the topic of Boot & Base OS for the Linux Plumbers Conference 2012 in San Diego (Aug 29-31). And we need your submission!

Are you working on some exciting project related to Boot and Base OS and would like to present your work? Then please submit something following these guidelines, but please CC Kay Sievers and Lennart Poettering.

I hope that at this point the Linux Plumbers Conference needs little introduction, so I will spare any further prose on how great and useful and the best conference ever it is for everybody who works on the plumbing layer of Linux. However, there's one conference that will be co-located with LPC that is still little known, because it happens for the first time: The C Conference, organized by Brandon Philips and friends. It covers all things C, and they are still looking for more topics, in a reverse CFP. Please consider submitting a proposal and registering to the conference!

posted at: 20:42 | path: /projects | permanent link to this entry | comments

Tue, 01 May 2012

The Most Awesome, Least-Advertised Fedora 17 Feature

There's one feature in the upcoming Fedora 17 release that is immensely useful but very little known, since its feature page 'ckremoval' does not explicitly refer to it in its name: true automatic multi-seat support for Linux.

A multi-seat computer is a system that offers not only one local seat for a user, but multiple, at the same time. A seat refers to a combination of a screen, a set of input devices (such as mice and keyboards), and maybe an audio card or webcam, as individual local workplace for a user. A multi-seat computer can drive an entire class room of seats with only a fraction of the cost in hardware, energy, administration and space: you only have one PC, which usually has way enough CPU power to drive 10 or more workplaces. (In fact, even a Netbook is fast enough to drive a couple of seats!) Automatic multi-seat refers to an entirely automatically managed seat setup: whenever a new seat is plugged in a new login screen immediately appears -- without any manual configuration --, and when the seat is unplugged all user sessions on it are removed without delay.

In Fedora 17 we added this functionality to the low-level user and device tracking of systemd, replacing the previous ConsoleKit logic that lacked support for automatic multi-seat. With all the ground work done in systemd, udev and the other components of our plumbing layer the last remaining bits were surprisingly easy to add.

Currently, the automatic multi-seat logic works best with the USB multi-seat hardware from Plugable you can buy cheaply on Amazon (US). These devices require exactly zero configuration with the new scheme implemented in Fedora 17: just plug them in at any time, login screens pop up on them, and you have your additional seats. Alternatively you can also assemble your seat manually with a few easy loginctl attach commands, from any kind of hardware you might have lying around. To get a full seat you need multiple graphics cards, keyboards and mice: one set for each seat. (Later on we'll probably have a graphical setup utility for additional seats, but that's not a pressing issue we believe, as the plug-n-play multi-seat support with the Plugable devices is so awesomely nice.)

Plugable provided us with hardware for testing multi-seat, free of charge. They are also involved with the upstream development of the USB DisplayLink driver for Linux. Due to their positive involvement with Linux we can only recommend buying their hardware. They are good guys, and support Free Software the way all hardware vendors should! (And besides that, their hardware is also nicely put together. For example, in contrast to most similar vendors they actually assign proper vendor/product IDs to their USB hardware so that we can easily recognize their hardware when plugged in to set up automatic seats.)

Currently, all this magic is only implemented in the GNOME stack with the biggest component getting updated being the GNOME Display Manager. On the Plugable USB hardware you get a full GNOME Shell session with all the usual graphical gimmicks, the same way as on any other hardware. (Yes, GNOME 3 works perfectly fine on simpler graphics cards such as these USB devices!) If you are hacking on a different desktop environment, or on a different display manager, please have a look at the multi-seat documentation we put together, and particularly at our short piece about writing display managers which are multi-seat capable.

If you work on a major desktop environment or display manager and would like to implement multi-seat support for it, but lack the aforementioned Plugable hardware, we might be able to provide you with the hardware for free. Please contact us directly, and we might be able to send you a device. Note that we don't have unlimited devices available, hence we'll probably not be able to pass hardware to everybody who asks, and we will pass the hardware preferably to people who work on well-known software or otherwise have contributed good code to the community already. Anyway, if in doubt, ping us, and explain to us why you should get the hardware, and we'll consider you! (Oh, and this not only applies to display managers, if you hack on some other software where multi-seat awareness would be truly useful, then don't hesitate and ping us!)

Phoronix has this story about this new multi-seat support which is quite interesting and full of pictures. Please have a look.

Plugable started a Pledge drive to lower the price of the Plugable USB multi-seat terminals further. It's full of pictures (and a video showing all this in action!), and uses the code we now make available in Fedora 17 as base. Please consider pledging a few bucks.

Recently David Zeuthen added multi-seat support to udisks as well. With this in place, a user logged in on a specific seat can only see the USB storage plugged into his individual seat, but does not see any USB storage plugged into any other local seat. With this in place we closed the last missing bit of multi-seat support in our desktop stack.
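The per-seat filtering described here can be pictured with a small Python sketch. It assumes that each device carries an ID_SEAT udev property naming its seat and that devices without one belong to seat0, which is how the systemd multi-seat documentation describes seat assignment. The device records and the function name are hypothetical, and this is not the actual udisks code.

```python
DEFAULT_SEAT = "seat0"  # assumption: devices without ID_SEAT belong to seat0


def devices_for_seat(devices, seat):
    """Return the devices that a session on the given seat should see."""
    return [d for d in devices if d.get("ID_SEAT", DEFAULT_SEAT) == seat]


# Hypothetical device records, as one might collect them from udev properties.
devices = [
    {"name": "usb-stick-a", "ID_SEAT": "seat1"},
    {"name": "usb-stick-b"},  # no ID_SEAT, so it falls back to seat0
    {"name": "usb-stick-c", "ID_SEAT": "seat1"},
]

print([d["name"] for d in devices_for_seat(devices, "seat1")])
```

A user logged in on seat1 would be offered only the two sticks tagged for that seat, while the untagged stick stays with seat0.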

With this code in Fedora 17 we cover the big use cases of multi-seat already: internet cafes, class rooms and similar installations can provide PC workplaces cheaply and easily without any manual configuration. Later on we want to build on this and make this useful for different uses too: for example, the ability to get a login screen as easily as plugging in a USB connector makes this useful not only for saving money in setups for many people, but also in embedded environments (consider monitoring/debugging screens made available via this hotplug logic) or servers (get trivially quick local access to your otherwise head-less server). To be truly useful in these areas we need one more thing though: the ability to run a simple getty (i.e. text login) on the seat, without necessarily involving a graphical UI.

The well-known X successor Wayland already comes out of the box with multi-seat support based on this logic.

Oh, and BTW, as Ubuntu appears to be "focussing" on "clarity" in the "cloud" now ;-), and chose Upstart instead of systemd, this feature won't be available in Ubuntu any time soon. That's (one detail of) the price Ubuntu has to pay for choosing to maintain its own (largely legacy, such as ConsoleKit) plumbing stack.

Multi-seat has a long history on Unix. Since the earliest days Unix systems could be accessed by multiple local terminals at the same time. Since then local terminal support (and hence multi-seat) gradually moved out of view in computing. Very few machines these days have more than one seat, the concept of terminals survived almost exclusively in the context of PTYs (i.e. fully virtualized API objects, disconnected from any real hardware seat) and VCs (i.e. a single virtualized local seat), but almost not in any other way (well, server setups still use serial terminals for emergency remote access, but they almost never have more than one serial terminal). Everything we do in systemd is based on the ideas originally brought forward in Unix; with systemd we now try to bring back a number of the good ideas of Unix that since the old times were lost on the roadside. For example, in true Unix style we already started to expose the concept of a service in the file system (in /sys/fs/cgroup/systemd/system/), something where on Linux the (often misunderstood) "everything is a file" mantra previously fell short. With automatic multi-seat support we bring back support for terminals, but updated with all the features of today's desktops: plug and play, zero configuration, full graphics, and not limited to input devices and screens, but extending to all kinds of devices, such as audio, webcams or USB memory sticks.

Anyway, this is all for now; I'd like to thank everybody who was involved with making multi-seat work so nicely and natively on the Linux platform. You know who you are! Thanks a ton!

posted at: 23:07 | path: /projects | permanent link to this entry | comments

Sat, 21 Apr 2012

systemd Status Update

It has been way too long since my last status update on systemd. Here's another short, non-comprehensive status update on what we worked on for systemd since then.

We have been working hard to turn systemd into the most viable set of components to build operating systems, appliances and devices from, and make it the best choice for servers, for desktops and for embedded environments alike. I think we have a really convincing set of features now, but we are actively working on making it even better.

Here's a list of some more and some less interesting features, in no particular order:

  1. We added an automatic pager to systemctl (and related tools), similar to how git has it.
  2. systemctl learnt a new switch --failed, to show only failed services.
  3. You may now start services immediately, overriding all dependency logic by passing --ignore-dependencies to systemctl. This is mostly a debugging tool and nothing people should use in real life.
  4. Sending SIGKILL as final part of the implicit shutdown logic of services is now optional and may be configured with the SendSIGKILL= option individually for each service.
  5. We split off the Vala/Gtk tools into their own project, systemd-ui.
  6. systemd-tmpfiles learnt file globbing and creating FIFO special files as well as character and block device nodes, and symlinks. It also is capable of relabelling certain directories at boot now (in the SELinux sense).
  7. Immediately before shutting down we will now invoke all binaries found in /lib/systemd/system-shutdown/, which is useful for debugging late shutdown.
  8. You may now globally control where STDOUT/STDERR of services goes (unless individual service configuration overrides it).
  9. There's a new ConditionVirtualization= option, that makes systemd skip a specific service if a certain virtualization technology is found or not found. Similarly, we now have a new option to detect whether a certain security technology (such as SELinux) is available, called ConditionSecurity=. There's also ConditionCapability= to check whether a certain process capability is in the capability bounding set of the system. There are also the new ConditionFileIsExecutable=, ConditionPathIsMountPoint=, ConditionPathIsReadWrite= and ConditionPathIsSymbolicLink= options.
  10. The file system condition directives now support globbing.
  11. Service conditions may now be "triggering" and "mandatory", meaning that they can be a necessary requirement to hold for a service to start, or simply one trigger among many.
  12. At boot time we now print warnings if: /usr is on a split-off partition but not already mounted by an initrd; if /etc/mtab is not a symlink to /proc/mounts; CONFIG_CGROUPS is not enabled in the kernel. We'll also expose this as tainted flag on the bus.
  13. You may now boot the same OS image on a bare metal machine and in Linux namespace containers and will get a clean boot in both cases. This is more complicated than it sounds since device management with udev or write access to /sys, /proc/sys or things like /dev/kmsg is not available in a container. This makes systemd a first-class choice for managing thin container setups. This is all tested with systemd's own systemd-nspawn tool but should work fine in LXC setups, too. Basically this means that you do not have to adjust your OS manually to make it work in a container environment; it will just work out of the box. It also makes it easier to convert real systems into containers.
  14. We now automatically spawn gettys on HVC ttys when booting in VMs.
  15. We introduced /etc/machine-id as a generalization of D-Bus machine ID logic. See this blog story for more information. On stateless/read-only systems the machine ID is initialized randomly at boot. In virtualized environments it may be passed in from the machine manager (with qemu's -uuid switch, or via the container interface).
  16. All of the systemd-specific /etc/fstab mount options are now in the x-systemd-xyz format.
  17. To make it easy to find non-converted services we will now implicitly prefix all LSB and SysV init script descriptions with the strings "LSB:" resp. "SYSV:".
  18. We introduced /run and made it a hard dependency of systemd. This directory is now widely accepted and implemented on all relevant Linux distributions.
  19. systemctl can now execute all its operations remotely too (-H switch).
  20. We now ship systemd-nspawn, a really powerful tool that can be used to start containers for debugging, building and testing, much like chroot(1). It is useful to just get a shell inside a build tree, but is good enough to boot up a full system in it, too.
  21. If we query the user for a hard disk password at boot he may hit TAB to hide the asterisks we normally show for each key that is entered, for extra paranoia.
  22. We don't enable udev-settle.service anymore, which is only required for certain legacy software that still hasn't been updated to follow devices coming and going cleanly.
  23. We now include a tool that can plot boot speed graphs, similar to bootchartd, called systemd-analyze.
  24. At boot, we now initialize the kernel's binfmt_misc logic with the data from /etc/binfmt.d.
  25. systemctl now recognizes if it is run in a chroot() environment and will work accordingly (i.e. apply changes to the tree it is run in, instead of talking to the actual PID 1 for this). It also has a new --root= switch to work on an OS tree from outside of it.
  26. There's a new unit dependency type OnFailureIsolate= that allows entering a different target whenever a certain unit fails. For example, this is interesting to enter emergency mode if file system checks of crucial file systems failed.
  27. Socket units may now listen on Netlink sockets, special files from /proc and POSIX message queues, too.
  28. There's a new IgnoreOnIsolate= flag which may be used to ensure certain units are left untouched by isolation requests. There's a new IgnoreOnSnapshot= flag which may be used to exclude certain units from snapshot units when they are created.
  29. There are now small mechanism services for changing the local hostname and other host meta data, changing the system locale and console settings and the system clock.
  30. We now limit the capability bounding set for a number of our internal services by default.
  31. Plymouth may now be disabled globally with plymouth.enable=0 on the kernel command line.
  32. We now disallocate VTs when a getty finished running (and optionally other tools run on VTs). This adds extra security since it clears up the scrollback buffer so that subsequent users cannot get access to a user's session output.
  33. In socket units there are now options to control the IP_TRANSPARENT, SO_BROADCAST, SO_PASSCRED, SO_PASSSEC socket options.
  34. The receive and send buffers of socket units may now be set larger than the default system settings if needed by using SO_{RCV,SND}BUFFORCE.
  35. We now set the hardware timezone as one of the first things in PID 1, in order to avoid time jumps during normal userspace operation, and to guarantee sensible times on all generated logs. We also no longer save the system clock to the RTC on shutdown, assuming that this is done by the clock control tool when the user modifies the time, or automatically by the kernel if NTP is enabled.
  36. The SELinux directory got moved from /selinux to /sys/fs/selinux.
  37. We added a small service systemd-logind that keeps track of logged in users and their sessions. It creates control groups for them, implements the XDG_RUNTIME_DIR specification for them, maintains seats and device node ACLs and implements shutdown/idle inhibiting for clients. It auto-spawns gettys on all local VTs when the user switches to them (instead of starting six of them unconditionally), thus reducing the resource footprint by default. It has a D-Bus interface as well as a simple synchronous library interface. This mechanism obsoletes ConsoleKit which is now deprecated and should no longer be used.
  38. There's now full, automatic multi-seat support, and this is enabled in GNOME 3.4. Just by plugging in new seat hardware you get a new login screen on your seat's screen.
  39. There is now an option ControlGroupModify= to allow services to change the properties of their control groups dynamically, and one to make control groups persistent in the tree (ControlGroupPersistent=) so that they can be created and maintained by external tools.
  40. We now jump back into the initrd in shutdown, so that it can detach the root file system and the storage devices backing it. This allows us (for the first time!) to reliably undo complex storage setups on shutdown and leave them in a clean state.
  41. systemctl now supports presets, a way for distributions and administrators to define their own policies on whether services should be enabled or disabled by default on package installation.
  42. systemctl now has high-level verbs for masking/unmasking units. There's also a new command (systemctl list-unit-files) for determining the list of all installed unit files and whether they are enabled or not.
  43. We now apply sysctl variables to each new network device, as it appears. This makes /etc/sysctl.d compatible with hot-plug network devices.
  44. There's limited profiling for SELinux start-up performance built into PID 1.
  45. There's a new switch PrivateNetwork= to turn off any network access for a specific service.
  46. Service units may now include configuration for control group parameters. A few (such as MemoryLimit=) are exposed with high-level options, and all others are available via the generic ControlGroupAttribute= setting.
  47. There's now the option to mount certain cgroup controllers jointly at boot. We do this now for cpu and cpuacct by default.
  48. We added the journal and turned it on by default.
  49. All service output is now written to the Journal by default, regardless of whether it is sent via syslog or simply written to stdout/stderr. Both message streams end up in the same location and are interleaved the way they should. All log messages, even from the kernel and from early boot, end up in the journal. No service output goes unnoticed anymore; everything is saved and indexed at the same location.
  50. systemctl status will now show the last 10 log lines for each service, directly from the journal.
  51. We now show the progress of fsck at boot on the console, again. We also show the much loved colorful [ OK ] status messages at boot again, as known from most SysV implementations.
  52. We merged udev into systemd.
  53. We implemented and documented interfaces to container managers and initrds for passing execution data to systemd. We also implemented and documented an interface for storage daemons that are required to back the root file system.
  54. There are two new options in service files to propagate reload requests between several units.
  55. systemd-cgls won't show kernel threads by default anymore, or show empty control groups.
  56. We added a new tool systemd-cgtop that shows resource usage of whole services in a top(1) like fashion.
  57. systemd may now supervise services in watchdog style. If enabled for a service the daemon has to ping PID 1 in regular intervals or is otherwise considered failed (which might then result in restarting it, or even rebooting the machine, as configured). Also, PID 1 is capable of pinging a hardware watchdog. Putting this together, the hardware watchdogs PID 1 and PID 1 then watchdogs specific services. This is highly useful for high-availability servers as well as embedded machines. Since watchdog hardware is nowadays built into all modern chipsets (including desktop chipsets), this should hopefully help to make this a more widely used functionality.
  58. We added support for a new kernel command line option systemd.setenv= to set an environment variable system-wide.
  59. By default services which are started by systemd will have SIGPIPE set to ignored. The Unix SIGPIPE logic is used to reliably implement shell pipelines and when left enabled in services is usually just a source of bugs and problems.
  60. You may now configure the rate limiting that is applied to restarts of specific services. Previously the rate limiting parameters were hard-coded (similar to SysV).
  61. There's now support for loading the IMA integrity policy into the kernel early in PID 1, similar to how we already did it with the SELinux policy.
  62. There's now an official API to schedule and query scheduled shutdowns.
  63. We changed the license from GPL2+ to LGPL2.1+.
  64. We made systemd-detect-virt an official tool in the tool set. Since we already had code to detect certain VM and container environments we now added an official tool for administrators to make use of in shell scripts and suchlike.
  65. We documented numerous interfaces systemd introduced.
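To make the watchdog supervision from item 57 a bit more concrete, here is a minimal Python sketch of the daemon side of the ping. It assumes the notification protocol documented for sd_notify(): the service manager passes the address of a datagram socket in the NOTIFY_SOCKET environment variable, and the daemon sends the string WATCHDOG=1 to it at regular intervals. The function name is made up for illustration; a real daemon would normally just call sd_notify() from libsystemd.

```python
import os
import socket


def notify_watchdog(notify_socket=None):
    """Send one keep-alive ping; return True if a datagram was sent."""
    path = notify_socket or os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # no supervisor asked for notifications
    if path.startswith("@"):
        path = "\0" + path[1:]  # abstract namespace socket (Linux-specific)
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"WATCHDOG=1", path)
    return True
```

A daemon would call this from its main loop, comfortably more often than the configured watchdog interval, so that a hung main loop stops the pings and the service gets treated as failed.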

Much of the stuff above is already available in Fedora 15 and 16, or will be made available in the upcoming Fedora 17.

And that's it for now. There's a lot of other stuff in the git commits, but most of it is smaller and I will thus spare you.

I'd like to thank everybody who contributed to systemd over the past years.

Thanks for your interest!

posted at: 00:17 | path: /projects | permanent link to this entry | comments

Tue, 10 Apr 2012

Control Groups vs. Control Groups

TL;DR: systemd does not require the performance-sensitive bits of Linux control groups enabled in the kernel. However, it does require some non-performance-sensitive bits of the control group logic.

In some areas of the community there's still some confusion about Linux control groups and their performance impact, and what precisely it is that systemd requires of them. In the hope to clear this up a bit, I'd like to point out a few things:

Control Groups are two things: (A) a way to hierarchically group and label processes, and (B) a way to then apply resource limits to these groups. systemd only requires the former (A), and not the latter (B). That means you can compile your kernel without any control group resource controllers (B) and systemd will work perfectly on it. However, if you in addition disable the grouping feature entirely (A) then systemd will loudly complain at boot and proceed only reluctantly with a big warning and in a limited functionality mode.

At compile time, the grouping/labelling feature in the kernel is enabled by CONFIG_CGROUPS=y, the individual controllers by CONFIG_CGROUP_FREEZER=y, CONFIG_CGROUP_DEVICE=y, CONFIG_CGROUP_CPUACCT=y, CONFIG_CGROUP_MEM_RES_CTLR=y, CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y, CONFIG_CGROUP_MEM_RES_CTLR_KMEM=y, CONFIG_CGROUP_PERF=y, CONFIG_CGROUP_SCHED=y, CONFIG_BLK_CGROUP=y, CONFIG_NET_CLS_CGROUP=y, CONFIG_NETPRIO_CGROUP=y. And since (as mentioned) we only need the former (A), not the latter (B) you may disable all of the latter options while enabling CONFIG_CGROUPS=y, if you want to run systemd on your system.
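As a rough illustration, the following Python sketch checks a kernel configuration text for the one option systemd needs (A) and lists which of the optional controllers (B) are switched on. The option names are the ones listed above; the helper names and the sample configuration are made up, and where the configuration text comes from (a .config file in a build tree, for instance) is left to the caller.

```python
import re

# Controller options named in the text above (B); CONFIG_CGROUPS is (A).
CONTROLLER_OPTIONS = [
    "CONFIG_CGROUP_FREEZER", "CONFIG_CGROUP_DEVICE", "CONFIG_CGROUP_CPUACCT",
    "CONFIG_CGROUP_MEM_RES_CTLR", "CONFIG_CGROUP_MEM_RES_CTLR_SWAP",
    "CONFIG_CGROUP_MEM_RES_CTLR_KMEM", "CONFIG_CGROUP_PERF",
    "CONFIG_CGROUP_SCHED", "CONFIG_BLK_CGROUP", "CONFIG_NET_CLS_CGROUP",
    "CONFIG_NETPRIO_CGROUP",
]


def enabled_options(config_text):
    """Return the set of options set to 'y' in a .config style text."""
    return set(re.findall(r"^(CONFIG_\w+)=y$", config_text, flags=re.MULTILINE))


def check_for_systemd(config_text):
    """Report the grouping feature (A) and the enabled controllers (B)."""
    enabled = enabled_options(config_text)
    return {
        "grouping": "CONFIG_CGROUPS" in enabled,
        "controllers": sorted(o for o in CONTROLLER_OPTIONS if o in enabled),
    }


# Made-up sample: grouping on, only the CPU scheduler controller enabled.
sample = """\
CONFIG_CGROUPS=y
# CONFIG_CGROUP_FREEZER is not set
CONFIG_CGROUP_SCHED=y
"""
print(check_for_systemd(sample))
```

Only the "grouping" entry matters for whether systemd runs without complaining; the "controllers" list may be empty.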

What about the performance impact of these options? Well, every bit of code comes at some price, so none of these options come entirely for free. However, the grouping feature (A) alters the general logic very little, it just sticks hierarchical labels on processes, and its impact is minimal since that is usually not in any hot path of the OS. This is different for the various controllers (B) which have a much bigger impact since they influence the resource management of the OS and are full of hot paths. This means that the kernel feature that systemd mandatorily requires (A) has a minimal effect on system performance, but the actually performance-sensitive features of control groups (B) are entirely optional.

On boot, systemd will mount all controller hierarchies it finds enabled in the kernel to individual directories below /sys/fs/cgroup/. This is the official place where kernel controllers are mounted to these days. The /sys/fs/cgroup/ mount point in the kernel was created precisely for this purpose. Since the control group controllers are a shared facility that might be used by a number of different subsystems a few projects have agreed on a set of rules in order to avoid that the various bits of code step on each other's toes when using these directories.
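To see which hierarchies ended up mounted, one can look at the mount table. The Python sketch below parses text in the /proc/mounts format (device, mount point, file system type, options, and two numeric fields per line) and picks out the entries of type cgroup. The sample lines are invented for illustration; the exact options a real system shows will differ.

```python
def cgroup_mounts(mounts_text):
    """Map each cgroup mount point to its list of mount options."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[2] == "cgroup":
            result[fields[1]] = fields[3].split(",")
    return result


# Invented sample in /proc/mounts format.
sample = """\
tmpfs /sys/fs/cgroup tmpfs rw,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,name=systemd 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,cpuacct,cpu 0 0
"""
for mount_point in sorted(cgroup_mounts(sample)):
    print(mount_point)
```

On a live system the same function could be fed the contents of /proc/mounts.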

systemd will also maintain its own, private, controller-less, named control group hierarchy which is mounted to /sys/fs/cgroup/systemd/. This hierarchy is private property of systemd, and other software should not try to interfere with it. This hierarchy is how systemd makes use of the naming and grouping feature of control groups (A) without actually requiring any kernel controller enabled for that.
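The label systemd attaches to a process through this named hierarchy can be read back from /proc/PID/cgroup, where each line has the form hierarchy-ID:controller-list:cgroup-path and the private hierarchy shows up as name=systemd. A small Python sketch of that lookup follows; the function name and the sample text are hypothetical.

```python
def systemd_cgroup(proc_cgroup_text):
    """Return the path in the name=systemd hierarchy, or None if absent."""
    for line in proc_cgroup_text.splitlines():
        # Each line: hierarchy-ID:controller-list:cgroup-path
        parts = line.split(":", 2)
        if len(parts) == 3 and parts[1] == "name=systemd":
            return parts[2]
    return None


# Hypothetical contents of /proc/PID/cgroup for a web server process.
sample = """\
3:cpuacct,cpu:/system/httpd.service
1:name=systemd:/system/httpd.service
"""
print(systemd_cgroup(sample))
```

Note that the lookup works regardless of whether any controller hierarchy (such as "cpu") is present, which is exactly the point of the controller-less hierarchy.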

Now, you might notice that by default systemd does create per-service cgroups in the "cpu" controller if it finds it enabled in the kernel. This is entirely optional, however. We chose to make use of it by default to even out CPU usage between system services. Example: On a traditional web server machine Apache might end up having 100 CGI worker processes around, while MySQL only has 5 processes running. Without the use of the "cpu" controller this means that Apache all together ends up having 20x more CPU available than MySQL since the kernel tries to provide every process with the same amount of CPU time. On the other hand, if we add these two services to the "cpu" controller in individual groups by default, Apache and MySQL get the same amount of CPU, which we think is a good default.
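The arithmetic behind this example can be spelled out in a few lines of Python. This is a simplified model, not a description of the kernel scheduler: it assumes every process is CPU-bound all the time and all weights are equal, so that CPU time is split evenly either between processes or between service groups.

```python
from fractions import Fraction


def per_process_shares(process_counts):
    """CPU fraction per service when every process gets an equal slice."""
    total = sum(process_counts.values())
    return {name: Fraction(n, total) for name, n in process_counts.items()}


def per_group_shares(process_counts):
    """CPU fraction per service when every service group gets an equal slice."""
    return {name: Fraction(1, len(process_counts)) for name in process_counts}


services = {"apache": 100, "mysql": 5}
flat = per_process_shares(services)
grouped = per_group_shares(services)

print(flat["apache"] / flat["mysql"])        # ratio without the "cpu" controller
print(grouped["apache"] / grouped["mysql"])  # ratio with one group per service
```

Without grouping Apache gets 100/105 of the CPU and MySQL 5/105, which is the 20x difference mentioned above; with one group per service both get one half.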

Note that if the CPU controller is not enabled in the kernel systemd will not attempt to make use of the "cpu" hierarchy as described above. Also, even if it is enabled in the kernel it is trivial to tell systemd not to make use of it: Simply edit /etc/systemd/system.conf and set DefaultControllers= to the empty string.

Let's discuss a few frequently heard complaints regarding systemd's use of control groups:

I hope this explanation was useful for a reader or two! Thank you for your time!

posted at: 19:09 | path: /projects | permanent link to this entry | comments

GUADEC 2012 CFP Ending Soon!

In case you haven't submitted your talk proposal for GUADEC 2012 in A Coruña, Spain yet, hurry: the deadline is on April 14th, i.e. this Saturday! Read the Call for Participation! Submit a proposal!

posted at: 17:40 | path: /projects | permanent link to this entry | comments

Wed, 28 Mar 2012

/tmp or not /tmp?

A number of Linux distributions have recently switched (or started switching) to /tmp on tmpfs by default (ArchLinux, Debian among others). Other distributions have plans/are discussing doing the same (Ubuntu, OpenSUSE). Since we believe this is a good idea and it's good to keep the delta between the distributions minimal we are proposing the same for Fedora 18, too. On Solaris a similar change was already implemented in 1994 (and other Unixes made a similar change long ago, too). Yet, not all of our software is written in a way that it works nicely together with /tmp on tmpfs.

Another Fedora feature (for Fedora 17) changed the semantics of /tmp for many system services to make them more secure, by isolating the /tmp namespaces of the various services. Handling of temporary files in /tmp has been security sensitive ever since it was introduced, because it traditionally has been a world-writable, shared namespace, and unless all user code safely uses randomized file names it is vulnerable to DoS attacks and worse.

In this blog story I'd like to shed some light on proper usage of /tmp and what your Linux application should use for what purpose. We'll not discuss why /tmp on tmpfs is a good idea, for that refer to the Fedora feature page. Here we'll just discuss what /tmp should be used for and for what it shouldn't be, as well as what should be used instead. All that in order to make sure your application remains compatible with these new features introduced to many newer Linux distributions.

/tmp is (as the name suggests) an area where temporary files applications require during operation may be placed. Of course, temporary files differ very much in their properties:

Traditionally, /tmp has not only been the place where actual temporary files are stored, but some software used to place (and often still continues to place) communication primitives such as sockets, FIFOs, shared memory there as well. Notably X11, but many others too. Usage of world-writable shared namespaces for communication purposes has always been problematic, since to establish communication you need stable names, but stable names open the doors for DoS attacks. This can be corrected partially, by establishing protected per-app directories for certain services during early boot (like we do for X11), but this only fixes the problem partially, since this only works correctly if every package installation is followed by a reboot.

Besides /tmp there are various other places where temporary files (or other files that traditionally have been stored in /tmp) can be stored. Here's a quick overview of the candidates:

Now that we have introduced the contestants, here's a rough guide how we suggest you (a Linux application developer) pick the right directory to use:

  1. You need a place to put your socket (or other communication primitive) and your code runs privileged: use a subdirectory beneath /run. (Or beneath /var/run for extra compatibility.)
  2. You need a place to put your socket (or other communication primitive) and your code runs unprivileged: use a subdirectory beneath $XDG_RUNTIME_DIR.
  3. You need a place to put your larger downloads and downloads in progress and run unprivileged: use $XDG_DOWNLOAD_DIR.
  4. You need a place to put cache files which should be persistent and run unprivileged: use $XDG_CACHE_HOME.
  5. Nothing of the above applies and you need to place a small file that needs no persistency: use $TMPDIR with a fallback on /tmp. And use mkstemp(), and mkdtemp() and nothing homegrown.
  6. Otherwise use $TMPDIR with a fallback on /var/tmp. Also use mkstemp()/mkdtemp().
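The six rules above can be written down as a small decision function. The Python sketch below is only an illustration: the "kind" labels and the function name are invented, and it assumes the XDG variables are actually set in the environment (real code would need to handle the case where they are not).

```python
import os


def pick_directory(kind, privileged=False, environ=None):
    """Pick a base directory following the six rules above.

    kind is one of the invented labels "socket", "download", "cache",
    "small-temp" or "large-temp".
    """
    env = os.environ if environ is None else environ
    if kind == "socket":
        # Rules 1 and 2: sockets and other communication primitives.
        return "/run" if privileged else env["XDG_RUNTIME_DIR"]
    if kind == "download" and not privileged:
        return env["XDG_DOWNLOAD_DIR"]    # rule 3
    if kind == "cache" and not privileged:
        return env["XDG_CACHE_HOME"]      # rule 4
    if kind == "small-temp":
        return env.get("TMPDIR", "/tmp")  # rule 5
    return env.get("TMPDIR", "/var/tmp")  # rule 6


print(pick_directory("small-temp", environ={}))
print(pick_directory("large-temp", environ={}))
```

For the socket rules the result is only the base directory; as the rules say, the application should create its own subdirectory beneath it.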

Note that these rules above are only suggested by us. These rules take into account everything we know about this topic and avoid problems with current and future distributions, as far as we can see them. Please consider updating your projects to follow these rules, and keep them in mind if you write new code.

One thing we'd like to stress is that /tmp and /var/tmp more often than not are actually not the right choice for your usecase. There are valid uses of these directories, but quite often another directory might actually be the better place. So, be careful, consider the other options, but if you do go for /tmp or /var/tmp then at least make sure to use mkstemp()/mkdtemp().
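For code that does end up in /tmp or /var/tmp, here is a minimal Python sketch of the recommended pattern, using tempfile.mkstemp() from the standard library as the counterpart of the C function. The helper name and the "example-" prefix are made up for illustration.

```python
import os
import tempfile


def write_scratch_file(data, fallback="/tmp"):
    """Create a uniquely named temporary file and return its path."""
    directory = os.environ.get("TMPDIR", fallback)
    # mkstemp() picks a random name and creates the file exclusively, so a
    # file planted by another user under a guessed name is never reused.
    fd, path = tempfile.mkstemp(prefix="example-", dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


path = write_scratch_file(b"hello")
print(path)
os.unlink(path)  # the caller is responsible for cleaning up
```

Passing fallback="/var/tmp" gives the variant for larger or more persistent files; tempfile.mkdtemp() works the same way for directories.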

Thank you for your interest!

Oh, and if you now complain that we don't understand Unix, and that we are morons and worse, then please read this again, and you might notice that this is just a best practice guide, not a specification we have written. Nothing that introduces anything new, just something that explains how things are.

If you want to complain about the tmp-on-tmpfs or ServicesPrivateTmp feature, then this is not the right place either, because this blog post is not really about that. Please direct this to fedora-devel instead. Thank you very much.


[1] Well, or to turn this around: unless you have a PhD in advanced Unixology and are not using mkstemp()/mkdtemp() but use /tmp nonetheless it's very likely you are writing vulnerable code.

posted at: 14:04 | path: /projects | permanent link to this entry | comments

Mon, 13 Feb 2012


One of the new configuration files systemd introduced is /etc/os-release. It replaces the multitude of per-distribution release files[1] with a single one. Yesterday we decided to drop support for systems lacking /etc/os-release in systemd since recently the majority of the big distributions adopted /etc/os-release and many small ones did, too[2]. It's our hope that by dropping support for non-compliant distributions we gently put some pressure on the remaining hold-outs to adopt this scheme as well.

I'd like to take the opportunity to explain a bit what the new file offers, why application developers should care, and why the distributions should adopt it. Of course, this file is pretty much a triviality in many ways, but I guess it's still one that deserves explanation.

So, you ask why this all?


There's already the lsb_release tool for this, why don't you just use that? Well, it's a very strange interface: a shell script you have to invoke (and hence spawn asynchronously from your C code), and it's not written to be extensible. It's an optional package in many distributions, and nothing we'd be happy to invoke as part of early boot in order to show a welcome message. (In times with sub-second userspace boot times we really don't want to invoke a huge shell script for a triviality like showing the welcome message). The lsb_release tool to us appears to be an attempt at abstracting distribution checks, where standardization of distribution checks is needed. It's simply a badly designed interface. In our opinion, it has its use as an interface to determine the LSB version itself, but not for checking the distribution or version.

Why haven't you adopted one of the generic release files, such as Fedora's /etc/system-release? Well, they are much nicer than lsb_release, that much is true. However, they are not extensible and are not really parsable, if the distribution needs to be identified programmatically or a specific version needs to be verified.

Why didn't you call this file /etc/bikeshed instead? The name /etc/os-release sucks! In a way, I think you kind of answered your own question there already.

Does this mean my distribution can now drop our equivalent of /etc/fedora-release? Unlikely, too much code exists that still checks for the individual release files, and you probably shouldn't break that. This new file makes things easy for applications, not for distributions: applications can now rely on a single file only, and use it in a nice way. Distributions will have to continue to ship the old files unless they are willing to break compatibility here.

This is so useless! My application needs to be compatible with distros from 1998, so how could I ever make use of the new file? I will have to continue using the old ones! True, if you need compatibility with really old distributions you do. But for new code this might not be an issue, and in general new APIs are new APIs. So if you decide to depend on it, you add a dependency on it. However, even if you need to stay compatible it might make sense to check /etc/os-release first and just fall back to the old files if it doesn't exist. The least it does for you is that you don't need 25+ open() attempts on modern distributions, but just one.
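The "check /etc/os-release first, then fall back" approach could look roughly like the following Python sketch. It assumes the file consists of shell-style KEY=value lines with optional quoting; the shortened list of legacy files, the root parameter (handy for testing against a directory tree) and the SOURCE key in the fallback result are all inventions for this example.

```python
import os
import shlex

# A few of the legacy files, relative to the root directory (shortened list).
LEGACY_FILES = ["etc/redhat-release", "etc/SuSE-release", "etc/debian_version"]


def parse_os_release(text):
    """Parse shell-style KEY=value lines into a dict, stripping quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        parts = shlex.split(raw)  # handles "..." and '...' quoting
        fields[key] = parts[0] if parts else ""
    return fields


def identify_os(root="/"):
    """Try etc/os-release first, then fall back to the legacy files."""
    try:
        with open(os.path.join(root, "etc/os-release"), encoding="utf-8") as f:
            return parse_os_release(f.read())
    except FileNotFoundError:
        pass
    for legacy in LEGACY_FILES:
        try:
            with open(os.path.join(root, legacy), encoding="utf-8") as f:
                # Legacy files are free-form text; keep them as a display name.
                return {"PRETTY_NAME": f.read().strip(), "SOURCE": legacy}
        except FileNotFoundError:
            continue
    return {}


sample = 'NAME=Fedora\nID=fedora\nPRETTY_NAME="Fedora 17 (Beefy Miracle)"\n'
print(parse_os_release(sample)["PRETTY_NAME"])
```

On a distribution that ships the new file this costs a single open(); the loop over the old files only runs when it is missing.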

You evil people are forcing my beloved distro $XYZ to adopt your awful systemd schemes. I hate you! You hate too much, my friend. Also, I am pretty sure it's not difficult to see the benefit of this new file independently of systemd, and it's truly useful on systems without systemd, too.

I hate what you people do, can I just ignore this? Well, you really need to work on your constant feelings of hate, my friend. But, to a certain degree yes, you can ignore this for a while longer. But already, there are a number of applications making use of this file. You lose compatibility with those. Also, you are kinda working towards the further balkanization of the Linux landscape, but maybe that's your intention?

You guys add a new file because you think there are already too many? You guys are so confused! None of the existing files is generic and extensible enough to do what we want it to do. Hence we had to introduce a new one. We acknowledge the irony, however.

The file is extensible? Awesome! I want a new field XYZ= in it! Sure, it's extensible, and we are happy if distributions extend it. Please prefix your keys with your distribution's name however. Or even better: talk to us and we might be able to update the documentation and make your field standard, if you convince us that it makes sense.

Anyway, to summarize all this: if you work on an application that needs to identify the OS it is being built on or is being run on, please consider making use of this new file, we created it for you. If you work on a distribution, and your distribution doesn't support this file yet, please consider adopting this file, too.

If you are working on a small/embedded distribution, or a legacy-free distribution, we encourage you to adopt only this file and not establish any other per-distro release file.

Read the documentation for /etc/os-release.


[1] Yes, multitude, there's at least: /etc/redhat-release, /etc/SuSE-release, /etc/debian_version, /etc/arch-release, /etc/gentoo-release, /etc/slackware-version, /etc/frugalware-release, /etc/altlinux-release, /etc/mandriva-release, /etc/meego-release, /etc/angstrom-version, /etc/mageia-release. And some distributions even have multiple, for example Fedora already has four different files.

[2] To our knowledge at least OpenSUSE, Fedora, ArchLinux, Angstrom, Frugalware have adopted this. (This list is not comprehensive, there are probably more.)

posted at: 19:46 | path: /projects | permanent link to this entry | comments

Thu, 26 Jan 2012

The Case for the /usr Merge

One of the features of Fedora 17 is the /usr merge, put forward by Harald Hoyer and Kay Sievers[1]. Since this feature was proposed, repetitive discussions have taken place all over the various Free Software communities, and usually the same questions were asked: what the reasons behind this feature were, and whether it makes sense to adopt the same scheme for distribution XYZ, too.

Especially in the Non-Fedora world it appears to be socially unacceptable to actually have a look at the Fedora feature page (where many of the questions are already brought up and answered) which is very unfortunate. To improve the situation I spent some time today to summarize the reasons for the /usr merge independently. I'd hence like to direct you to this new page I put up which tries to summarize the reasons for this, with an emphasis on the compatibility point of view:

The Case for the /usr Merge

Note that even though this page is in the systemd wiki, what it covers is mostly orthogonal to systemd. systemd supports both systems with a merged /usr and with a split /usr, and the /usr merge should be interesting for non-systemd distributions as well.

Primarily I put this together to have a nice place to point all those folks who continue to write me annoyed emails, even though I am actually not even working on all of this...

Enjoy the read!


[1] And not actually by me, I am just a supportive spectator and am not doing any work on it. Unfortunately some tech press folks created the false impression I was behind this. But credit where credit is due, this is all Harald's and Kay's work.

posted at: 22:29 | path: /projects | permanent link to this entry | comments

Fri, 20 Jan 2012

Plumbers Wishlist, The Third Edition, a.k.a. "The Thank You Edition"

Last October we published a wishlist for plumbing related features we'd like to see added to the Linux kernel. Three months later it's time to publish a short update, and explain what has been implemented in the kernel, what people have started working on, and what's still missing.

The full, updated list is available on Google Docs.

In general, I must say that the list turned out to be a great success. It shows how awesome the Open Source community is: Just ask nicely and there's a good chance they'll fulfill your wishes! Thank you very much, Linux community!

We'd like to thank everybody who worked on any of the features on that list: Lucas De Marchi, Andi Kleen, Dan Ballard, Li Zefan, Kirill A. Shutemov, Davidlohr Bueso, Cong Wang, Lennart Poettering, Kay Sievers.

Of the items on the list 5 have been fully implemented and are already part of a released kernel, or already merged for inclusion for the next kernels being released.

For 4 further items patches have been posted, and I am hoping they'll get merged eventually. Davidlohr, Wang, Zefan, Kirill, it would be great if you'd continue working on your patches, as we think they are following the right approach[1] even if there was some opposition to them on LKML. So, please keep pushing to solve the outstanding issues and thanks for your work so far!


[1] Yes, I still believe that tmpfs quota should be implemented via resource limits, as everything else wouldn't work, as we don't want to implement complex and fragile userspace infrastructure to racily upload complex quota data for all current and future UIDs ever used on the system into each tmpfs mount point at mount time.

posted at: 21:26 | path: /projects | permanent link to this entry | comments

systemd for Administrators, Part XII

Here's the twelfth installment of my ongoing series on systemd for Administrators:

Securing Your Services

One of the core features of Unix systems is the idea of privilege separation between the different components of the OS. Many system services run under their own user IDs thus limiting what they can do, and hence the impact they may have on the OS in case they get exploited.

This kind of privilege separation only provides very basic protection, however, since in general system services run this way can still do at least as much as a normal local user, though not as much as root. For security purposes it is however very interesting to limit even further what services can do, and to shut them off from a couple of things that normal users are allowed to do.

A great way to limit the impact of services is by employing MAC technologies such as SELinux. If you are interested in locking down your server, running SELinux is a very good idea. systemd enables developers and administrators to apply additional restrictions to local services independently of a MAC. Thus, regardless of whether you are able to make use of SELinux, you may still enforce certain security limits on your services.

In this iteration of the series we want to focus on a couple of these security features of systemd and how to make use of them in your services. These features take advantage of a couple of Linux-specific technologies that have been available in the kernel for a long time, but never have been exposed in a widely usable fashion. These systemd features have been designed to be as easy to use as possible, in order to make them attractive to administrators and upstream developers:

All options described here are documented in systemd's man pages, notably systemd.exec(5). Please consult these man pages for further details.

All these options are available on all systemd systems, regardless of whether SELinux or any other MAC is enabled.

All these options are relatively cheap, so if in doubt use them. Even if you might think that your service doesn't write to /tmp and hence enabling PrivateTmp=yes (as described below) might not be necessary, due to today's complex software it's still beneficial to enable this feature, simply because libraries you link to (and plug-ins to those libraries) which you do not control might need temporary files after all. Example: you never know what kind of NSS module your local installation has enabled, and what that NSS module does with /tmp.

These options are hopefully interesting both for administrators to secure their local systems, and for upstream developers to ship their services secure by default. We strongly encourage upstream developers to consider using these options by default in their upstream service units. They are very easy to make use of and have major benefits for security.

Isolating Services from the Network

A very simple but powerful configuration option you may use in systemd service definitions is PrivateNetwork=:
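
A minimal sketch of how this might look in a service unit. The daemon path /usr/bin/mydaemon is a hypothetical placeholder and does not refer to any real package:

```ini
[Service]
# Hypothetical daemon binary, for illustration only
ExecStart=/usr/bin/mydaemon
# Give the service its own network namespace containing only a private loopback device
PrivateNetwork=yes
```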


With this simple switch a service and all the processes it consists of are entirely disconnected from any kind of networking. Network interfaces become unavailable to the processes; the only one they'll see is the loopback device "lo", but it is isolated from the real host loopback. This is a very powerful protection from network attacks.

Caveat: Some services require the network to be operational. Of course, nobody would consider using PrivateNetwork=yes on a network-facing service such as Apache. However, even for non-network-facing services network support might be necessary, and this is not always obvious. Example: if the local system is configured for an LDAP-based user database, doing glibc name lookups with calls such as getpwnam() might end up resulting in network access. That said, even in those cases it is more often than not OK to use PrivateNetwork=yes since user IDs of system service users are required to be resolvable even without any network around. That means as long as the only user IDs your service needs to resolve are below the magic 1000 boundary, using PrivateNetwork=yes should be OK.

Internally, this feature makes use of network namespaces of the kernel. If enabled, a new network namespace is opened and only the loopback device is configured in it.

Service-Private /tmp

Another very simple but powerful configuration switch is PrivateTmp=:
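
A minimal sketch of a unit using this switch, again assuming a hypothetical daemon at /usr/bin/mydaemon:

```ini
[Service]
# Hypothetical daemon binary, for illustration only
ExecStart=/usr/bin/mydaemon
# Give the service a /tmp of its own, isolated from the host's /tmp
PrivateTmp=yes
```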


If enabled this option will ensure that the /tmp directory the service will see is private and isolated from the host system's /tmp. /tmp traditionally has been a shared space for all local services and users. Over the years it has been a major source of security problems for a multitude of services. Symlink attacks and DoS vulnerabilities due to guessable /tmp temporary files are common. By isolating the service's /tmp from the rest of the host, such vulnerabilities become moot.

For Fedora 17 a feature has been accepted in order to enable this option across a large number of services.

Caveat: Some services actually misuse /tmp as a location for IPC sockets and other communication primitives, even though this is almost always a vulnerability (simply because if you use it for communication you need guessable names, and guessable names make your code vulnerable to DoS and symlink attacks) and /run is the much safer replacement for this, simply because it is not a location writable to unprivileged processes. For example, X11 places its communication sockets below /tmp (which is actually secure -- though still not ideal -- in this exceptional case since it does so in a safe subdirectory which is created at early boot). Services which need to communicate via such communication primitives in /tmp are not candidates for PrivateTmp=. Thankfully these days only very few services misusing /tmp like this remain.

Internally, this feature makes use of file system namespaces of the kernel. If enabled, a new file system namespace is opened, inheriting most of the host hierarchy with the exception of /tmp.

Making Directories Appear Read-Only or Inaccessible to Services

With the ReadOnlyDirectories= and InaccessibleDirectories= options it is possible to make the specified directories read-only or entirely inaccessible (for both reading and writing) to the service:
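
A minimal sketch, using the two directories discussed in the text that follows; the daemon path is a hypothetical placeholder:

```ini
[Service]
# Hypothetical daemon binary, for illustration only
ExecStart=/usr/bin/mydaemon
# The service sees /home as empty and inaccessible
InaccessibleDirectories=/home
# The service can read but not write below /var
ReadOnlyDirectories=/var
```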


With these two configuration lines the whole tree below /home becomes inaccessible to the service (i.e. the directory will appear empty and with 000 access mode), and the tree below /var becomes read-only.

Caveat: Note that ReadOnlyDirectories= currently is not recursively applied to submounts of the specified directories (i.e. mounts below /var in the example above stay writable). This is likely to get fixed soon.

Internally, this is also implemented based on file system namespaces.

Taking Away Capabilities From Services

Another very powerful security option in systemd is CapabilityBoundingSet=, which allows you to limit in a relatively fine-grained fashion which kernel capabilities a started service retains:

CapabilityBoundingSet=CAP_CHOWN CAP_KILL

In the example above only the CAP_CHOWN and CAP_KILL capabilities are retained by the service, and the service and any processes it might create have no chance to ever acquire any other capabilities again, not even via setuid binaries. The list of currently defined capabilities is available in capabilities(7). Unfortunately some of the defined capabilities are overly generic (such as CAP_SYS_ADMIN), however they are still a very useful tool, in particular for services that otherwise run with full root privileges.

To identify precisely which capabilities are necessary for a service to run cleanly is not always easy and requires a bit of testing. To simplify this process a bit, it is possible to blacklist certain capabilities that are definitely not needed instead of whitelisting all that might be needed. Example: the CAP_SYS_PTRACE is a particularly powerful and security relevant capability needed for the implementation of debuggers, since it allows introspecting and manipulating any local process on the system. A service like Apache obviously has no business in being a debugger for other processes, hence it is safe to remove the capability from it:
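
A minimal sketch of such a blacklist entry. The [Service] context is an assumption for illustration; only the capability name is taken from the text above:

```ini
[Service]
# Retain every capability except CAP_SYS_PTRACE
CapabilityBoundingSet=~CAP_SYS_PTRACE
```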


The ~ character the value assignment here is prefixed with inverts the meaning of the option: instead of listing all capabilities the service will retain you may list the ones it will not retain.

Caveat: Some services might get confused if certain capabilities are made unavailable to them. Thus when determining the right set of capabilities to keep around you need to do this carefully, and it might be a good idea to talk to the upstream maintainers since they should know best which operations a service might need to run successfully.

Caveat 2: Capabilities are not a magic wand. You probably want to combine them and use them in conjunction with other security options in order to make them truly useful.

To easily check which processes on your system retain which capabilities use the pscap tool from the libcap-ng-utils package.

Making use of systemd's CapabilityBoundingSet= option is often a simple, discoverable and cheap replacement for patching all system daemons individually to control the capability bounding set on their own.

Disallowing Forking, Limiting File Creation for Services

Resource Limits may be used to apply certain security limits on services being run. Primarily, resource limits are useful for resource control (as the name suggests...) not so much access control. However, two of them can be useful to disable certain OS features: RLIMIT_NPROC and RLIMIT_FSIZE may be used to disable forking and disable writing of any files with a size > 0:
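
A minimal sketch of the two settings. The concrete values are assumptions chosen to match the description above: a process limit of 1 leaves no room for a second process, and a file size limit of 0 forbids writing any data to files:

```ini
[Service]
# Hypothetical daemon binary, for illustration only
ExecStart=/usr/bin/mydaemon
# RLIMIT_NPROC: assumed value, allows no process beyond the service's own
LimitNPROC=1
# RLIMIT_FSIZE: files may not grow beyond 0 bytes
LimitFSIZE=0
```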


Note that this will work only if the service in question drops privileges and runs under a (non-root) user ID of its own or drops the CAP_SYS_RESOURCE capability, for example via CapabilityBoundingSet= as discussed above. Without that a process could simply increase the resource limit again thus voiding any effect.

Caveat: LimitFSIZE= is pretty brutal. If the service attempts to write a file with a size > 0, it will immediately be sent the SIGXFSZ signal, which unless caught terminates the process. Also, creating files with size 0 is still allowed, even if this option is used.

For more information on these and other resource limits, see setrlimit(2).

Controlling Device Node Access of Services

Device nodes are an important interface to the kernel and its drivers. Since drivers tend to get much less testing and security checking than the core kernel they often are a major entry point for security hacks. systemd allows you to control access to devices individually for each service:

DeviceAllow=/dev/null rw

This will limit access to /dev/null and only this device node, disallowing access to any other device nodes.

The feature is implemented on top of the devices cgroup controller.

Other Options

Besides the easy to use options above there are a number of other security relevant options available. However they usually require a bit of preparation in the service itself and hence are probably primarily useful for upstream developers. These options are RootDirectory= (to set up chroot() environments for a service) as well as User= and Group= to drop privileges to the specified user and group. These options are particularly useful to greatly simplify writing daemons, where all the complexities of securely dropping privileges can be left to systemd, and kept out of the daemons themselves.
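
As a rough sketch of the privilege-dropping options, assuming a hypothetical daemon and a hypothetical dedicated system user and group, both named mydaemon, which would have to exist on the system already:

```ini
[Service]
# Hypothetical daemon binary, for illustration only
ExecStart=/usr/bin/mydaemon
# Hypothetical system user and group; systemd drops privileges before starting the daemon
User=mydaemon
Group=mydaemon
```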

If you are wondering why these options are not enabled by default: some of them simply break the semantics of traditional Unix, and to maintain compatibility we cannot enable them by default. For example, since traditional Unix enforced that /tmp was a shared namespace, and processes could use it for IPC, we cannot just go and turn that off globally, just because /tmp's role in IPC is now replaced by /run.

And that's it for now. If you are working on unit files for upstream or in your distribution, please consider using one or more of the options listed above. If your service is secure by default by taking advantage of these options, this will not only help your users but also make the Internet a safer place.

posted at: 02:26 | path: /projects | permanent link to this entry | comments

Mon, 16 Jan 2012

PulseAudio vs. AudioFlinger

Arun put an awesome article up, detailing how PulseAudio compares to Android's AudioFlinger in terms of power consumption and suchlike. Suffice to say, PulseAudio rocks, but go and read the whole thing, it's worth it.

Apparently, AudioFlinger is a great choice if you want to shorten your battery life.

posted at: 16:31 | path: /projects | permanent link to this entry | comments

Fri, 18 Nov 2011

Introducing the Journal

In the past weeks we have been working on a major new addition to systemd that will hopefully positively change the Linux ecosystem in a number of ways. But see for yourself, check out the full explanation of what we have implemented in the design document we put up on Google Docs.

posted at: 16:28 | path: /projects | permanent link to this entry | comments

Mon, 07 Nov 2011

Kernel Hackers Panel

At LinuxCon Europe/ELCE I had the chance to moderate the kernel hackers panel with Linus Torvalds, Alan Cox, Paul McKenney and Thomas Gleixner on stage. I like to believe it went quite well, but check it out for yourself, as a video recording is now available online:

For me personally I think the most notable topic covered was Control Groups, and the clarification that they are something that is needed even though their implementation right now is in many ways less than perfect. But in the end there is no reasonable way around them: much like SMP, they are technology that complicates things substantially but is ultimately unavoidable.

Other videos from ELCE are online now, too.

posted at: 16:53 | path: /projects | permanent link to this entry | comments

Tue, 01 Nov 2011


At the Kernel Summit in Prague last week Kay Sievers and I led a session on developing shared userspace libraries, for kernel hackers. More and more userspace interfaces of the kernel (for example many which deal with storage, audio, resource management, security, file systems or a number of other subsystems) nowadays rely on a dedicated userspace component. As people who work primarily in the plumbing layer of the Linux OS we noticed over and over again that these libraries, written by people who usually are at home on the kernel side of things, make the same mistakes repeatedly, thus making life for the users of the libraries unnecessarily difficult. In our session we tried to point out a number of these things, and in particular places where the usual kernel hacking style translates badly into userspace shared library hacking. Our hope is that maybe a few kernel developers have a look at our list of recommendations and consider the points we are raising.

To make things easy we have put together an example skeleton library we dubbed libabc, whose README file includes all our points in terse form. It's available on kernel.org:

The git repository and the README.

This list of recommendations draws inspiration from David Zeuthen's and Ulrich Drepper's well known papers on the topic of writing shared libraries. In the README linked above we try to distill this wealth of information into a terse list of recommendations, with a couple of additions and with a strict focus on a kernel hacker background.

Please have a look, and even if you are not a kernel hacker there might be something useful to know in it, especially if you work on the lower layers of our stack.

If you have any questions or additions, just ping us, or comment below!

posted at: 01:46 | path: /projects | permanent link to this entry | comments

Sun, 23 Oct 2011


If you make it to Prague the coming week for the LinuxCon/ELCE/GStreamer/Kernel Summit/... superconference, make sure not to miss:

All of that at the Clarion Hotel. See you in Prague!

posted at: 01:31 | path: /projects | permanent link to this entry | comments

Thu, 20 Oct 2011

Plumbers Wishlist, The Second Edition

Two weeks ago we published a Plumber's Wishlist for Linux. So far, this has already created lively discussions in the community (as reported on LWN among others), and patches for a few of the items listed have already been posted (thanks a lot to those who worked on this, your contributions are much appreciated!).

We have now prepared a second version of the wish list. It includes a number of additions (tmpfs quota! hostname change notifications! and more!) and updates to the previous items, including links to patches, and references to other interesting material.

We hope to update this wishlist from time to time, so stay tuned!

And now, go and read the new wishlist!

posted at: 20:41 | path: /projects | permanent link to this entry | comments

Mon, 17 Oct 2011

Google doesn't like my name

Nice one, Google suspended my Google+ account because I created it under, well, my name, which is "Lennart Poettering", and Google+ thinks that wasn't my name, even though it says so in my passport and in almost every document I own, and I was never aware I had any other name. This is ridiculous. Google, give me my name back! This is a really uncool move.

posted at: 18:50 | path: /projects | permanent link to this entry | comments

Your Questions for the Kernel Developer Panel at LinuxCon in Prague

I am currently collecting questions for the kernel developer panel at LinuxCon in Prague. If there's something you'd like the panelists to respond to, please post it on the thread, and I'll see what I can do. Thank you!

posted at: 15:38 | path: /projects | permanent link to this entry | comments

It should be obvious but in case it isn't: the opinions reflected here are my own. They are not the views of my employer, or Ronald McDonald, or anyone else.

Please note that I take the liberty to delete any comments posted here that I deem inappropriate, off-topic, or insulting. And I exercise this liberty quite aggressively. So yes, if you comment here, I might censor you. If you don't want to be censored you are welcome to comment on your own blog instead.

Lennart Poettering <mzoybt (at) 0pointer (dot) net>
Syndicated on Planet GNOME, Planet Fedora, planet.freedesktop.org, Planet Debian Upstream. feed RSS 0.91, RSS 2.0