Index | Archives | Atom Feed | RSS Feed

systemd.conf 2015 CfP REMINDER

LAST REMINDER! systemd.conf 2015 Call for Presentations ends August 31st!

Here's the last reminder that the systemd.conf 2015 CfP ends on August 31st 11:59:59pm Central European Time (that's monday next week)! Make sure to submit your proposals until then!

Please submit your proposals on our website!

And don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!.

For further details about systemd.conf consult the conference website.


First Round of systemd.conf 2015 Sponsors

First Round of systemd.conf 2015 Sponsors

We are happy to announce the first round of systemd.conf 2015 sponsors!

Our first Silver sponsor is CoreOS!

CoreOS develops software for modern infrastructure that delivers a consistent operating environment for distributed applications. CoreOS's commercial offering, Tectonic, is an enterprise-ready platform that combines Kubernetes and the CoreOS stack to run Linux containers. In addition CoreOS is the creator and maintainer of open source projects such as CoreOS Linux, etcd, fleet, flannel and rkt. The strategies and architectures that influence CoreOS allow companies like Google, Facebook and Twitter to run their services at scale with high resilience. Learn more about CoreOS here https://coreos.com/, Tectonic here, https://tectonic.com/ or follow CoreOS on Twitter @coreoslinux.

A Bronze sponsor is Codethink:

Codethink is a software services consultancy, focusing on engineering reliable systems for long-term deployment with open source technologies.

A Bronze sponsor is Pantheon:

Pantheon is a platform for professional website development, testing, and deployment. Supporting Drupal and WordPress, Pantheon runs over 100,000 websites for the world's top brands, universities, and media organizations on top of over a million containers.

A Bronze sponsor is Pengutronix:

Pengutronix provides consulting, training and development services for Embedded Linux to customers from the industry. The Kernel Team ports Linux to customer hardware and has more than 3100 patches in the official mainline kernel. In addition to lowlevel ports, the Pengutronix Application Team is responsible for board support packages based on PTXdist or Yocto and deals with system integration (this is where systemd plays an important role). The Graphics Team works on accelerated multimedia tasks, based on the Linux kernel, GStreamer, Qt and web technologies.

We'd like to thank our sponsors for their support! Without sponsors our conference would not be possible!

We'll shortly announce our second round of sponsors, please stay tuned!

If you'd like to join the ranks of systemd.conf 2015 sponsors, please have a look at our Becoming a Sponsor page!

Reminder! The systemd.conf 2015 Call for Presentations ends on monday, August 31st! Please make sure to submit your proposals on the CfP page until then!

Also, don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!.

For further details about systemd.conf consult the conference website.


systemd.conf 2015 Call for Presentations

REMINDER! systemd.conf 2015 Call for Presentations ends August 31st!

We'd like to remind you that the systemd.conf 2015 Call for Presentations ends on August 31st! Please submit your presentation proposals before that data on our website.

We are specifically interested in submissions from projects and vendors building today's and tomorrow's products, services and devices with systemd. We'd like to learn about the problems you encounter and the benefits you see! Hence, if you work for a company using systemd, please submit a presentation!

We are also specifically interested in submissions from downstream distribution maintainers of systemd! If you develop or maintain systemd packages in a distribution, please submit a presentation reporting about the state, future and the problems of systemd packaging so that we can improve downstream collaboration!

And of course, all talks regarding systemd usage in containers, in the cloud, on servers, on the desktop, in mobile and in embedded are highly welcome! Talks about systemd networking and kdbus IPC are very welcome too!

Please submit your presentations until August 31st!

And don't forget to register for the conference! Only a limited number of registrations are available due to space constraints! Register here!.

Also, limited travel and entry fee sponsorship is available for community contributors. Please contact us for details!

For further details about the CfP consult the CfP page.

For further details about systemd.conf consult the conference website.


Announcing systemd.conf 2015

Announcing systemd.conf 2015

We are happy to announce the inaugural systemd.conf 2015 conference of the systemd project.

The conference takes place November 5th-7th, 2015 in Berlin, Germany.

Only a limited number of tickets are available, hence make sure to sign up quickly.

For further details consult the conference website.


The new sd-bus API of systemd

With the new v221 release of systemd we are declaring the sd-bus API shipped with systemd stable. sd-bus is our minimal D-Bus IPC C library, supporting as back-ends both classic socket-based D-Bus and kdbus. The library has been been part of systemd for a while, but has only been used internally, since we wanted to have the liberty to still make API changes without affecting external consumers of the library. However, now we are confident to commit to a stable API for it, starting with v221.

In this blog story I hope to provide you with a quick overview on sd-bus, a short reiteration on D-Bus and its concepts, as well as a few simple examples how to write D-Bus clients and services with it.

What is D-Bus again?

Let's start with a quick reminder what D-Bus actually is: it's a powerful, generic IPC system for Linux and other operating systems. It knows concepts like buses, objects, interfaces, methods, signals, properties. It provides you with fine-grained access control, a rich type system, discoverability, introspection, monitoring, reliable multicasting, service activation, file descriptor passing, and more. There are bindings for numerous programming languages that are used on Linux.

D-Bus has been a core component of Linux systems since more than 10 years. It is certainly the most widely established high-level local IPC system on Linux. Since systemd's inception it has been the IPC system it exposes its interfaces on. And even before systemd, it was the IPC system Upstart used to expose its interfaces. It is used by GNOME, by KDE and by a variety of system components.

D-Bus refers to both a specification, and a reference implementation. The reference implementation provides both a bus server component, as well as a client library. While there are multiple other, popular reimplementations of the client library – for both C and other programming languages –, the only commonly used server side is the one from the reference implementation. (However, the kdbus project is working on providing an alternative to this server implementation as a kernel component.)

D-Bus is mostly used as local IPC, on top of AF_UNIX sockets. However, the protocol may be used on top of TCP/IP as well. It does not natively support encryption, hence using D-Bus directly on TCP is usually not a good idea. It is possible to combine D-Bus with a transport like ssh in order to secure it. systemd uses this to make many of its APIs accessible remotely.

A frequently asked question about D-Bus is why it exists at all, given that AF_UNIX sockets and FIFOs already exist on UNIX and have been used for a long time successfully. To answer this question let's make a comparison with popular web technology of today: what AF_UNIX/FIFOs are to D-Bus, TCP is to HTTP/REST. While AF_UNIX sockets/FIFOs only shovel raw bytes between processes, D-Bus defines actual message encoding and adds concepts like method call transactions, an object system, security mechanisms, multicasting and more.

From our 10year+ experience with D-Bus we know today that while there are some areas where we can improve things (and we are working on that, both with kdbus and sd-bus), it generally appears to be a very well designed system, that stood the test of time, aged well and is widely established. Today, if we'd sit down and design a completely new IPC system incorporating all the experience and knowledge we gained with D-Bus, I am sure the result would be very close to what D-Bus already is.

Or in short: D-Bus is great. If you hack on a Linux project and need a local IPC, it should be your first choice. Not only because D-Bus is well designed, but also because there aren't many alternatives that can cover similar functionality.

Where does sd-bus fit in?

Let's discuss why sd-bus exists, how it compares with the other existing C D-Bus libraries and why it might be a library to consider for your project.

For C, there are two established, popular D-Bus libraries: libdbus, as it is shipped in the reference implementation of D-Bus, as well as GDBus, a component of GLib, the low-level tool library of GNOME.

Of the two libdbus is the much older one, as it was written at the time the specification was put together. The library was written with a focus on being portable and to be useful as back-end for higher-level language bindings. Both of these goals required the API to be very generic, resulting in a relatively baroque, hard-to-use API that lacks the bits that make it easy and fun to use from C. It provides the building blocks, but few tools to actually make it straightforward to build a house from them. On the other hand, the library is suitable for most use-cases (for example, it is OOM-safe making it suitable for writing lowest level system software), and is portable to operating systems like Windows or more exotic UNIXes.

GDBus is a much newer implementation. It has been written after considerable experience with using a GLib/GObject wrapper around libdbus. GDBus is implemented from scratch, shares no code with libdbus. Its design differs substantially from libdbus, it contains code generators to make it specifically easy to expose GObject objects on the bus, or talking to D-Bus objects as GObject objects. It translates D-Bus data types to GVariant, which is GLib's powerful data serialization format. If you are used to GLib-style programming then you'll feel right at home, hacking D-Bus services and clients with it is a lot simpler than using libdbus.

With sd-bus we now provide a third implementation, sharing no code with either libdbus or GDBus. For us, the focus was on providing kind of a middle ground between libdbus and GDBus: a low-level C library that actually is fun to work with, that has enough syntactic sugar to make it easy to write clients and services with, but on the other hand is more low-level than GDBus/GLib/GObject/GVariant. To be able to use it in systemd's various system-level components it needed to be OOM-safe and minimal. Another major point we wanted to focus on was supporting a kdbus back-end right from the beginning, in addition to the socket transport of the original D-Bus specification ("dbus1"). In fact, we wanted to design the library closer to kdbus' semantics than to dbus1's, wherever they are different, but still cover both transports nicely. In contrast to libdbus or GDBus portability is not a priority for sd-bus, instead we try to make the best of the Linux platform and expose specific Linux concepts wherever that is beneficial. Finally, performance was also an issue (though a secondary one): neither libdbus nor GDBus will win any speed records. We wanted to improve on performance (throughput and latency) -- but simplicity and correctness are more important to us. We believe the result of our work delivers our goals quite nicely: the library is fun to use, supports kdbus and sockets as back-end, is relatively minimal, and the performance is substantially better than both libdbus and GDBus.

To decide which of the three APIs to use for you C project, here are short guidelines:

  • If you hack on a GLib/GObject project, GDBus is definitely your first choice.

  • If portability to non-Linux kernels -- including Windows, Mac OS and other UNIXes -- is important to you, use either GDBus (which more or less means buying into GLib/GObject) or libdbus (which requires a lot of manual work).

  • Otherwise, sd-bus would be my recommended choice.

(I am not covering C++ specifically here, this is all about plain C only. But do note: if you use Qt, then QtDBus is the D-Bus API of choice, being a wrapper around libdbus.)

Introduction to D-Bus Concepts

To the uninitiated D-Bus usually appears to be a relatively opaque technology. It uses lots of concepts that appear unnecessarily complex and redundant on first sight. But actually, they make a lot of sense. Let's have a look:

  • A bus is where you look for IPC services. There are usually two kinds of buses: a system bus, of which there's exactly one per system, and which is where you'd look for system services; and a user bus, of which there's one per user, and which is where you'd look for user services, like the address book service or the mail program. (Originally, the user bus was actually a session bus -- so that you get multiple of them if you log in many times as the same user --, and on most setups it still is, but we are working on moving things to a true user bus, of which there is only one per user on a system, regardless how many times that user happens to log in.)

  • A service is a program that offers some IPC API on a bus. A service is identified by a name in reverse domain name notation. Thus, the org.freedesktop.NetworkManager service on the system bus is where NetworkManager's APIs are available and org.freedesktop.login1 on the system bus is where systemd-logind's APIs are exposed.

  • A client is a program that makes use of some IPC API on a bus. It talks to a service, monitors it and generally doesn't provide any services on its own. That said, lines are blurry and many services are also clients to other services. Frequently the term peer is used as a generalization to refer to either a service or a client.

  • An object path is an identifier for an object on a specific service. In a way this is comparable to a C pointer, since that's how you generally reference a C object, if you hack object-oriented programs in C. However, C pointers are just memory addresses, and passing memory addresses around to other processes would make little sense, since they of course refer to the address space of the service, the client couldn't make sense of it. Thus, the D-Bus designers came up with the object path concept, which is just a string that looks like a file system path. Example: /org/freedesktop/login1 is the object path of the 'manager' object of the org.freedesktop.login1 service (which, as we remember from above, is still the service systemd-logind exposes). Because object paths are structured like file system paths they can be neatly arranged in a tree, so that you end up with a venerable tree of objects. For example, you'll find all user sessions systemd-logind manages below the /org/freedesktop/login1/session sub-tree, for example called /org/freedesktop/login1/session/_7, /org/freedesktop/login1/session/_55 and so on. How services precisely label their objects and arrange them in a tree is completely up to the developers of the services.

  • Each object that is identified by an object path has one or more interfaces. An interface is a collection of signals, methods, and properties (collectively called members), that belong together. The concept of a D-Bus interface is actually pretty much identical to what you know from programming languages such as Java, which also know an interface concept. Which interfaces an object implements are up the developers of the service. Interface names are in reverse domain name notation, much like service names. (Yes, that's admittedly confusing, in particular since it's pretty common for simpler services to reuse the service name string also as an interface name.) A couple of interfaces are standardized though and you'll find them available on many of the objects offered by the various services. Specifically, those are org.freedesktop.DBus.Introspectable, org.freedesktop.DBus.Peer and org.freedesktop.DBus.Properties.

  • An interface can contain methods. The word "method" is more or less just a fancy word for "function", and is a term used pretty much the same way in object-oriented languages such as Java. The most common interaction between D-Bus peers is that one peer invokes one of these methods on another peer and gets a reply. A D-Bus method takes a couple of parameters, and returns others. The parameters are transmitted in a type-safe way, and the type information is included in the introspection data you can query from each object. Usually, method names (and the other member types) follow a CamelCase syntax. For example, systemd-logind exposes an ActivateSession method on the org.freedesktop.login1.Manager interface that is available on the /org/freedesktop/login1 object of the org.freedesktop.login1 service.

  • A signature describes a set of parameters a function (or signal, property, see below) takes or returns. It's a series of characters that each encode one parameter by its type. The set of types available is pretty powerful. For example, there are simpler types like s for string, or u for 32bit integer, but also complex types such as as for an array of strings or a(sb) for an array of structures consisting of one string and one boolean each. See the D-Bus specification for the full explanation of the type system. The ActivateSession method mentioned above takes a single string as parameter (the parameter signature is hence s), and returns nothing (the return signature is hence the empty string). Of course, the signature can get a lot more complex, see below for more examples.

  • A signal is another member type that the D-Bus object system knows. Much like a method it has a signature. However, they serve different purposes. While in a method call a single client issues a request on a single service, and that service sends back a response to the client, signals are for general notification of peers. Services send them out when they want to tell one or more peers on the bus that something happened or changed. In contrast to method calls and their replies they are hence usually broadcast over a bus. While method calls/replies are used for duplex one-to-one communication, signals are usually used for simplex one-to-many communication (note however that that's not a requirement, they can also be used one-to-one). Example: systemd-logind broadcasts a SessionNew signal from its manager object each time a user logs in, and a SessionRemoved signal every time a user logs out.

  • A property is the third member type that the D-Bus object system knows. It's similar to the property concept known by languages like C#. Properties also have a signature, and are more or less just variables that an object exposes, that can be read or altered by clients. Example: systemd-logind exposes a property Docked of the signature b (a boolean). It reflects whether systemd-logind thinks the system is currently in a docking station of some form (only applies to laptops …).

So much for the various concepts D-Bus knows. Of course, all these new concepts might be overwhelming. Let's look at them from a different perspective. I assume many of the readers have an understanding of today's web technology, specifically HTTP and REST. Let's try to compare the concept of a HTTP request with the concept of a D-Bus method call:

  • A HTTP request you issue on a specific network. It could be the Internet, or it could be your local LAN, or a company VPN. Depending on which network you issue the request on, you'll be able to talk to a different set of servers. This is not unlike the "bus" concept of D-Bus.

  • On the network you then pick a specific HTTP server to talk to. That's roughly comparable to picking a service on a specific bus.

  • On the HTTP server you then ask for a specific URL. The "path" part of the URL (by which I mean everything after the host name of the server, up to the last "/") is pretty similar to a D-Bus object path.

  • The "file" part of the URL (by which I mean everything after the last slash, following the path, as described above), then defines the actual call to make. In D-Bus this could be mapped to an interface and method name.

  • Finally, the parameters of a HTTP call follow the path after the "?", they map to the signature of the D-Bus call.

Of course, comparing an HTTP request to a D-Bus method call is a bit comparing apples and oranges. However, I think it's still useful to get a bit of a feeling of what maps to what.

From the shell

So much about the concepts and the gray theory behind them. Let's make this exciting, let's actually see how this feels on a real system.

Since a while systemd has included a tool busctl that is useful to explore and interact with the D-Bus object system. When invoked without parameters, it will show you a list of all peers connected to the system bus. (Use --user to see the peers of your user bus instead):

$ busctl
NAME                                       PID PROCESS         USER             CONNECTION    UNIT                      SESSION    DESCRIPTION
:1.1                                         1 systemd         root             :1.1          -                         -          -
:1.11                                      705 NetworkManager  root             :1.11         NetworkManager.service    -          -
:1.14                                      744 gdm             root             :1.14         gdm.service               -          -
:1.4                                       708 systemd-logind  root             :1.4          systemd-logind.service    -          -
:1.7200                                  17563 busctl          lennart          :1.7200       session-1.scope           1          -
[…]
org.freedesktop.NetworkManager             705 NetworkManager  root             :1.11         NetworkManager.service    -          -
org.freedesktop.login1                     708 systemd-logind  root             :1.4          systemd-logind.service    -          -
org.freedesktop.systemd1                     1 systemd         root             :1.1          -                         -          -
org.gnome.DisplayManager                   744 gdm             root             :1.14         gdm.service               -          -
[…]

(I have shortened the output a bit, to make keep things brief).

The list begins with a list of all peers currently connected to the bus. They are identified by peer names like ":1.11". These are called unique names in D-Bus nomenclature. Basically, every peer has a unique name, and they are assigned automatically when a peer connects to the bus. They are much like an IP address if you so will. You'll notice that a couple of peers are already connected, including our little busctl tool itself as well as a number of system services. The list then shows all actual services on the bus, identified by their service names (as discussed above; to discern them from the unique names these are also called well-known names). In many ways well-known names are similar to DNS host names, i.e. they are a friendlier way to reference a peer, but on the lower level they just map to an IP address, or in this comparison the unique name. Much like you can connect to a host on the Internet by either its host name or its IP address, you can also connect to a bus peer either by its unique or its well-known name. (Note that each peer can have as many well-known names as it likes, much like an IP address can have multiple host names referring to it).

OK, that's already kinda cool. Try it for yourself, on your local machine (all you need is a recent, systemd-based distribution).

Let's now go the next step. Let's see which objects the org.freedesktop.login1 service actually offers:

$ busctl tree org.freedesktop.login1
└─/org/freedesktop/login1
  ├─/org/freedesktop/login1/seat
  │ ├─/org/freedesktop/login1/seat/seat0
  │ └─/org/freedesktop/login1/seat/self
  ├─/org/freedesktop/login1/session
  │ ├─/org/freedesktop/login1/session/_31
  │ └─/org/freedesktop/login1/session/self
  └─/org/freedesktop/login1/user
    ├─/org/freedesktop/login1/user/_1000
    └─/org/freedesktop/login1/user/self

Pretty, isn't it? What's actually even nicer, and which the output does not show is that there's full command line completion available: as you press TAB the shell will auto-complete the service names for you. It's a real pleasure to explore your D-Bus objects that way!

The output shows some objects that you might recognize from the explanations above. Now, let's go further. Let's see what interfaces, methods, signals and properties one of these objects actually exposes:

$ busctl introspect org.freedesktop.login1 /org/freedesktop/login1/session/_31
NAME                                TYPE      SIGNATURE RESULT/VALUE                             FLAGS
org.freedesktop.DBus.Introspectable interface -         -                                        -
.Introspect                         method    -         s                                        -
org.freedesktop.DBus.Peer           interface -         -                                        -
.GetMachineId                       method    -         s                                        -
.Ping                               method    -         -                                        -
org.freedesktop.DBus.Properties     interface -         -                                        -
.Get                                method    ss        v                                        -
.GetAll                             method    s         a{sv}                                    -
.Set                                method    ssv       -                                        -
.PropertiesChanged                  signal    sa{sv}as  -                                        -
org.freedesktop.login1.Session      interface -         -                                        -
.Activate                           method    -         -                                        -
.Kill                               method    si        -                                        -
.Lock                               method    -         -                                        -
.PauseDeviceComplete                method    uu        -                                        -
.ReleaseControl                     method    -         -                                        -
.ReleaseDevice                      method    uu        -                                        -
.SetIdleHint                        method    b         -                                        -
.TakeControl                        method    b         -                                        -
.TakeDevice                         method    uu        hb                                       -
.Terminate                          method    -         -                                        -
.Unlock                             method    -         -                                        -
.Active                             property  b         true                                     emits-change
.Audit                              property  u         1                                        const
.Class                              property  s         "user"                                   const
.Desktop                            property  s         ""                                       const
.Display                            property  s         ""                                       const
.Id                                 property  s         "1"                                      const
.IdleHint                           property  b         true                                     emits-change
.IdleSinceHint                      property  t         1434494624206001                         emits-change
.IdleSinceHintMonotonic             property  t         0                                        emits-change
.Leader                             property  u         762                                      const
.Name                               property  s         "lennart"                                const
.Remote                             property  b         false                                    const
.RemoteHost                         property  s         ""                                       const
.RemoteUser                         property  s         ""                                       const
.Scope                              property  s         "session-1.scope"                        const
.Seat                               property  (so)      "seat0" "/org/freedesktop/login1/seat... const
.Service                            property  s         "gdm-autologin"                          const
.State                              property  s         "active"                                 -
.TTY                                property  s         "/dev/tty1"                              const
.Timestamp                          property  t         1434494630344367                         const
.TimestampMonotonic                 property  t         34814579                                 const
.Type                               property  s         "x11"                                    const
.User                               property  (uo)      1000 "/org/freedesktop/login1/user/_1... const
.VTNr                               property  u         1                                        const
.Lock                               signal    -         -                                        -
.PauseDevice                        signal    uus       -                                        -
.ResumeDevice                       signal    uuh       -                                        -
.Unlock                             signal    -         -                                        -

As before, the busctl command supports command line completion, hence both the service name and the object path used are easily put together on the shell simply by pressing TAB. The output shows the methods, properties, signals of one of the session objects that are currently made available by systemd-logind. There's a section for each interface the object knows. The second column tells you what kind of member is shown in the line. The third column shows the signature of the member. In case of method calls that's the input parameters, the fourth column shows what is returned. For properties, the fourth column encodes the current value of them.

So far, we just explored. Let's take the next step now: let's become active - let's call a method:

# busctl call org.freedesktop.login1 /org/freedesktop/login1/session/_31 org.freedesktop.login1.Session Lock

I don't think I need to mention this anymore, but anyway: again there's full command line completion available. The third argument is the interface name, the fourth the method name, both can be easily completed by pressing TAB. In this case we picked the Lock method, which activates the screen lock for the specific session. And yupp, the instant I pressed enter on this line my screen lock turned on (this only works on DEs that correctly hook into systemd-logind for this to work. GNOME works fine, and KDE should work too).

The Lock method call we picked is very simple, as it takes no parameters and returns none. Of course, it can get more complicated for some calls. Here's another example, this time using one of systemd's own bus calls, to start an arbitrary system unit:

# busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager StartUnit ss "cups.service" "replace"
o "/org/freedesktop/systemd1/job/42684"

This call takes two strings as input parameters, as we denote in the signature string that follows the method name (as usual, command line completion helps you getting this right). Following the signature the next two parameters are simply the two strings to pass. The specified signature string hence indicates what comes next. systemd's StartUnit method call takes the unit name to start as first parameter, and the mode in which to start it as second. The call returned a single object path value. It is encoded the same way as the input parameter: a signature (just o for the object path) followed by the actual value.

Of course, some method call parameters can get a ton more complex, but with busctl it's relatively easy to encode them all. See the man page for details.

busctl knows a number of other operations. For example, you can use it to monitor D-Bus traffic as it happens (including generating a .cap file for use with Wireshark!) or you can set or get specific properties. However, this blog story was supposed to be about sd-bus, not busctl, hence let's cut this short here, and let me direct you to the man page in case you want to know more about the tool.

busctl (like the rest of system) is implemented using the sd-bus API. Thus it exposes many of the features of sd-bus itself. For example, you can use to connect to remote or container buses. It understands both kdbus and classic D-Bus, and more!

sd-bus

But enough! Let's get back on topic, let's talk about sd-bus itself.

The sd-bus set of APIs is mostly contained in the header file sd-bus.h.

Here's a random selection of features of the library, that make it compare well with the other implementations available.

  • Supports both kdbus and dbus1 as back-end.

  • Has high-level support for connecting to remote buses via ssh, and to buses of local OS containers.

  • Powerful credential model, to implement authentication of clients in services. Currently 34 individual fields are supported, from the PID of the client to the cgroup or capability sets.

  • Support for tracking the life-cycle of peers in order to release local objects automatically when all peers referencing them disconnected.

  • The client builds an efficient decision tree to determine which handlers to deliver an incoming bus message to.

  • Automatically translates D-Bus errors into UNIX style errors and back (this is lossy though), to ensure best integration of D-Bus into low-level Linux programs.

  • Powerful but lightweight object model for exposing local objects on the bus. Automatically generates introspection as necessary.

The API is currently not fully documented, but we are working on completing the set of manual pages. For details see all pages starting with sd_bus_.

Invoking a Method, from C, with sd-bus

So much about the library in general. Here's an example for connecting to the bus and issuing a method call:

#include <stdio.h>
#include <stdlib.h>
#include <systemd/sd-bus.h>

int main(int argc, char *argv[]) {
        sd_bus_error error = SD_BUS_ERROR_NULL;
        sd_bus_message *m = NULL;
        sd_bus *bus = NULL;
        const char *path;
        int r;

        /* Connect to the system bus */
        r = sd_bus_open_system(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect to system bus: %s\n", strerror(-r));
                goto finish;
        }

        /* Issue the method call and store the respons message in m */
        r = sd_bus_call_method(bus,
                               "org.freedesktop.systemd1",           /* service to contact */
                               "/org/freedesktop/systemd1",          /* object path */
                               "org.freedesktop.systemd1.Manager",   /* interface name */
                               "StartUnit",                          /* method name */
                               &error,                               /* object to return error in */
                               &m,                                   /* return message on success */
                               "ss",                                 /* input signature */
                               "cups.service",                       /* first argument */
                               "replace");                           /* second argument */
        if (r < 0) {
                fprintf(stderr, "Failed to issue method call: %s\n", error.message);
                goto finish;
        }

        /* Parse the response message */
        r = sd_bus_message_read(m, "o", &path);
        if (r < 0) {
                fprintf(stderr, "Failed to parse response message: %s\n", strerror(-r));
                goto finish;
        }

        printf("Queued service job as %s.\n", path);

finish:
        sd_bus_error_free(&error);
        sd_bus_message_unref(m);
        sd_bus_unref(bus);

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

Save this example as bus-client.c, then build it with:

$ gcc bus-client.c -o bus-client `pkg-config --cflags --libs libsystemd`

This will generate a binary bus-client you can now run. Make sure to run it as root though, since access to the StartUnit method is privileged:

# ./bus-client
Queued service job as /org/freedesktop/systemd1/job/3586.

And that's it already, our first example. It showed how we invoked a method call on the bus. The actual function call of the method is very close to the busctl command line we used before. I hope the code excerpt needs little further explanation. It's supposed to give you a taste how to write D-Bus clients with sd-bus. For more more information please have a look at the header file, the man page or even the sd-bus sources.

Implementing a Service, in C, with sd-bus

Of course, just calling a single method is a rather simplistic example. Let's have a look on how to write a bus service. We'll write a small calculator service, that exposes a single object, which implements an interface that exposes two methods: one to multiply two 64bit signed integers, and one to divide one 64bit signed integer by another.

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <systemd/sd-bus.h>

static int method_multiply(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        int64_t x, y;
        int r;

        /* Read the parameters */
        r = sd_bus_message_read(m, "xx", &x, &y);
        if (r < 0) {
                fprintf(stderr, "Failed to parse parameters: %s\n", strerror(-r));
                return r;
        }

        /* Reply with the response */
        return sd_bus_reply_method_return(m, "x", x * y);
}

static int method_divide(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        int64_t x, y;
        int r;

        /* Read the parameters */
        r = sd_bus_message_read(m, "xx", &x, &y);
        if (r < 0) {
                fprintf(stderr, "Failed to parse parameters: %s\n", strerror(-r));
                return r;
        }

        /* Return an error on division by zero */
        if (y == 0) {
                sd_bus_error_set_const(ret_error, "net.poettering.DivisionByZero", "Sorry, can't allow division by zero.");
                return -EINVAL;
        }

        return sd_bus_reply_method_return(m, "x", x / y);
}

/* The vtable of our little object, implements the net.poettering.Calculator interface */
static const sd_bus_vtable calculator_vtable[] = {
        SD_BUS_VTABLE_START(0),
        SD_BUS_METHOD("Multiply", "xx", "x", method_multiply, SD_BUS_VTABLE_UNPRIVILEGED),
        SD_BUS_METHOD("Divide",   "xx", "x", method_divide,   SD_BUS_VTABLE_UNPRIVILEGED),
        SD_BUS_VTABLE_END
};

int main(int argc, char *argv[]) {
        sd_bus_slot *slot = NULL;
        sd_bus *bus = NULL;
        int r;

        /* Connect to the user bus this time */
        r = sd_bus_open_user(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect to system bus: %s\n", strerror(-r));
                goto finish;
        }

        /* Install the object */
        r = sd_bus_add_object_vtable(bus,
                                     &slot,
                                     "/net/poettering/Calculator",  /* object path */
                                     "net.poettering.Calculator",   /* interface name */
                                     calculator_vtable,
                                     NULL);
        if (r < 0) {
                fprintf(stderr, "Failed to issue method call: %s\n", strerror(-r));
                goto finish;
        }

        /* Take a well-known service name so that clients can find us */
        r = sd_bus_request_name(bus, "net.poettering.Calculator", 0);
        if (r < 0) {
                fprintf(stderr, "Failed to acquire service name: %s\n", strerror(-r));
                goto finish;
        }

        for (;;) {
                /* Process requests */
                r = sd_bus_process(bus, NULL);
                if (r < 0) {
                        fprintf(stderr, "Failed to process bus: %s\n", strerror(-r));
                        goto finish;
                }
                if (r > 0) /* we processed a request, try to process another one, right-away */
                        continue;

                /* Wait for the next request to process */
                r = sd_bus_wait(bus, (uint64_t) -1);
                if (r < 0) {
                        fprintf(stderr, "Failed to wait on bus: %s\n", strerror(-r));
                        goto finish;
                }
        }

finish:
        sd_bus_slot_unref(slot);
        sd_bus_unref(bus);

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

Save this example as bus-service.c, then build it with:

$ gcc bus-service.c -o bus-service `pkg-config --cflags --libs libsystemd`

Now, let's run it:

$ ./bus-service

In another terminal, let's try to talk to it. Note that this service is now on the user bus, not on the system bus as before. We do this for simplicity reasons: on the system bus access to services is tightly controlled so unprivileged clients cannot request privileged operations. On the user bus however things are simpler: as only processes of the user owning the bus can connect no further policy enforcement will complicate this example. Because the service is on the user bus, we have to pass the --user switch on the busctl command line. Let's start with looking at the service's object tree.

$ busctl --user tree net.poettering.Calculator
└─/net/poettering/Calculator

As we can see, there's only a single object on the service, which is not surprising, given that our code above only registered one. Let's see the interfaces and the members this object exposes:

$ busctl --user introspect net.poettering.Calculator /net/poettering/Calculator
NAME                                TYPE      SIGNATURE RESULT/VALUE FLAGS
net.poettering.Calculator           interface -         -            -
.Divide                             method    xx        x            -
.Multiply                           method    xx        x            -
org.freedesktop.DBus.Introspectable interface -         -            -
.Introspect                         method    -         s            -
org.freedesktop.DBus.Peer           interface -         -            -
.GetMachineId                       method    -         s            -
.Ping                               method    -         -            -
org.freedesktop.DBus.Properties     interface -         -            -
.Get                                method    ss        v            -
.GetAll                             method    s         a{sv}        -
.Set                                method    ssv       -            -
.PropertiesChanged                  signal    sa{sv}as  -            -

The sd-bus library automatically added a couple of generic interfaces, as mentioned above. But the first interface we see is actually the one we added! It shows our two methods, and both take "xx" (two 64bit signed integers) as input parameters, and return one "x". Great! But does it work?

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Multiply xx 5 7
x 35

Woohoo! We passed the two integers 5 and 7, and the service actually multiplied them for us and returned a single integer 35! Let's try the other method:

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Divide xx 99 17
x 5

Oh, wow! It can even do integer division! Fantastic! But let's trick it into dividing by zero:

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Divide xx 43 0
Sorry, can't allow division by zero.

Nice! It detected this nicely and returned a clean error about it. If you look in the source code example above you'll see how precisely we generated the error.

And that's really all I have for today. Of course, the examples I showed are short, and I don't get into detail here on what precisely each line does. However, this is supposed to be a short introduction into D-Bus and sd-bus, and it's already way too long for that …

I hope this blog story was useful to you. If you are interested in using sd-bus for your own programs, I hope this gets you started. If you have further questions, check the (incomplete) man pages, and inquire us on IRC or the systemd mailing list. If you need more examples, have a look at the systemd source tree, all of systemd's many bus services use sd-bus extensively.


systemd For Administrators, Part XXI

Container Integration

Since a while containers have been one of the hot topics on Linux. Container managers such as libvirt-lxc, LXC or Docker are widely known and used these days. In this blog story I want to shed some light on systemd's integration points with container managers, to allow seamless management of services across container boundaries.

We'll focus on OS containers here, i.e. the case where an init system runs inside the container, and the container hence in most ways appears like an independent system of its own. Much of what I describe here is available on pretty much any container manager that implements the logic described here, including libvirt-lxc. However, to make things easy we'll focus on systemd-nspawn, the mini-container manager that is shipped with systemd itself. systemd-nspawn uses the same kernel interfaces as the other container managers, however is less flexible as it is designed to be a container manager that is as simple to use as possible and "just works", rather than trying to be a generic tool you can configure in every low-level detail. We use systemd-nspawn extensively when developing systemd.

Anyway, so let's get started with our run-through. Let's start by creating a Fedora container tree in a subdirectory:

# yum -y --releasever=20 --nogpg --installroot=/srv/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal

This downloads a minimal Fedora system and installs it in in /srv/mycontainer. This command line is Fedora-specific, but most distributions provide similar functionality in one way or another. The examples section in the systemd-nspawn(1) man page contains a list of the various command lines for other distribution.

We now have the new container installed, let's set an initial root password:

# systemd-nspawn -D /srv/mycontainer
Spawning container mycontainer on /srv/mycontainer
Press ^] three times within 1s to kill container.
-bash-4.2# passwd
Changing password for user root.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
-bash-4.2# ^D
Container mycontainer exited successfully.
#

We use systemd-nspawn here to get a shell in the container, and then use passwd to set the root password. After that the initial setup is done, hence let's boot it up and log in as root with our new password:

$ systemd-nspawn -D /srv/mycontainer -b
Spawning container mycontainer on /srv/mycontainer.
Press ^] three times within 1s to kill container.
systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'systemd-nspawn'.

Welcome to Fedora 20 (Heisenbug)!

[  OK  ] Reached target Remote File Systems.
[  OK  ] Created slice Root Slice.
[  OK  ] Created slice User and Session Slice.
[  OK  ] Created slice System Slice.
[  OK  ] Created slice system-getty.slice.
[  OK  ] Reached target Slices.
[  OK  ] Listening on Delayed Shutdown Socket.
[  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
[  OK  ] Listening on Journal Socket.
         Starting Journal Service...
[  OK  ] Started Journal Service.
[  OK  ] Reached target Paths.
         Mounting Debug File System...
         Mounting Configuration File System...
         Mounting FUSE Control File System...
         Starting Create static device nodes in /dev...
         Mounting POSIX Message Queue File System...
         Mounting Huge Pages File System...
[  OK  ] Reached target Encrypted Volumes.
[  OK  ] Reached target Swap.
         Mounting Temporary Directory...
         Starting Load/Save Random Seed...
[  OK  ] Mounted Configuration File System.
[  OK  ] Mounted FUSE Control File System.
[  OK  ] Mounted Temporary Directory.
[  OK  ] Mounted POSIX Message Queue File System.
[  OK  ] Mounted Debug File System.
[  OK  ] Mounted Huge Pages File System.
[  OK  ] Started Load/Save Random Seed.
[  OK  ] Started Create static device nodes in /dev.
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Reached target Local File Systems.
         Starting Trigger Flushing of Journal to Persistent Storage...
         Starting Recreate Volatile Files and Directories...
[  OK  ] Started Recreate Volatile Files and Directories.
         Starting Update UTMP about System Reboot/Shutdown...
[  OK  ] Started Trigger Flushing of Journal to Persistent Storage.
[  OK  ] Started Update UTMP about System Reboot/Shutdown.
[  OK  ] Reached target System Initialization.
[  OK  ] Reached target Timers.
[  OK  ] Listening on D-Bus System Message Bus Socket.
[  OK  ] Reached target Sockets.
[  OK  ] Reached target Basic System.
         Starting Login Service...
         Starting Permit User Sessions...
         Starting D-Bus System Message Bus...
[  OK  ] Started D-Bus System Message Bus.
         Starting Cleanup of Temporary Directories...
[  OK  ] Started Cleanup of Temporary Directories.
[  OK  ] Started Permit User Sessions.
         Starting Console Getty...
[  OK  ] Started Console Getty.
[  OK  ] Reached target Login Prompts.
[  OK  ] Started Login Service.
[  OK  ] Reached target Multi-User System.
[  OK  ] Reached target Graphical Interface.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (console)

mycontainer login: root
Password:
-bash-4.2#

Now we have everything ready to play around with the container integration of systemd. Let's have a look at the first tool, machinectl. When run without parameters it shows a list of all locally running containers:

$ machinectl
MACHINE                          CONTAINER SERVICE
mycontainer                      container nspawn

1 machines listed.

The "status" subcommand shows details about the container:

$ machinectl status mycontainer
mycontainer:
       Since: Mi 2014-11-12 16:47:19 CET; 51s ago
      Leader: 5374 (systemd)
     Service: nspawn; class container
        Root: /srv/mycontainer
     Address: 192.168.178.38
              10.36.6.162
              fd00::523f:56ff:fe00:4994
              fe80::523f:56ff:fe00:4994
          OS: Fedora 20 (Heisenbug)
        Unit: machine-mycontainer.scope
              ├─5374 /usr/lib/systemd/systemd
              └─system.slice
                ├─dbus.service
                │ └─5414 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-act...
                ├─systemd-journald.service
                │ └─5383 /usr/lib/systemd/systemd-journald
                ├─systemd-logind.service
                │ └─5411 /usr/lib/systemd/systemd-logind
                └─console-getty.service
                  └─5416 /sbin/agetty --noclear -s console 115200 38400 9600

With this we see some interesting information about the container, including its control group tree (with processes), IP addresses and root directory.

The "login" subcommand gets us a new login shell in the container:

# machinectl login mycontainer
Connected to container mycontainer. Press ^] three times within 1s to exit session.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (pts/0)

mycontainer login:

The "reboot" subcommand reboots the container:

# machinectl reboot mycontainer

The "poweroff" subcommand powers the container off:

# machinectl poweroff mycontainer

So much about the machinectl tool. The tool knows a couple of more commands, please check the man page for details. Note again that even though we use systemd-nspawn as container manager here the concepts apply to any container manager that implements the logic described here, including libvirt-lxc for example.

machinectl is not the only tool that is useful in conjunction with containers. Many of systemd's own tools have been updated to explicitly support containers too! Let's try this (after starting the container up again first, repeating the systemd-nspawn command from above.):

# hostnamectl -M mycontainer set-hostname "wuff"

This uses hostnamectl(1) on the local container and sets its hostname.

Similar, many other tools have been updated for connecting to local containers. Here's systemctl(1)'s -M switch in action:

# systemctl -M mycontainer
UNIT                                 LOAD   ACTIVE SUB       DESCRIPTION
-.mount                              loaded active mounted   /
dev-hugepages.mount                  loaded active mounted   Huge Pages File System
dev-mqueue.mount                     loaded active mounted   POSIX Message Queue File System
proc-sys-kernel-random-boot_id.mount loaded active mounted   /proc/sys/kernel/random/boot_id
[...]
time-sync.target                     loaded active active    System Time Synchronized
timers.target                        loaded active active    Timers
systemd-tmpfiles-clean.timer         loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

49 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

As expected, this shows the list of active units on the specified container, not the host. (Output is shortened here, the blog story is already getting too long).

Let's use this to restart a service within our container:

# systemctl -M mycontainer restart systemd-resolved.service

systemctl has more container support though than just the -M switch. With the -r switch it shows the units running on the host, plus all units of all local, running containers:

# systemctl -r
UNIT                                        LOAD   ACTIVE SUB       DESCRIPTION
boot.automount                              loaded active waiting   EFI System Partition Automount
proc-sys-fs-binfmt_misc.automount           loaded active waiting   Arbitrary Executable File Formats File Syst
sys-devices-pci0000:00-0000:00:02.0-drm-card0-card0\x2dLVDS\x2d1-intel_backlight.device loaded active plugged   /sys/devices/pci0000:00/0000:00:02.0/drm/ca
[...]
timers.target                                                                                       loaded active active    Timers
mandb.timer                                                                                         loaded active waiting   Daily man-db cache update
systemd-tmpfiles-clean.timer                                                                        loaded active waiting   Daily Cleanup of Temporary Directories
mycontainer:-.mount                                                                                 loaded active mounted   /
mycontainer:dev-hugepages.mount                                                                     loaded active mounted   Huge Pages File System
mycontainer:dev-mqueue.mount                                                                        loaded active mounted   POSIX Message Queue File System
[...]
mycontainer:time-sync.target                                                                        loaded active active    System Time Synchronized
mycontainer:timers.target                                                                           loaded active active    Timers
mycontainer:systemd-tmpfiles-clean.timer                                                            loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

191 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

We can see here first the units of the host, then followed by the units of the one container we have currently running. The units of the containers are prefixed with the container name, and a colon (":"). (The output is shortened again for brevity's sake.)

The list-machines subcommand of systemctl shows a list of all running containers, inquiring the system managers within the containers about system state and health. More specifically it shows if containers are properly booted up, or if there are any failed services:

# systemctl list-machines
NAME         STATE   FAILED JOBS
delta (host) running      0    0
mycontainer  running      0    0
miau         degraded     1    0
waldi        running      0    0

4 machines listed.

To make things more interesting we have started two more containers in parallel. One of them has a failed service, which results in the machine state to be degraded.

Let's have a look at journalctl(1)'s container support. It too supports -M to show the logs of a specific container:

# journalctl -M mycontainer -n 8
Nov 12 16:51:13 wuff systemd[1]: Starting Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Reached target Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Starting Update UTMP about System Runlevel Changes...
Nov 12 16:51:13 wuff systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Nov 12 16:51:13 wuff systemd[1]: Started Update UTMP about System Runlevel Changes.
Nov 12 16:51:13 wuff systemd[1]: Startup finished in 399ms.
Nov 12 16:51:13 wuff sshd[35]: Server listening on 0.0.0.0 port 24.
Nov 12 16:51:13 wuff sshd[35]: Server listening on :: port 24.

However, it also supports -m to show the combined log stream of the host and all local containers:

# journalctl -m -e

(Let's skip the output here completely, I figure you can extrapolate how this looks.)

But it's not only systemd's own tools that understand container support these days, procps sports support for it, too:

# ps -eo pid,machine,args
 PID MACHINE                         COMMAND
   1 -                               /usr/lib/systemd/systemd --switched-root --system --deserialize 20
[...]
2915 -                               emacs contents/projects/containers.md
3403 -                               [kworker/u16:7]
3415 -                               [kworker/u16:9]
4501 -                               /usr/libexec/nm-vpnc-service
4519 -                               /usr/sbin/vpnc --non-inter --no-detach --pid-file /var/run/NetworkManager/nm-vpnc-bfda8671-f025-4812-a66b-362eb12e7f13.pid -
4749 -                               /usr/libexec/dconf-service
4980 -                               /usr/lib/systemd/systemd-resolved
5006 -                               /usr/lib64/firefox/firefox
5168 -                               [kworker/u16:0]
5192 -                               [kworker/u16:4]
5193 -                               [kworker/u16:5]
5497 -                               [kworker/u16:1]
5591 -                               [kworker/u16:8]
5711 -                               sudo -s
5715 -                               /bin/bash
5749 -                               /home/lennart/projects/systemd/systemd-nspawn -D /srv/mycontainer -b
5750 mycontainer                     /usr/lib/systemd/systemd
5799 mycontainer                     /usr/lib/systemd/systemd-journald
5862 mycontainer                     /usr/lib/systemd/systemd-logind
5863 mycontainer                     /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
5868 mycontainer                     /sbin/agetty --noclear --keep-baud console 115200 38400 9600 vt102
5871 mycontainer                     /usr/sbin/sshd -D
6527 mycontainer                     /usr/lib/systemd/systemd-resolved
[...]

This shows a process list (shortened). The second column shows the container a process belongs to. All processes shown with "-" belong to the host itself.

But it doesn't stop there. The new "sd-bus" D-Bus client library we have been preparing in the systemd/kdbus context knows containers too. While you use sd_bus_open_system() to connect to your local host's system bus sd_bus_open_system_container() may be used to connect to the system bus of any local container, so that you can execute bus methods on it.

sd-login.h and machined's bus interface provide a number of APIs to add container support to other programs too. They support enumeration of containers as well as retrieving the machine name from a PID and similar.

systemd-networkd also has support for containers. When run inside a container it will by default run a DHCP client and IPv4LL on any veth network interface named host0 (this interface is special under the logic described here). When run on the host networkd will by default provide a DHCP server and IPv4LL on veth network interface named ve- followed by a container name.

Let's have a look at one last facet of systemd's container integration: the hook-up with the name service switch. Recent systemd versions contain a new NSS module nss-mymachines that make the names of all local containers resolvable via gethostbyname() and getaddrinfo(). This only applies to containers that run within their own network namespace. With the systemd-nspawn command shown above the the container shares the network configuration with the host however; hence let's restart the container, this time with a virtual veth network link between host and container:

# machinectl poweroff mycontainer
# systemd-nspawn -D /srv/mycontainer --network-veth -b

Now, (assuming that networkd is used in the container and outside) we can already ping the container using its name, due to the simple magic of nss-mymachines:

# ping mycontainer
PING mycontainer (10.0.0.2) 56(84) bytes of data.
64 bytes from mycontainer (10.0.0.2): icmp_seq=1 ttl=64 time=0.124 ms
64 bytes from mycontainer (10.0.0.2): icmp_seq=2 ttl=64 time=0.078 ms

Of course, name resolution not only works with ping, it works with all other tools that use libc gethostbyname() or getaddrinfo() too, among them venerable ssh.

And this is pretty much all I want to cover for now. We briefly touched a variety of integration points, and there's a lot more still if you look closely. We are working on even more container integration all the time, so expect more new features in this area with every systemd release.

Note that the whole machine concept is actually not limited to containers, but covers VMs too to a certain degree. However, the integration is not as close, as access to a VM's internals is not as easy as for containers, as it usually requires a network transport instead of allowing direct syscall access.

Anyway, I hope this is useful. For further details, please have a look at the linked man pages and other documentation.


Revisiting How We Put Together Linux Systems

In a previous blog story I discussed Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems, I now want to take the opportunity to explain a bit where we want to take this with systemd in the longer run, and what we want to build out of it. This is going to be a longer story, so better grab a cold bottle of Club Mate before you start reading.

Traditional Linux distributions are built around packaging systems like RPM or dpkg, and an organization model where upstream developers and downstream packagers are relatively clearly separated: an upstream developer writes code, and puts it somewhere online, in a tarball. A packager than grabs it and turns it into RPMs/DEBs. The user then grabs these RPMs/DEBs and installs them locally on the system. For a variety of uses this is a fantastic scheme: users have a large selection of readily packaged software available, in mostly uniform packaging, from a single source they can trust. In this scheme the distribution vets all software it packages, and as long as the user trusts the distribution all should be good. The distribution takes the responsibility of ensuring the software is not malicious, of timely fixing security problems and helping the user if something is wrong.

Upstream Projects

However, this scheme also has a number of problems, and doesn't fit many use-cases of our software particularly well. Let's have a look at the problems of this scheme for many upstreams:

  • Upstream software vendors are fully dependent on downstream distributions to package their stuff. It's the downstream distribution that decides on schedules, packaging details, and how to handle support. Often upstream vendors want much faster release cycles then the downstream distributions follow.

  • Realistic testing is extremely unreliable and next to impossible. Since the end-user can run a variety of different package versions together, and expects the software he runs to just work on any combination, the test matrix explodes. If upstream tests its version on distribution X release Y, then there's no guarantee that that's the precise combination of packages that the end user will eventually run. In fact, it is very unlikely that the end user will, since most distributions probably updated a number of libraries the package relies on by the time the package ends up being made available to the user. The fact that each package can be individually updated by the user, and each user can combine library versions, plug-ins and executables relatively freely, results in a high risk of something going wrong.

  • Since there are so many different distributions in so many different versions around, if upstream tries to build and test software for them it needs to do so for a large number of distributions, which is a massive effort.

  • The distributions are actually quite different in many ways. In fact, they are different in a lot of the most basic functionality. For example, the path where to put x86-64 libraries is different on Fedora and Debian derived systems..

  • Developing software for a number of distributions and versions is hard: if you want to do it, you need to actually install them, each one of them, manually, and then build your software for each.

  • Since most downstream distributions have strict licensing and trademark requirements (and rightly so), any kind of closed source software (or otherwise non-free) does not fit into this scheme at all.

This all together makes it really hard for many upstreams to work nicely with the current way how Linux works. Often they try to improve the situation for them, for example by bundling libraries, to make their test and build matrices smaller.

System Vendors

The toolbox approach of classic Linux distributions is fantastic for people who want to put together their individual system, nicely adjusted to exactly what they need. However, this is not really how many of today's Linux systems are built, installed or updated. If you build any kind of embedded device, a server system, or even user systems, you frequently do your work based on complete system images, that are linearly versioned. You build these images somewhere, and then you replicate them atomically to a larger number of systems. On these systems, you don't install or remove packages, you get a defined set of files, and besides installing or updating the system there are no ways how to change the set of tools you get.

The current Linux distributions are not particularly good at providing for this major use-case of Linux. Their strict focus on individual packages as well as package managers as end-user install and update tool is incompatible with what many system vendors want.

Users

The classic Linux distribution scheme is frequently not what end users want, either. Many users are used to app markets like Android, Windows or iOS/Mac have. Markets are a platform that doesn't package, build or maintain software like distributions do, but simply allows users to quickly find and download the software they need, with the app vendor responsible for keeping the app updated, secured, and all that on the vendor's release cycle. Users tend to be impatient. They want their software quickly, and the fine distinction between trusting a single distribution or a myriad of app developers individually is usually not important for them. The companies behind the marketplaces usually try to improve this trust problem by providing sand-boxing technologies: as a replacement for the distribution that audits, vets, builds and packages the software and thus allows users to trust it to a certain level, these vendors try to find technical solutions to ensure that the software they offer for download can't be malicious.

Existing Approaches To Fix These Problems

Now, all the issues pointed out above are not new, and there are sometimes quite successful attempts to do something about it. Ubuntu Apps, Docker, Software Collections, ChromeOS, CoreOS all fix part of this problem set, usually with a strict focus on one facet of Linux systems. For example, Ubuntu Apps focus strictly on end user (desktop) applications, and don't care about how we built/update/install the OS itself, or containers. Docker OTOH focuses on containers only, and doesn't care about end-user apps. Software Collections tries to focus on the development environments. ChromeOS focuses on the OS itself, but only for end-user devices. CoreOS also focuses on the OS, but only for server systems.

The approaches they find are usually good at specific things, and use a variety of different technologies, on different layers. However, none of these projects tried to fix this problems in a generic way, for all uses, right in the core components of the OS itself.

Linux has come to tremendous successes because its kernel is so generic: you can build supercomputers and tiny embedded devices out of it. It's time we come up with a basic, reusable scheme how to solve the problem set described above, that is equally generic.

What We Want

The systemd cabal (Kay Sievers, Harald Hoyer, Daniel Mack, Tom Gundersen, David Herrmann, and yours truly) recently met in Berlin about all these things, and tried to come up with a scheme that is somewhat simple, but tries to solve the issues generically, for all use-cases, as part of the systemd project. All that in a way that is somewhat compatible with the current scheme of distributions, to allow a slow, gradual adoption. Also, and that's something one cannot stress enough: the toolbox scheme of classic Linux distributions is actually a good one, and for many cases the right one. However, we need to make sure we make distributions relevant again for all use-cases, not just those of highly individualized systems.

Anyway, so let's summarize what we are trying to do:

  • We want an efficient way that allows vendors to package their software (regardless if just an app, or the whole OS) directly for the end user, and know the precise combination of libraries and packages it will operate with.

  • We want to allow end users and administrators to install these packages on their systems, regardless which distribution they have installed on it.

  • We want a unified solution that ultimately can cover updates for full systems, OS containers, end user apps, programming ABIs, and more. These updates shall be double-buffered, (at least). This is an absolute necessity if we want to prepare the ground for operating systems that manage themselves, that can update safely without administrator involvement.

  • We want our images to be trustable (i.e. signed). In fact we want a fully trustable OS, with images that can be verified by a full trust chain from the firmware (EFI SecureBoot!), through the boot loader, through the kernel, and initrd. Cryptographically secure verification of the code we execute is relevant on the desktop (like ChromeOS does), but also for apps, for embedded devices and even on servers (in a post-Snowden world, in particular).

What We Propose

So much about the set of problems, and what we are trying to do. So, now, let's discuss the technical bits we came up with:

The scheme we propose is built around the variety of concepts of btrfs and Linux file system name-spacing. btrfs at this point already has a large number of features that fit neatly in our concept, and the maintainers are busy working on a couple of others we want to eventually make use of.

As first part of our proposal we make heavy use of btrfs sub-volumes and introduce a clear naming scheme for them. We name snapshots like this:

  • usr:<vendorid>:<architecture>:<version> -- This refers to a full vendor operating system tree. It's basically a /usr tree (and no other directories), in a specific version, with everything you need to boot it up inside it. The <vendorid> field is replaced by some vendor identifier, maybe a scheme like org.fedoraproject.FedoraWorkstation. The <architecture> field specifies a CPU architecture the OS is designed for, for example x86-64. The <version> field specifies a specific OS version, for example 23.4. An example sub-volume name could hence look like this: usr:org.fedoraproject.FedoraWorkstation:x86_64:23.4

  • root:<name>:<vendorid>:<architecture> -- This refers to an instance of an operating system. Its basically a root directory, containing primarily /etc and /var (but possibly more). Sub-volumes of this type do not contain a populated /usr tree though. The <name> field refers to some instance name (maybe the host name of the instance). The other fields are defined as above. An example sub-volume name is root:revolution:org.fedoraproject.FedoraWorkstation:x86_64.

  • runtime:<vendorid>:<architecture>:<version> -- This refers to a vendor runtime. A runtime here is supposed to be a set of libraries and other resources that are needed to run apps (for the concept of apps see below), all in a /usr tree. In this regard this is very similar to the usr sub-volumes explained above, however, while a usr sub-volume is a full OS and contains everything necessary to boot, a runtime is really only a set of libraries. You cannot boot it, but you can run apps with it. An example sub-volume name is: runtime:org.gnome.GNOME3_20:x86_64:3.20.1

  • framework:<vendorid>:<architecture>:<version> -- This is very similar to a vendor runtime, as described above, it contains just a /usr tree, but goes one step further: it additionally contains all development headers, compilers and build tools, that allow developing against a specific runtime. For each runtime there should be a framework. When you develop against a specific framework in a specific architecture, then the resulting app will be compatible with the runtime of the same vendor ID and architecture. Example: framework:org.gnome.GNOME3_20:x86_64:3.20.1

  • app:<vendorid>:<runtime>:<architecture>:<version> -- This encapsulates an application bundle. It contains a tree that at runtime is mounted to /opt/<vendorid>, and contains all the application's resources. The <vendorid> could be a string like org.libreoffice.LibreOffice, the <runtime> refers to one the vendor id of one specific runtime the application is built for, for example org.gnome.GNOME3_20:3.20.1. The <architecture> and <version> refer to the architecture the application is built for, and of course its version. Example: app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133

  • home:<user>:<uid>:<gid> -- This sub-volume shall refer to the home directory of the specific user. The <user> field contains the user name, the <uid> and <gid> fields the numeric Unix UIDs and GIDs of the user. The idea here is that in the long run the list of sub-volumes is sufficient as a user database (but see below). Example: home:lennart:1000:1000.

btrfs partitions that adhere to this naming scheme should be clearly identifiable. It is our intention to introduce a new GPT partition type ID for this.

How To Use It

After we introduced this naming scheme let's see what we can build of this:

  • When booting up a system we mount the root directory from one of the root sub-volumes, and then mount /usr from a matching usr sub-volume. Matching here means it carries the same <vendor-id> and <architecture>. Of course, by default we should pick the matching usr sub-volume with the newest version by default.

  • When we boot up an OS container, we do exactly the same as the when we boot up a regular system: we simply combine a usr sub-volume with a root sub-volume.

  • When we enumerate the system's users we simply go through the list of home snapshots.

  • When a user authenticates and logs in we mount his home directory from his snapshot.

  • When an app is run, we set up a new file system name-space, mount the app sub-volume to /opt/<vendorid>/, and the appropriate runtime sub-volume the app picked to /usr, as well as the user's /home/$USER to its place.

  • When a developer wants to develop against a specific runtime he installs the right framework, and then temporarily transitions into a name space where /usris mounted from the framework sub-volume, and /home/$USER from his own home directory. In this name space he then runs his build commands. He can build in multiple name spaces at the same time, if he intends to builds software for multiple runtimes or architectures at the same time.

Instantiating a new system or OS container (which is exactly the same in this scheme) just consists of creating a new appropriately named root sub-volume. Completely naturally you can share one vendor OS copy in one specific version with a multitude of container instances.

Everything is double-buffered (or actually, n-fold-buffered), because usr, runtime, framework, app sub-volumes can exist in multiple versions. Of course, by default the execution logic should always pick the newest release of each sub-volume, but it is up to the user keep multiple versions around, and possibly execute older versions, if he desires to do so. In fact, like on ChromeOS this could even be handled automatically: if a system fails to boot with a newer snapshot, the boot loader can automatically revert back to an older version of the OS.

An Example

Note that in result this allows installing not only multiple end-user applications into the same btrfs volume, but also multiple operating systems, multiple system instances, multiple runtimes, multiple frameworks. Or to spell this out in an example:

Let's say Fedora, Mageia and ArchLinux all implement this scheme, and provide ready-made end-user images. Also, the GNOME, KDE, SDL projects all define a runtime+framework to develop against. Finally, both LibreOffice and Firefox provide their stuff according to this scheme. You can now trivially install of these into the same btrfs volume:

  • usr:org.fedoraproject.WorkStation:x86_64:24.7
  • usr:org.fedoraproject.WorkStation:x86_64:24.8
  • usr:org.fedoraproject.WorkStation:x86_64:24.9
  • usr:org.fedoraproject.WorkStation:x86_64:25beta
  • usr:org.mageia.Client:i386:39.3
  • usr:org.mageia.Client:i386:39.4
  • usr:org.mageia.Client:i386:39.6
  • usr:org.archlinux.Desktop:x86_64:302.7.8
  • usr:org.archlinux.Desktop:x86_64:302.7.9
  • usr:org.archlinux.Desktop:x86_64:302.7.10
  • root:revolution:org.fedoraproject.WorkStation:x86_64
  • root:testmachine:org.fedoraproject.WorkStation:x86_64
  • root:foo:org.mageia.Client:i386
  • root:bar:org.archlinux.Desktop:x86_64
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.1
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.4
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.5
  • runtime:org.gnome.GNOME3_22:x86_64:3.22.0
  • runtime:org.kde.KDE5_6:x86_64:5.6.0
  • framework:org.gnome.GNOME3_22:x86_64:3.22.0
  • framework:org.kde.KDE5_6:x86_64:5.6.0
  • app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133
  • app:org.libreoffice.LibreOffice:GNOME3_22:x86_64:166
  • app:org.mozilla.Firefox:GNOME3_20:x86_64:39
  • app:org.mozilla.Firefox:GNOME3_20:x86_64:40
  • home:lennart:1000:1000
  • home:hrundivbakshi:1001:1001

In the example above, we have three vendor operating systems installed. All of them in three versions, and one even in a beta version. We have four system instances around. Two of them of Fedora, maybe one of them we usually boot from, the other we run for very specific purposes in an OS container. We also have the runtimes for two GNOME releases in multiple versions, plus one for KDE. Then, we have the development trees for one version of KDE and GNOME around, as well as two apps, that make use of two releases of the GNOME runtime. Finally, we have the home directories of two users.

Now, with the name-spacing concepts we introduced above, we can actually relatively freely mix and match apps and OSes, or develop against specific frameworks in specific versions on any operating system. It doesn't matter if you booted your ArchLinux instance, or your Fedora one, you can execute both LibreOffice and Firefox just fine, because at execution time they get matched up with the right runtime, and all of them are available from all the operating systems you installed. You get the precise runtime that the upstream vendor of Firefox/LibreOffice did their testing with. It doesn't matter anymore which distribution you run, and which distribution the vendor prefers.

Also, given that the user database is actually encoded in the sub-volume list, it doesn't matter which system you boot, the distribution should be able to find your local users automatically, without any configuration in /etc/passwd.

Building Blocks

With this naming scheme plus the way how we can combine them on execution we already came quite far, but how do we actually get these sub-volumes onto the final machines, and how do we update them? Well, btrfs has a feature they call "send-and-receive". It basically allows you to "diff" two file system versions, and generate a binary delta. You can generate these deltas on a developer's machine and then push them into the user's system, and he'll get the exact same sub-volume too. This is how we envision installation and updating of operating systems, applications, runtimes, frameworks. At installation time, we simply deserialize an initial send-and-receive delta into our btrfs volume, and later, when a new version is released we just add in the few bits that are new, by dropping in another send-and-receive delta under a new sub-volume name. And we do it exactly the same for the OS itself, for a runtime, a framework or an app. There's no technical distinction anymore. The underlying operation for installing apps, runtime, frameworks, vendor OSes, as well as the operation for updating them is done the exact same way for all.

Of course, keeping multiple full /usr trees around sounds like an awful lot of waste, after all they will contain a lot of very similar data, since a lot of resources are shared between distributions, frameworks and runtimes. However, thankfully btrfs actually is able to de-duplicate this for us. If we add in a new app snapshot, this simply adds in the new files that changed. Moreover different runtimes and operating systems might actually end up sharing the same tree.

Even though the example above focuses primarily on the end-user, desktop side of things, the concept is also extremely powerful in server scenarios. For example, it is easy to build your own usr trees and deliver them to your hosts using this scheme. The usr sub-volumes are supposed to be something that administrators can put together. After deserializing them into a couple of hosts, you can trivially instantiate them as OS containers there, simply by adding a new root sub-volume for each instance, referencing the usr tree you just put together. Instantiating OS containers hence becomes as easy as creating a new btrfs sub-volume. And you can still update the images nicely, get fully double-buffered updates and everything.

And of course, this scheme also applies great to embedded use-cases. Regardless if you build a TV, an IVI system or a phone: you can put together you OS versions as usr trees, and then use btrfs-send-and-receive facilities to deliver them to the systems, and update them there.

Many people when they hear the word "btrfs" instantly reply with "is it ready yet?". Thankfully, most of the functionality we really need here is strictly read-only. With the exception of the home sub-volumes (see below) all snapshots are strictly read-only, and are delivered as immutable vendor trees onto the devices. They never are changed. Even if btrfs might still be immature, for this kind of read-only logic it should be more than good enough.

Note that this scheme also enables doing fat systems: for example, an installer image could include a Fedora version compiled for x86-64, one for i386, one for ARM, all in the same btrfs volume. Due to btrfs' de-duplication they will share as much as possible, and when the image is booted up the right sub-volume is automatically picked. Something similar of course applies to the apps too!

This also allows us to implement something that we like to call Operating-System-As-A-Virus. Installing a new system is little more than:

  • Creating a new GPT partition table
  • Adding an EFI System Partition (FAT) to it
  • Adding a new btrfs volume to it
  • Deserializing a single usr sub-volume into the btrfs volume
  • Installing a boot loader into the EFI System Partition
  • Rebooting

Now, since the only real vendor data you need is the usr sub-volume, you can trivially duplicate this onto any block device you want. Let's say you are a happy Fedora user, and you want to provide a friend with his own installation of this awesome system, all on a USB stick. All you have to do for this is do the steps above, using your installed usr tree as source to copy. And there you go! And you don't have to be afraid that any of your personal data is copied too, as the usr sub-volume is the exact version your vendor provided you with. Or with other words: there's no distinction anymore between installer images and installed systems. It's all the same. Installation becomes replication, not more. Live-CDs and installed systems can be fully identical.

Note that in this design apps are actually developed against a single, very specific runtime, that contains all libraries it can link against (including a specific glibc version!). Any library that is not included in the runtime the developer picked must be included in the app itself. This is similar how apps on Android declare one very specific Android version they are developed against. This greatly simplifies application installation, as there's no dependency hell: each app pulls in one runtime, and the app is actually free to pick which one, as you can have multiple installed, though only one is used by each app.

Also note that operating systems built this way will never see "half-updated" systems, as it is common when a system is updated using RPM/dpkg. When updating the system the code will either run the old or the new version, but it will never see part of the old files and part of the new files. This is the same for apps, runtimes, and frameworks, too.

Where We Are Now

We are currently working on a lot of the groundwork necessary for this. This scheme relies on the ability to monopolize the vendor OS resources in /usr, which is the key of what I described in Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems a few weeks back. Then, of course, for the full desktop app concept we need a strong sandbox, that does more than just hiding files from the file system view. After all with an app concept like the above the primary interfacing between the executed desktop apps and the rest of the system is via IPC (which is why we work on kdbus and teach it all kinds of sand-boxing features), and the kernel itself. Harald Hoyer has started working on generating the btrfs send-and-receive images based on Fedora.

Getting to the full scheme will take a while. Currently we have many of the building blocks ready, but some major items are missing. For example, we push quite a few problems into btrfs, that other solutions try to solve in user space. One of them is actually signing/verification of images. The btrfs maintainers are working on adding this to the code base, but currently nothing exists. This functionality is essential though to come to a fully verified system where a trust chain exists all the way from the firmware to the apps. Also, to make the home sub-volume scheme fully workable we actually need encrypted sub-volumes, so that the sub-volume's pass-phrase can be used for authenticating users in PAM. This doesn't exist either.

Working towards this scheme is a gradual process. Many of the steps we require for this are useful outside of the grand scheme though, which means we can slowly work towards the goal, and our users can already take benefit of what we are working on as we go.

Also, and most importantly, this is not really a departure from traditional operating systems:

Each app, each OS and each app sees a traditional Unix hierarchy with /usr, /home, /opt, /var, /etc. It executes in an environment that is pretty much identical to how it would be run on traditional systems.

There's no need to fully move to a system that uses only btrfs and follows strictly this sub-volume scheme. For example, we intend to provide implicit support for systems that are installed on ext4 or xfs, or that are put together with traditional packaging tools such as RPM or dpkg: if the the user tries to install a runtime/app/framework/os image on a system that doesn't use btrfs so far, it can just create a loop-back btrfs image in /var, and push the data into that. Even us developers will run our stuff like this for a while, after all this new scheme is not particularly useful for highly individualized systems, and we developers usually tend to run systems like that.

Also note that this in no way a departure from packaging systems like RPM or DEB. Even if the new scheme we propose is used for installing and updating a specific system, it is RPM/DEB that is used to put together the vendor OS tree initially. Hence, even in this scheme RPM/DEB are highly relevant, though not strictly as an end-user tool anymore, but as a build tool.

So Let's Summarize Again What We Propose

  • We want a unified scheme, how we can install and update OS images, user apps, runtimes and frameworks.

  • We want a unified scheme how you can relatively freely mix OS images, apps, runtimes and frameworks on the same system.

  • We want a fully trusted system, where cryptographic verification of all executed code can be done, all the way to the firmware, as standard feature of the system.

  • We want to allow app vendors to write their programs against very specific frameworks, under the knowledge that they will end up being executed with the exact same set of libraries chosen.

  • We want to allow parallel installation of multiple OSes and versions of them, multiple runtimes in multiple versions, as well as multiple frameworks in multiple versions. And of course, multiple apps in multiple versions.

  • We want everything double buffered (or actually n-fold buffered), to ensure we can reliably update/rollback versions, in particular to safely do automatic updates.

  • We want a system where updating a runtime, OS, framework, or OS container is as simple as adding in a new snapshot and restarting the runtime/OS/framework/OS container.

  • We want a system where we can easily instantiate a number of OS instances from a single vendor tree, with zero difference for doing this on order to be able to boot it on bare metal/VM or as a container.

  • We want to enable Linux to have an open scheme that people can use to build app markets and similar schemes, not restricted to a specific vendor.

Final Words

I'll be talking about this at LinuxCon Europe in October. I originally intended to discuss this at the Linux Plumbers Conference (which I assumed was the right forum for this kind of major plumbing level improvement), and at linux.conf.au, but there was no interest in my session submissions there...

Of course this is all work in progress. These are our current ideas we are working towards. As we progress we will likely change a number of things. For example, the precise naming of the sub-volumes might look very different in the end.

Of course, we are developers of the systemd project. Implementing this scheme is not just a job for the systemd developers. This is a reinvention how distributions work, and hence needs great support from the distributions. We really hope we can trigger some interest by publishing this proposal now, to get the distributions on board. This after all is explicitly not supposed to be a solution for one specific project and one specific vendor product, we care about making this open, and solving it for the generic case, without cutting corners.

If you have any questions about this, you know how you can reach us (IRC, mail, G+, ...).

The future is going to be awesome!


FUDCON + GNOME.Asia Beijing 2014

Thanks to the funding from FUDCON I had the chance to attend and keynote at the combined FUDCON Beijing 2014 and GNOME.Asia 2014 conference in Beijing, China.

My talk was about systemd's present and future, what we achieved and where we are going. In my talk I tried to explain a bit where we are coming from, and how we changed focus from being purely an init system, to more being a set of basic building blocks to build an OS from. Most of the talk I talked about where we still intend to take systemd, which areas we believe should be covered by systemd, and of course, also the always difficult question, on where to draw the line and what clearly is outside of the focus of systemd. The slides of my talk you find online. (No video recording I am aware of, sorry.)

The combined conferences were a lot of fun, and as usual, the best discussions I had in the hallway track, discussing Linux and systemd.

A number of pictures of the conference are now online. Enjoy!

After the conference I stayed for a few more days in Beijing, doing a bit of sightseeing. What a fantastic city! The food was amazing, we tried all kinds of fantastic stuff, from Peking duck, to Bullfrog Sechuan style. Yummy. And one of those days I am sure I will find the time to actually sort my photos and put them online, too.

I am really looking forward to the next FUDCON/GNOME.Asia!


Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems

(Just a small heads-up: I don't blog as much as I used to, I nowadays update my Google+ page a lot more frequently. You might want to subscribe that if you are interested in more frequent technical updates on what we are working on.)

In the past weeks we have been working on a couple of features for systemd that enable a number of new usecases I'd like to shed some light on. Taking benefit of the /usr merge that a number of distributions have completed we want to bring runtime behaviour of Linux systems to the next level. With the /usr merge completed most static vendor-supplied OS data is found exclusively in /usr, only a few additional bits in /var and /etc are necessary to make a system boot. On this we can build to enable a couple of new features:

  1. A mechanism we call Factory Reset shall flush out /etc and /var, but keep the vendor-supplied /usr, bringing the system back into a well-defined, pristine vendor state with no local state or configuration. This functionality is useful across the board from servers, to desktops, to embedded devices.
  2. A Stateless System goes one step further: a system like this never stores /etc or /var on persistent storage, but always comes up with pristine vendor state. On systems like this every reboot acts as factor reset. This functionality is particularly useful for simple containers or systems that boot off the network or read-only media, and receive all configuration they need during runtime from vendor packages or protocols like DHCP or are capable of discovering their parameters automatically from the available hardware or periphery.
  3. Reproducible Systems multiply a vendor image into many containers or systems. Only local configuration or state is stored per-system, while the vendor operating system is pulled in from the same, immutable, shared snapshot. Each system hence has its private /etc and /var for receiving local configuration, however the OS tree in /usr is pulled in via bind mounts (in case of containers) or technologies like NFS (in case of physical systems), or btrfs snapshots from a golden master image. This is particular interesting for containers where the goal is to run thousands of container images from the same OS tree. However, it also has a number of other usecases, for example thin client systems, which can boot the same NFS share a number of times. Furthermore this mechanism is useful to implement very simple OS installers, that simply unserialize a /usr snapshot into a file system, install a boot loader, and reboot.
  4. Verifiable Systems are closely related to stateless systems: if the underlying storage technology can cryptographically ensure that the vendor-supplied OS is trusted and in a consistent state, then it must be made sure that /etc or /var are either included in the OS image, or simply unnecessary for booting.

Concepts

A number of Linux-based operating systems have tried to implement some of the schemes described out above in one way or another. Particularly interesting are GNOME's OSTree, CoreOS and Google's Android and ChromeOS. They generally found different solutions for the specific problems you have when implementing schemes like this, sometimes taking shortcuts that keep only the specific case in mind, and cannot cover the general purpose. With systemd now being at the core of so many distributions and deeply involved in bringing up and maintaining the system we came to the conclusion that we should attempt to add generic support for setups like this to systemd itself, to open this up for the general purpose distributions to build on. We decided to focus on three kinds of systems:

  1. The stateful system, the traditional system as we know it with machine-specific /etc, /usr and /var, all properly populated.
  2. Startup without a populated /var, but with configured /etc. (We will call these volatile systems.)
  3. Startup without either /etc or /var. (We will call these stateless systems.)

A factory reset is just a special case of the latter two modes, where the system boots up without /var and /etc but the next boot is a normal stateful boot like like the first described mode. Note that a mode where /etc is flushed, but /var is not is nothing we intend to cover (why? well, the user ID question becomes much harder, see below, and we simply saw no usecase for it worth the trouble).

Problems

Booting up a system without a populated /var is relatively straight-forward. With a few lines of tmpfiles configuration it is possible to populate /var with its basic structure in a way that is sufficient to make a system boot cleanly. systemd version 214 and newer ship with support for this. Of course, support for this scheme in systemd is only a small part of the solution. While a lot of software reconstructs the directory hierarchy it needs in /var automatically, many software does not. In case like this it is necessary to ship a couple of additional tmpfiles lines that setup up at boot-time the necessary files or directories in /var to make the software operate, similar to what RPM or DEB packages would set up at installation time.

Booting up a system without a populated /etc is a more difficult task. In /etc we have a lot of configuration bits that are essential for the system to operate, for example and most importantly system user and group information in /etc/passwd and /etc/group. If the system boots up without /etc there must be a way to replicate the minimal information necessary in it, so that the system manages to boot up fully.

To make this even more complex, in order to support "offline" updates of /usr that are replicated into a number of systems possessing private /etc and /var there needs to be a way how these directories can be upgraded transparently when necessary, for example by recreating caches like /etc/ld.so.cache or adding missing system users to /etc/passwd on next reboot.

Starting with systemd 215 (yet unreleased, as I type this) we will ship with a number of features in systemd that make /etc-less boots functional:

  • A new tool systemd-sysusers as been added. It introduces a new drop-in directory /usr/lib/sysusers.d/. Minimal descriptions of necessary system users and groups can be placed there. Whenever the tool is invoked it will create these users in /etc/passwd and /etc/group should they be missing. It is only suitable for creating system users and groups, not for normal users. It will write to the files directly via the appropriate glibc APIs, which is the right thing to do for system users. (For normal users no such APIs exist, as the users might be stored centrally on LDAP or suchlike, and they are out of focus for our usecase.) The major benefit of this tool is that system user definition can happen offline: a package simply has to drop in a new file to register a user. This makes system user registration declarative instead of imperative -- which is the way how system users are traditionally created from RPM or DEB installation scripts. By being declarative it is easy to replicate the users on next boot to a number of system instances.

    To make this new tool interesting for packaging scripts we make it easy to alternatively invoke it during package installation time, thus being a good alternative to invocations of useradd -r and groupadd -r.

    Some OS designs use a static, fixed user/group list stored in /usr as primary database for users/groups, which fixed UID/GID mappings. While this works for specific systems, this cannot cover the general purpose. As the UID/GID range for system users/groups is very small (only containing 998 users and groups on most systems), the best has to be made from this space and only UIDs/GIDs necessary on the specific system should be allocated. This means allocation has to be dynamic and adjust to what is necessary.

    Also note that this tool has one very nice feature: in addition to fully dynamic, and fully static UID/GID assignment for the users to create, it supports reading UID/GID numbers off existing files in /usr, so that vendors can make use of setuid/setgid binaries owned by specific users.

  • We also added a default user definition list which creates the most basic users the system and systemd need. Of course, very likely downstream distributions might need to alter this default list, add new entries and possibly map specific users to particular numeric UIDs.
  • A new condition ConditionNeedsUpdate= has been added. With this mechanism it is possible to conditionalize execution of services depending on whether /usr is newer than /etc or /var. The idea is that various services that need to be added into the boot process on upgrades make use of this to not delay boot-ups on normal boots, but run as necessary should /usr have been update since the last boot. This is implemented based on the mtime timestamp of the /usr: if the OS has been updated the packaging software should touch the directory, thus informing all instances that an upgrade of /etc and /var might be necessary.
  • We added a number of service files, that make use of the new ConditionNeedsUpdate= switch, and run a couple of services after each update. Among them are the aforementiond systemd-sysusers tool, as well as services that rebuild the udev hardware database, the journal catalog database and the library cache in /etc/ld.so.cache.
  • If systemd detects an empty /etc at early boot it will now use the unit preset information to enable all services by default that the vendor or packager declared. It will then proceed booting.
  • We added a new tmpfiles snippet that is able to reconstruct the most basic structure of /etc if it is missing.
  • tmpfiles also gained the ability copy entire directory trees into place should they be missing. This is particularly useful for copying certain essential files or directories into /etc without which the system refuses to boot. Currently the most prominent candidates for this are /etc/pam.d and /etc/dbus-1. In the long run we hope that packages can be fixed so that they always work correctly without configuration in /etc. Depending on the software this means that they should come with compiled-in defaults that just work should their configuration file be missing, or that they should fall back to static vendor-supplied configuration in /usr that is used whenever /etc doesn't have any configuration. Both the PAM and the D-Bus case are probably candidates for the latter. Given that there are probably many cases like this we are working with a number of folks to introduce a new directory called /usr/share/etc (name is not settled yet) to major distributions, that always contain the full, original, vendor-supplied configuration of all packages. This is very useful here, so that there's an obvious place to copy the original configuration from, but it is also useful completely independently as this provides administrators with an easy place to diff their own configuration in /etc against to see what local changes are in place.
  • We added a new --tmpfs= switch to systemd-nspawn to make testing of systems with unpopulated /etc and /var easy. For example, to run a fully state-less container, use a command line like this:

    # system-nspawn -D /srv/mycontainer --read-only --tmpfs=/var --tmpfs=/etc -b

    This command line will invoke the container tree stored in /srv/mycontainer in a read-only way, but with a (writable) tmpfs mounted to /var and /etc. With a very recent git snapshot of systemd invoking a Fedora rawhide system should mostly work OK, modulo the D-Bus and PAM problems mentioned above. A later version of systemd-nspawn is likely to gain a high-level switch --mode={stateful|volatile|stateless} that sets combines this into simple switches reusing the vocabulary introduced earlier.

What's Next

Pulling this all together we are very close to making boots with empty /etc and /var on general purpose Linux operating systems a reality. Of course, while doing the groundwork in systemd gets us some distance, there's a lot of work left. Most importantly: the majority of Linux packages are simply incomptible with this scheme the way they are currently set up. They do not work without configuration in /etc or state directories in /var; they do not drop system user information in /usr/lib/sysusers.d. However, we believe it's our job to do the groundwork, and to start somewhere.

So what does this mean for the next steps? Of course, currently very little of this is available in any distribution (simply already because 215 isn't even released yet). However, this will hopefully change quickly. As soon as that is accomplished we can start working on making the other components of the OS work nicely in this scheme. If you are an upstream developer, please consider making your software work correctly if /etc and/or /var are not populated. This means:

  • When you need a state directory in /var and it is missing, create it first. If you cannot do that, because you dropped priviliges or suchlike, please consider dropping in a tmpfiles snippet that creates the directory with the right permissions early at boot, should it be missing.
  • When you need configuration files in /etc to work properly, consider changing your application to work nicely when these files are missing, and automatically fall back to either built-in defaults, or to static vendor-supplied configuration files shipped in /usr, so that administrators can override configuration in /etc but if they don't the default configuration counts.
  • When you need a system user or group, consider dropping in a file into /usr/lib/sysusers.d describing the users. (Currently documentation on this is minimal, we will provide more docs on this shortly.)

If you are a packager, you can also help on making this all work:

  • Ask upstream to implement what we describe above, possibly even preparing a patch for this.
  • If upstream will not make these changes, then consider dropping in tmpfiles snippets that copy the bare minimum of configuration files to make your software work from somewhere in /usr into /etc.
  • Consider moving from imperative useradd commands in packaging scripts, to declarative sysusers files. Ideally, this is shipped upstream too, but if that's not possible then simply adding this to packages should be good enough.

Of course, before moving to declarative system user definitions you should consult with your distribution whether their packaging policy even allows that. Currently, most distributions will not, so we have to work to get this changed first.

Anyway, so much about what we have been working on and where we want to take this.

Conclusion

Before we finish, let me stress again why we are doing all this:

  1. For end-user machines like desktops, tablets or mobile phones, we want a generic way to implement factory reset, which the user can make use of when the system is broken (saves you support costs), or when he wants to sell it and get rid of his private data, and renew that "fresh car smell".
  2. For embedded machines we want a generic way how to reset devices. We also want a way how every single boot can be identical to a factory reset, in a stateless system design.
  3. For all kinds of systems we want to centralize vendor data in /usr so that it can be strictly read-only, and fully cryptographically verified as one unit.
  4. We want to enable new kinds of OS installers that simply deserialize a vendor OS /usr snapshot into a new file system, install a boot loader and reboot, leaving all first-time configuration to the next boot.
  5. We want to enable new kinds of OS updaters that build on this, and manage a number of vendor OS /usr snapshots in verified states, and which can then update /etc and /var simply by rebooting into a newer version.
  6. We wanto to scale container setups naturally, by sharing a single golden master /usr tree with a large number of instances that simply maintain their own private /etc and /var for their private configuration and state, while still allowing clean updates of /usr.
  7. We want to make thin clients that share /usr across the network work by allowing stateless bootups. During all discussions on how /usr was to be organized this was fequently mentioned. A setup like this so far only worked in very specific cases, with this scheme we want to make this work in general case.

Of course, we have no illusions, just doing the groundwork for all of this in systemd doesn't make this all a real-life solution yet. Also, it's very unlikely that all of Fedora (or any other general purpose distribution) will support this scheme for all its packages soon, however, we are quite confident that the idea is convincing, that we need to start somewhere, and that getting the most core packages adapted to this shouldn't be out of reach.

Oh, and of course, the concepts behind this are really not new, we know that. However, what's new here is that we try to make them available in a general purpose OS core, instead of special purpose systems.

Anyway, let's get the ball rolling! Late's make stateless systems a reality!

And that's all I have for now. I am sure this leaves a lot of questions open. If you have any, join us on IRC on #systemd on freenode or comment on Google+.


Upcoming Events

You are invited to three events:

Christoph Wickert set up a Fedora 19 Release Party here in Berlin! Please join us on Tuesday, July 2nd.

We'll have another Berlin Open Source Meetup on Sunday, July 14th.

And finally, theres' going to be another systemd Hackfest, this time colocated with GUADEC, on Tuesday/Wednesday, August 6th/7th.

See you soon!

© Lennart Poettering. Built using Pelican. Theme by Giulio Fidente on github. .