レナート   TBFKAYIBYNYAAYB   ﻟﻴﻨﺎﺭﺕ

Mon, 23 Aug 2010

systemd Status Update

It has been a while since my original announcement of systemd. Here's a little status update, on what happened since then. For simplicity's sake I'll just list here what we worked on in a bulleted list, with no particular order and without trying to cover this comprehensively:

All in all, systemd is progressing nicely, and the features we have been working on in the last months are without exception features not existing in any other of the init systems available on Linux and our feature set already was far ahead of what the older init implementations provide. And we have quite a bit planned for the future. So, stay tuned!

Also note that I'll speak about systemd at LinuxKongress 2010 in Nuremberg, Germany. Later this year I'll also be speaking at the Linux Plumbers Conference in Boston, MA. Make sure to drop by if you want to learn about systemd or discuss exiciting new ideas or features with us.

posted at: 13:32 | path: /projects | permanent link to this entry | 38 comments

systemd for Administrators, Part 1

As many of you know, systemd is the new Fedora init system, starting with F14, and it is also on its way to being adopted in a number of other distributions as well (for example, OpenSUSE). For administrators systemd provides a variety of new features and changes and enhances the administrative process substantially. This blog story is the first part of a series of articles I plan to post roughly every week for the next months. In every post I will try to explain one new feature of systemd. Many of these features are small and simple, so these stories should be interesting to a broader audience. However, from time to time we'll dive a little bit deeper into the great new features systemd provides you with.

Verifying Bootup

Traditionally, when booting up a Linux system, you see a lot of little messages passing by on your screen. As we work on speeding up and parallelizing the boot process these messages are becoming visible for a shorter and shorter time only and be less and less readable -- if they are shown at all, given we use graphical boot splash technology like Plymouth these days. Nonetheless the information of the boot screens was and still is very relevant, because it shows you for each service that is being started as part of bootup, wether it managed to start up successfully or failed (with those green or red [ OK ] or [ FAILED ] indicators). To improve the situation for machines that boot up fast and parallelized and to make this information more nicely available during runtime, we added a feature to systemd that tracks and remembers for each service whether it started up successfully, whether it exited with a non-zero exit code, whether it timed out, or whether it terminated abnormally (by segfaulting or similar), both during start-up and runtime. By simply typing systemctl in your shell you can query the state of all services, both systemd native and SysV/LSB services:

[root@lambda] ~# systemctl
UNIT                                          LOAD   ACTIVE       SUB          JOB             DESCRIPTION
dev-hugepages.automount                       loaded active       running                      Huge Pages File System Automount Point
dev-mqueue.automount                          loaded active       running                      POSIX Message Queue File System Automount Point
proc-sys-fs-binfmt_misc.automount             loaded active       waiting                      Arbitrary Executable File Formats File System Automount Point
sys-kernel-debug.automount                    loaded active       waiting                      Debug File System Automount Point
sys-kernel-security.automount                 loaded active       waiting                      Security File System Automount Point
sys-devices-pc...0000:02:00.0-net-eth0.device loaded active       plugged                      82573L Gigabit Ethernet Controller
[...]
sys-devices-virtual-tty-tty9.device           loaded active       plugged                      /sys/devices/virtual/tty/tty9
-.mount                                       loaded active       mounted                      /
boot.mount                                    loaded active       mounted                      /boot
dev-hugepages.mount                           loaded active       mounted                      Huge Pages File System
dev-mqueue.mount                              loaded active       mounted                      POSIX Message Queue File System
home.mount                                    loaded active       mounted                      /home
proc-sys-fs-binfmt_misc.mount                 loaded active       mounted                      Arbitrary Executable File Formats File System
abrtd.service                                 loaded active       running                      ABRT Automated Bug Reporting Tool
accounts-daemon.service                       loaded active       running                      Accounts Service
acpid.service                                 loaded active       running                      ACPI Event Daemon
atd.service                                   loaded active       running                      Execution Queue Daemon
auditd.service                                loaded active       running                      Security Auditing Service
avahi-daemon.service                          loaded active       running                      Avahi mDNS/DNS-SD Stack
bluetooth.service                             loaded active       running                      Bluetooth Manager
console-kit-daemon.service                    loaded active       running                      Console Manager
cpuspeed.service                              loaded active       exited                       LSB: processor frequency scaling support
crond.service                                 loaded active       running                      Command Scheduler
cups.service                                  loaded active       running                      CUPS Printing Service
dbus.service                                  loaded active       running                      D-Bus System Message Bus
getty@tty2.service                            loaded active       running                      Getty on tty2
getty@tty3.service                            loaded active       running                      Getty on tty3
getty@tty4.service                            loaded active       running                      Getty on tty4
getty@tty5.service                            loaded active       running                      Getty on tty5
getty@tty6.service                            loaded active       running                      Getty on tty6
haldaemon.service                             loaded active       running                      Hardware Manager
hdapsd@sda.service                            loaded active       running                      sda shock protection daemon
irqbalance.service                            loaded active       running                      LSB: start and stop irqbalance daemon
iscsi.service                                 loaded active       exited                       LSB: Starts and stops login and scanning of iSCSI devices.
iscsid.service                                loaded active       exited                       LSB: Starts and stops login iSCSI daemon.
livesys-late.service                          loaded active       exited                       LSB: Late init script for live image.
livesys.service                               loaded active       exited                       LSB: Init script for live image.
lvm2-monitor.service                          loaded active       exited                       LSB: Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling
mdmonitor.service                             loaded active       running                      LSB: Start and stop the MD software RAID monitor
modem-manager.service                         loaded active       running                      Modem Manager
netfs.service                                 loaded active       exited                       LSB: Mount and unmount network filesystems.
NetworkManager.service                        loaded active       running                      Network Manager
ntpd.service                                  loaded maintenance  maintenance                  Network Time Service
polkitd.service                               loaded active       running                      Policy Manager
prefdm.service                                loaded active       running                      Display Manager
rc-local.service                              loaded active       exited                       /etc/rc.local Compatibility
rpcbind.service                               loaded active       running                      RPC Portmapper Service
rsyslog.service                               loaded active       running                      System Logging Service
rtkit-daemon.service                          loaded active       running                      RealtimeKit Scheduling Policy Service
sendmail.service                              loaded active       running                      LSB: start and stop sendmail
sshd@172.31.0.53:22-172.31.0.4:36368.service  loaded active       running                      SSH Per-Connection Server
sysinit.service                               loaded active       running                      System Initialization
systemd-logger.service                        loaded active       running                      systemd Logging Daemon
udev-post.service                             loaded active       exited                       LSB: Moves the generated persistent udev rules to /etc/udev/rules.d
udisks.service                                loaded active       running                      Disk Manager
upowerd.service                               loaded active       running                      Power Manager
wpa_supplicant.service                        loaded active       running                      Wi-Fi Security Service
avahi-daemon.socket                           loaded active       listening                    Avahi mDNS/DNS-SD Stack Activation Socket
cups.socket                                   loaded active       listening                    CUPS Printing Service Sockets
dbus.socket                                   loaded active       running                      dbus.socket
rpcbind.socket                                loaded active       listening                    RPC Portmapper Socket
sshd.socket                                   loaded active       listening                    sshd.socket
systemd-initctl.socket                        loaded active       listening                    systemd /dev/initctl Compatibility Socket
systemd-logger.socket                         loaded active       running                      systemd Logging Socket
systemd-shutdownd.socket                      loaded active       listening                    systemd Delayed Shutdown Socket
dev-disk-by\x1...x1db22a\x1d870f1adf2732.swap loaded active       active                       /dev/disk/by-uuid/fd626ef7-34a4-4958-b22a-870f1adf2732
basic.target                                  loaded active       active                       Basic System
bluetooth.target                              loaded active       active                       Bluetooth
dbus.target                                   loaded active       active                       D-Bus
getty.target                                  loaded active       active                       Login Prompts
graphical.target                              loaded active       active                       Graphical Interface
local-fs.target                               loaded active       active                       Local File Systems
multi-user.target                             loaded active       active                       Multi-User
network.target                                loaded active       active                       Network
remote-fs.target                              loaded active       active                       Remote File Systems
sockets.target                                loaded active       active                       Sockets
swap.target                                   loaded active       active                       Swap
sysinit.target                                loaded active       active                       System Initialization

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
JOB    = Pending job for the unit.

221 units listed. Pass --all to see inactive units, too.
[root@lambda] ~#

(I have shortened the output above a little, and removed a few lines not relevant for this blog post.)

Look at the ACTIVE column, which shows you the high-level state of a service (or in fact of any kind of unit systemd maintains, which can be more than just services, but we'll have a look on this in a later blog posting), whether it is active (i.e. running), inactive (i.e. not running) or in any other state. If you look closely you'll see one item in the list that is marked maintenance and highlighted in red. This informs you about a service that failed to run or otherwise encountered a problem. In this case this is ntpd. Now, let's find out what actually happened to ntpd, with the systemctl status command:

[root@lambda] ~# systemctl status ntpd.service
ntpd.service - Network Time Service
	  Loaded: loaded (/etc/systemd/system/ntpd.service)
	  Active: maintenance
	    Main: 953 (code=exited, status=255)
	  CGroup: name=systemd:/systemd-1/ntpd.service
[root@lambda] ~#

This shows us that NTP terminated during runtime (when it ran as PID 953), and tells us exactly the error condition: the process exited with an exit status of 255.

In a later systemd version, we plan to hook this up to ABRT, as soon as this enhancement request is fixed. Then, if systemctl status shows you information about a service that crashed it will direct you right-away to the appropriate crash dump in ABRT.

Summary: use systemctl and systemctl status as modern, more complete replacements for the traditional boot-up status messages of SysV services. systemctl status not only captures in more detail the error condition but also shows runtime errors in addition to start-up errors.

That's it for this week, make sure to come back next week, for the next posting about systemd for administrators!

posted at: 10:22 | path: /projects | permanent link to this entry | 30 comments

Thu, 19 Aug 2010

Dear Lazy Web,

does anybody know how to decode those Lenovo ThinkPad model IDs? I am interested in the T410s. For example, there's the model NUK3AGE, and there's NUHFXGE, and there's NUHYXGE. Some web sites claim NUK3AGE has Nvidia graphics, others claim VGA is Intel-only. Some web sites claim it has a touch screen, others say the contrary. The Lenovo web site isn't helpful to figure out the differences between the models and what the feature set of the various models really is. I figured out the GE suffix indicates a german keyboard, but what about the remaining code? Anybody knows how to decypher those IDs or knows a reliable source explaining their feature set?

Love,

Lennart

posted at: 13:27 | path: / | permanent link to this entry | 13 comments

Tue, 03 Aug 2010

Me too!

I too forgot to mention that my accommodation at GUADEC was sponsored by the GNOME Foundation. Thanks guys!

Sponsored

posted at: 13:22 | path: /projects | permanent link to this entry | 0 comments

Beating a Dead Horse

I guess it's a bit beating a dead horse, but I had a good laugh today when I learned that I alone contributed more to GNOME than the entirety of Canonical, and only 800 additional commits seperating me from being more awesome than Nokia.

/me is amused

posted at: 01:46 | path: /projects | permanent link to this entry | 20 comments

Mon, 02 Aug 2010

Interview With Yours Truly

Here's a podcast interview with yours truly where I speak a little about PulseAudio and systemd. Seek to 64:43 for my lovely impetuous voice. There's also an interview with Owen just before mine.

posted at: 17:13 | path: /projects | permanent link to this entry | 0 comments

Fri, 16 Jul 2010

Linux Plumbers Conference 2010 CFP Ending Soon!

The Call for Papers for the Linux Plumbers Conference (LPC) in November in Cambridge, Massachusetts is ending soon, on July 19th 2010 (That's the upcoming monday!). It's a conference about the core infrastructure of Linux systems: the part of the system where userspace and the kernel interface. It's the only conference where the focus is specifically on getting together the kernel people who work on the userspace interfaces and the userspace people who have to deal with kernel interfaces. It's supposed to be a place where all the people doing infrastructure work sit down and talk, so that both parties understand better what the requirements and needs of the other are, and where we can work towards fixing the major problems we currently have with our lower-level infrastructure and APIs.

The two previous LPCs were hugely successful (as reported on LWN on various occasions), and this time we hope to repeat that.

Like the previous years, I will be running the Audio conference track of LPC, this time together with Mark Brown. Audio infrastructure on Linux has been steadily improving the last years all over the place, but there's still a lot to do. Join us at the LPC to discuss the next steps and help improving Linux audio further! If you are doing audio infrastructure work on Linux, make sure to attend and submit a paper!

Sign up soon! Send in your paper quickly! Only three days left to the end of the CFP!

Plumbers Logo

(I am also planning to do a presentation there about systemd, together with Kay. Make sure to attend if you are interested in that topic.)

See you in Boston!

posted at: 18:35 | path: /projects | permanent link to this entry | 0 comments

Mon, 28 Jun 2010

Addendum on the Brokenness of File Locking

I forgot to mention another central problem in my blog story about file locking on Linux:

Different machines have access to different features of the same file system. Here's an example: let's say you have two machines in your home LAN. You want them to share their $HOME directory, so that you (or your family) can use either machine and have access to all your (or their) data. So you export /home on one machine via NFS and mount it from the other machine.

So far so good. But what happens to file locking now? Programs on the first machine see a fully-featured ext3 or ext4 file system, where all kinds of locking works (even though the API might suck as mentioned in the earlier blog story). But what about the other machine? If you set up lockd properly then POSIX locking will work on both. If you didn't one machine can use POSIX locking properly, the other cannot. And it gets even worse: as mentioned recent NFS implementations on Linux transparently convert client-side BSD locking into POSIX locking on the server side. Now, if the same application uses BSD locking on both the client and the server side from two instances they will end up with two orthogonal locks and although both sides think they have properly acquired a lock (and they actually did) they will overwrite each other's data, because those two locks are independent. (And one wonders why the NFS developers implemented this brokenness nonetheless...).

This basically means that locking cannot be used unless it is verified that everyone accessing a file system can make use of the same file system feature set. If you use file locking on a file system you should do so only if you are sufficiently sure that nobody using a broken or weird NFS implementation might want to access and lock those files as well. And practically that is impossible. Even if fpathconf() was improved so that it could inform the caller whether it can successfully apply a file lock to a file, this would still not give any hint if the same is true for everybody else accessing the file. But that is essential when speaking of advisory (i.e. cooperative) file locking.

And no, this isn't easy to fix. So again, the recommendation: forget about file locking on Linux, it's nothing more than a useless toy.

Also read Jeremy Allison's (Samba) take on POSIX file locking. It's an interesting read.

posted at: 19:49 | path: /projects | permanent link to this entry | 5 comments

Sat, 26 Jun 2010

On the Brokenness of File Locking

It's amazing how far Linux has come without providing for proper file locking that works and is usable from userspace. A little overview why file locking is still in a very sad state:

To begin with, there's a plethora of APIs, and all of them are awful:

The Disappointing Summary

File locking on Linux is just broken. The broken semantics of POSIX locking show that the designers of this API apparently never have tried to actually use it in real software. It smells a lot like an interface that kernel people thought makes sense but in reality doesn't when you try to use it from userspace.

Here's a list of places where you shouldn't use file locking due to the problems shown above: If you want to lock a file in $HOME, forget about it as $HOME might be NFS and locks generally are not reliable there. The same applies to every other file system that might be shared across the network. If the file you want to lock is accessible to more than your own user (i.e. an access mode > 0700), forget about locking, it would allow others to block your application indefinitely. If your program is non-trivial or threaded or uses a framework such as Gtk+ or Qt or any of the module-based APIs such as NSS, PAM, ... forget about about POSIX locking. If you care about portability, don't use file locking.

Or to turn this around, the only case where it is kind of safe to use file locking is in trivial applications where portability is not key and by using BSD locking on a file system where you can rely that it is local and on files inaccessible to others. Of course, that doesn't leave much, except for private files in /tmp for trivial user applications.

Or in one sentence: in its current state Linux file locking is unusable.

And that is a shame.

Update: Check out the follow-up story on this topic.

posted at: 19:38 | path: /projects | permanent link to this entry | 20 comments

On IDs

When programming software that cooperates with software running on behalf of other users, other sessions or other computers it is often necessary to work with unique identifiers. These can be bound to various hardware and software objects as well as lifetimes. Often, when people look for such an ID to use they pick the wrong one because semantics and lifetime or the IDs are not clear. Here's a little incomprehensive list of IDs accessible on Linux and how you should or should not use them.

Hardware IDs

  1. /sys/class/dmi/id/product_uuid: The main board product UUID, as set by the board manufacturer and encoded in the BIOS DMI information. It may be used to identify a mainboard and only the mainboard. It changes when the user replaces the main board. Also, often enough BIOS manufacturers write bogus serials into it. In addition, it is x86-specific. Access for unprivileged users is forbidden. Hence it is of little general use.
  2. CPUID/EAX=3 CPU serial number: A CPU UUID, as set by the CPU manufacturer and encoded on the CPU chip. It may be used to identify a CPU and only a CPU. It changes when the user replaces the CPU. Also, most modern CPUs don't implement this feature anymore, and older computers tend to disable this option by default, controllable via a BIOS Setup option. In addition, it is x86-specific. Hence this too is of little general use.
  3. /sys/class/net/*/address: One or more network MAC addresses, as set by the network adapter manufacturer and encoded on some network card EEPROM. It changes when the user replaces the network card. Since network cards are optional and there may be more than one the availability if this ID is not guaranteed and you might have more than one to choose from. On virtual machines the MAC addresses tend to be random. This too is hence of little general use.
  4. /sys/bus/usb/devices/*/serial: Serial numbers of various USB devices, as encoded in the USB device EEPROM. Most devices don't have a serial number set, and if they have it is often bogus. If the user replaces his USB hardware or plugs it into another machine these IDs may change or appear in other machines. This hence too is of little use.

There are various other hardware IDs available, many of which you may discover via the ID_SERIAL udev property of various devices, such hard disks and similar. They all have in common that they are bound to specific (replacable) hardware, not universally available, often filled with bogus data and random in virtualized environments. Or in other words: don't use them, don't rely on them for identification, unless you really know what you are doing and in general they do not guarantee what you might hope they guarantee.

Software IDs

  1. /proc/sys/kernel/random/boot_id: A random ID that is regenerated on each boot. As such it can be used to identify the local machine's current boot. It's universally available on any recent Linux kernel. It's a good and safe choice if you need to identify a specific boot on a specific booted kernel.
  2. gethostname(), /proc/sys/kernel/hostname: A non-random ID configured by the administrator to identify a machine in the network. Often this is not set at all or is set to some default value such as localhost and not even unique in the local network. In addition it might change during runtime, for example because it changes based on updated DHCP information. As such it is almost entirely useless for anything but presentation to the user. It has very weak semantics and relies on correct configuration by the administrator. Don't use this to identify machines in a distributed environment. It won't work unless centrally administered, which makes it useless in a globalized, mobile world. It has no place in automatically generated filenames that shall be bound to specific hosts. Just don't use it, please. It's really not what many people think it is. gethostname() is standardized in POSIX and hence portable to other Unixes.
  3. IP Addresses returned by SIOCGIFCONF or the respective Netlink APIs: These tend to be dynamically assigned and often enough only valid on local networks or even only the local links (i.e. 192.168.x.x style addresses, or even 169.254.x.x/IPv4LL). Unfortunately they hence have little use outside of networking.
  4. gethostid(): Returns a supposedly unique 32-bit identifier for the current machine. The semantics of this is not clear. On most machines this simply returns a value based on a local IPv4 address. On others it is administrator controlled via the /etc/hostid file. Since the semantics of this ID are not clear and most often is just a value based on the IP address it is almost always the wrong choice to use. On top of that 32bit are not particularly a lot. On the other hand this is standardized in POSIX and hence portable to other Unixes. It's probably best to ignore this value and if people don't want to ignore it they should probably symlink /etc/hostid to /var/lib/dbus/machine-id or something similar.
  5. /var/lib/dbus/machine-id: An ID identifying a specific Linux/Unix installation. It does not change if hardware is replaced. It is not unreliable in virtualized environments. This value has clear semantics and is considered part of the D-Bus API. It is supposedly globally unique and portable to all systems that have D-Bus. On Linux, it is universally available, given that almost all non-embedded and even a fair share of the embedded machines ship D-Bus now. This is the recommended way to identify a machine, possibly with a fallback to the host name to cover systems that still lack D-Bus. If your application links against libdbus, you may access this ID with dbus_get_local_machine_id(), if not you can read it directly from the file system.
  6. /proc/self/sessionid: An ID identifying a specific Linux login session. This ID is maintained by the kernel and part of the auditing logic. It is uniquely assigned to each login session during a specific system boot, shared by each process of a session, even across su/sudo and cannot be changed by userspace. Unfortunately some distributions have so far failed to set things up properly for this to work (Hey, you, Ubuntu!), and this ID is always (uint32_t) -1 for them. But there's hope they get this fixed eventually. Nonetheless it is a good choice for a unique session identifier on the local machine and for the current boot. To make this ID globally unique it is best combined with /proc/sys/kernel/random/boot_id.
  7. getuid(): An ID identifying a specific Unix/Linux user. This ID is usually automatically assigned when a user is created. It is not unique across machines and may be reassigned to a different user if the original user was deleted. As such it should be used only locally and with the limited validity in time in mind. To make this ID globally unique it is not sufficient to combine it with /var/lib/dbus/machine-id, because the same ID might be used for a different user that is created later with the same UID. Nonetheless this combination is often good enough. It is available on all POSIX systems.
  8. ID_FS_UUID: an ID that identifies a specific file system in the udev tree. It is not always clear how these serials are generated but this tends to be available on almost all modern disk file systems. It is not available for NFS mounts or virtual file systems. Nonetheless this is often a good way to identify a file system, and in the case of the root directory even an installation. However due to the weakly defined generation semantics the D-Bus machine ID is generally preferrable.

Generating IDs

Linux offers a kernel interface to generate UUIDs on demand, by reading from /proc/sys/kernel/random/uuid. This is a very simple interface to generate UUIDs. That said, the logic behind UUIDs is unnecessarily complex and often it is a better choice to simply read 16 bytes or so from /dev/urandom.

Summary

And the gist of it all: Use /var/lib/dbus/machine-id! Use /proc/self/sessionid! Use /proc/sys/kernel/random/boot_id! Use getuid()! Use /dev/urandom! And forget about the rest, in particular the host name, or the hardware IDs such as DMI. And keep in mind that you may combine the aforementioned IDs in various ways to get different semantics and validity constraints.

posted at: 17:02 | path: /projects | permanent link to this entry | 7 comments

Tue, 15 Jun 2010

Slides from LinuxTag 2010

On popular request, here are my (terse) slides from LinuxTag on systemd.

posted at: 14:15 | path: /projects | permanent link to this entry | 3 comments

Sat, 05 Jun 2010

Change of Plans

The upcoming week I'll do two talks at LinuxTag 2010 at the Berlin Fair Grounds. One of them was only added to the schedule today, about systemd. Systemd has never been presented in a public talk before, so make sure to attend this historic moment... ;-). Read about what has been written about systemd so far, so that you can ask the sharpest questions during my presentation.

My second talk might be about stuff a little less reported in the press, but still very interesting, about Surround Sound in Gnome.

See you at LinuxTag!

posted at: 19:25 | path: /projects | permanent link to this entry | 6 comments

Wed, 19 May 2010

Mango Lassi is Back

Mango Lassi's Icon

Sven Herzberg has recently been doing a lot of work on Mango Lassi, a project deserving love but which I as its original author haven't touched in 3 years.

His work is already bearing fruits:

Mango Lassi

Distribution packagers, please go and package his version, Mango Lassi is an awesome, wonderful tool that needs distributor love.

If you want to use Mango Lassi without waiting for the distribution packagers to catch up, Sven has built some packages for you in the OpenSUSE Build Service.

Sven, KUTGW!

posted at: 17:39 | path: /projects | permanent link to this entry | 2 comments

Tue, 11 May 2010

Name Your Threads

Stefan Kost recently pointed me to the fact that the Linux system call prctl(PR_SET_NAME) does not in fact change the process name, but the task name (comm field) -- in contrast to what the man page suggests.

That makes it very useful for naming threads, since you can read back the name you set with PR_SET_NAME earlier from the /proc file system (/proc/$PID/task/$TID/comm on newer kernels, /proc/$PID/task/$TID/stat's second field on older kernels), and hence distuingish which thread might be responsible for the high CPU load or similar problems.

So, now go, if you have a project which involves a lot of threads, name them all individually, and make it easier to debug them. What's missing now, of course, is that gdb learns this and shows the comm name when doing info threads.

I have changed PulseAudio now to name all threads it creates.

Of course, what would be even better than this is full file system extended attribute support in procfs, so that we could attach arbitrary information to processes and threads, including references to .desktop files and such.

posted at: 01:22 | path: /projects | permanent link to this entry | 4 comments

Fri, 07 May 2010

systemd Now Has a Web Site

We now have a web site, a mailing list, a bugzilla component and moved our git repositories to freedesktop.org. Make sure to update your check-outs.

For more details see our new web site.

posted at: 22:29 | path: /projects | permanent link to this entry | 0 comments

Wed, 05 May 2010

LAC Video Streams Online

The great people from the Linux Audio Conference uploaded the video streams from the event. Among them you can find my own presentation. Enjoy!

posted at: 20:22 | path: /projects | permanent link to this entry | 2 comments

PulseAudio and Jack

One thing became very clear to me during my trip to the Linux Audio Conference 2010 in Utrecht: even many pro audio folks are not sure what Jack does that PulseAudio doesn't do and what PulseAudio does that Jack doesn't do; why they are not competing, why you cannot replace one by the other, and why merging them (at least in the short term) might not make immediate sense. In other words, why millions of phones on this world run PulseAudio and not Jack, and why a music studio running PulseAudio is crack.

To light this up a bit and for future reference I'll try to explain in the following text why there is this seperation between the two systems and why this isn't necessarily bad. This is mostly a written up version of (parts of) my slides from LAC, so if you attended that event you might find little new, but I hope it is interesting nonetheless.

This is mostly written from my perspective as a hacker working on consumer audio stuff (more specifically having written most of PulseAudio), but I am sure most pro audio folks would agree with the points I raise here, and have more things to add. What I explain below is in no way comprehensive, just a list of a couple of points I think are the most important, as they touch the very core of both systems (and we ignore all the toppings here, i.e. sound effects, yadda, yadda).

First of all let's clear up the background of the sound server use cases here:

Consumer Audio (i.e. PulseAudio) Pro Audio (i.e. Jack)
Reducing power usage is a defining requirement, most systems are battery powered (Laptops, Cell Phones). Power usage usually not an issue, power comes out of the wall.
Must support latencies low enough for telephony and games. Also covers high latency uses, such as movie and music playback (2s of latency is a good choice). Minimal latencies are a definining requirement.
System is highly dynamic, with applications starting/stopping, hardware added and removed all the time. System is usually static in its configuration during operation.
User is usually not proficient in the technologies used.[1] User is usually a professional and knows audio technology and computers well.
User is not necessarily the administrator of his machine, might have limited access. User usually administrates his own machines, has root privileges.
Audio is just one use of the system among many, and often just a background job. Audio is the primary purpose of the system.
Hardware tends to have limited resources and be crappy and cheap. Hardware is powerful, expensive and high quality.

Of course, things are often not as black and white like this, there are uses that fall in the middle of these two areas.

From the table above a few conclusions may be drawn:

Jack has been designed for low latencies, where synchronous operation is advisable, meaning that a misbehaving client call stall the entire pipeline. Changes of the pipeline or latencies usually result in drop-outs in one way or the other, since the entire pipeline is reconfigured, from the hardware to the various clients. Jack only supports FLOAT32 samples and non-interleaved audio channels (and that is a good thing). Jack does not employ reference-counted zero-copy buffers. It does not try to simplify the hardware mixer in any way.

PulseAudio OTOH can deal with varying latancies, dynamically adjusting to the lowest latencies any of the connected clients needs. Client communication is fully asynchronous, a single client cannot stall the entire pipeline. PulseAudio supports a variety of PCM formats and channel setups. PulseAudio's design is heavily based on reference-counted zero-copy buffers that are passed around, even between processes, instead of the audio data itself. PulseAudio tries to simplify the hardware mixer as suggested above.

Now, the two paragraphs above hopefully show how Jack is more suitable for the pro audio use case and PulseAudio more for the consumer audio use case. One question asks itself though: can we marry the two approaches? Yes, we probably can, MacOS has a unified approach for both uses. However, it is not clear this would be a good idea. First of all, a system with the complexities introduced by sample format/channel mapping conversion, as well as dynamically changing latencies and pipelines, and asynchronous behaviour would certainly be much less attractive to pro audio developers. In fact, that Jack limits itself to synchronous, FLOAT32-only, non-interleaved-only audio streams is one of the big features of its design. Marrying the two approaches would corrupt that. A merged solution would probably not have a good stand in the community.

But it goes even further than this: what would the use case for this be? After all, most of the time, you don't want your event sounds, your Youtube, your VoIP and your Rhythmbox mixed into the new record you are producing. Hence a clear seperation between the two worlds might even be handy?

Also, let's not forget that we lack the manpower to even create such an audio chimera.

So, where to from here? Well, I think we should put the focus on cooperation instead of amalgamation: teach PulseAudio to go out of the way as soon as Jack needs access to the device, and optionally make PulseAudio a normal JACK client while both are running. That way, the user has the option to use the PulseAudio supplied streams, but normally does not see them in his pipeline. The first part of this has already been implemented: Jack2 and PulseAudio do not fight for the audio device, a friendly handover takes place. Jack takes precedence, PulseAudio takes the back seat. The second part is still missing: you still have to manually hookup PulseAudio to Jack if you are interested in its streams. If both are implemented starting Jack basically has the effect of replacing PulseAudio's core with the Jack core, while still providing full compatibility with PulseAudio clients.

And that I guess is all I have to say on the entire Jack and PulseAudio story.

Oh, one more thing, while we are at clearing things up: some news sites claim that PulseAudio's not necessarily stellar reputation in some parts of the community comes from Ubuntu and other distributions having integrated it too early. Well, let me stress here explicitly, that while they might have made a mistake or two in packaging PulseAudio and I publicly pointed that out (and probably not in a too friendly way), I do believe that the point in time they adopted it was right. Why? Basically, it's a chicken and egg problem. If it is not used in the distributions it is not tested, and there is no pressure to get fixed what then turns out to be broken: in PulseAudio itself, and in both the layers on top and below of it. Don't forget that pushing a new layer into an existing stack will break a lot of assumptions that the neighboring layers made. Doing this must break things. Most Free Software projects could probably use more developers, and that is particularly true for Audio on Linux. And given that that is how it is, pushing the feature in at that point in time was the right thing to do. Or in other words, if the features are right, and things do work correctly as far as the limited test base the developers control shows, then one day you need to push into the distributions, even if this might break setups and software that previously has not been tested, unless you want to stay stuck in your development indefinitely. So yes, Ubuntu, I think you did well with adopting PulseAudio when you did.

Footnotes

[1] Side note: yes, consumers tend not to know what dB is, and expect volume settings in "percentages", a mostly meaningless unit in audio. This even spills into projects like VLC or Amarok which expose linear volume controls (which is a really bad idea).

[2] In case you are wondering why that is the case: if the latency is low the buffers must be sized smaller. And if the buffers are sized smaller then the CPU will have to wake up more often to fill them up for the same playback time. This drives up the CPU load since less actual payload can be processed for the amount of housekeeping that the CPU has to do during each buffer iteration. Also, frequent wake-ups make it impossible for the CPU to go to deeper sleep states. Sleep states are the primary way for modern CPUs to save power.

posted at: 02:44 | path: /projects | permanent link to this entry | 0 comments

systemd In The News

A few news sites brought articles (some shorter, others longer) about last week's blog story on systemd:

Related to this, Scott's cordial reply.

And this I find funny, make sure to vote for it... ;-)

Many of the comments on those stories are quite interesting, though sometimes a little, uh..., misled... ;-)

Generally the reception of the ideas seems to be very positive. And that's certainly good news and encouraging.

posted at: 01:30 | path: /projects | permanent link to this entry | 0 comments

Fri, 30 Apr 2010

Rethinking PID 1

If you are well connected or good at reading between the lines you might already know what this blog post is about. But even then you may find this story interesting. So grab a cup of coffee, sit down, and read what's coming.

This blog story is long, so even though I can only recommend reading the long story, here's the one sentence summary: we are experimenting with a new init system and it is fun.

Here's the code. And here's the story:

Process Identifier 1

On every Unix system there is one process with the special process identifier 1. It is started by the kernel before all other processes and is the parent process for all those other processes that have nobody else to be child of. Due to that it can do a lot of stuff that other processes cannot do. And it is also responsible for some things that other processes are not responsible for, such as bringing up and maintaining userspace during boot.

Historically on Linux the software acting as PID 1 was the venerable sysvinit package, though it had been showing its age for quite a while. Many replacements have been suggested, only one of them really took off: Upstart, which has by now found its way into all major distributions.

As mentioned, the central responsibility of an init system is to bring up userspace. And a good init system does that fast. Unfortunately, the traditional SysV init system was not particularly fast.

For a fast and efficient boot-up two things are crucial:

What does that mean? Starting less means starting fewer services or deferring the starting of services until they are actually needed. There are some services where we know that they will be required sooner or later (syslog, D-Bus system bus, etc.), but for many others this isn't the case. For example, bluetoothd does not need to be running unless a bluetooth dongle is actually plugged in or an application wants to talk to its D-Bus interfaces. Same for a printing system: unless the machine physically is connected to a printer, or an application wants to print something, there is no need to run a printing daemon such as CUPS. Avahi: if the machine is not connected to a network, there is no need to run Avahi, unless some application wants to use its APIs. And even SSH: as long as nobody wants to contact your machine there is no need to run it, as long as it is then started on the first connection. (And admit it, on most machines where sshd might be listening somebody connects to it only every other month or so.)

Starting more in parallel means that if we have to run something, we should not serialize its start-up (as sysvinit does), but run it all at the same time, so that the available CPU and disk IO bandwidth is maxed out, and hence the overall start-up time minimized.

Hardware and Software Change Dynamically

Modern systems (especially general purpose OS) are highly dynamic in their configuration and use: they are mobile, different applications are started and stopped, different hardware added and removed again. An init system that is responsible for maintaining services needs to listen to hardware and software changes. It needs to dynamically start (and sometimes stop) services as they are needed to run a program or enable some hardware.

Most current systems that try to parallelize boot-up still synchronize the start-up of the various daemons involved: since Avahi needs D-Bus, D-Bus is started first, and only when D-Bus signals that it is ready, Avahi is started too. Similar for other services: livirtd and X11 need HAL (well, I am considering the Fedora 13 services here, ignore that HAL is obsolete), hence HAL is started first, before livirtd and X11 are started. And libvirtd also needs Avahi, so it waits for Avahi too. And all of them require syslog, so they all wait until Syslog is fully started up and initialized. And so on.

Parallelizing Socket Services

This kind of start-up synchronization results in the serialization of a significant part of the boot process. Wouldn't it be great if we could get rid of the synchronization and serialization cost? Well, we can, actually. For that, we need to understand what exactly the daemons require from each other, and why their start-up is delayed. For traditional Unix daemons, there's one answer to it: they wait until the socket the other daemon offers its services on is ready for connections. Usually that is an AF_UNIX socket in the file-system, but it could be AF_INET[6], too. For example, clients of D-Bus wait that /var/run/dbus/system_bus_socket can be connected to, clients of syslog wait for /dev/log, clients of CUPS wait for /var/run/cups/cups.sock and NFS mounts wait for /var/run/rpcbind.sock and the portmapper IP port, and so on. And think about it, this is actually the only thing they wait for!

Now, if that's all they are waiting for, if we manage to make those sockets available for connection earlier and only actually wait for that instead of the full daemon start-up, then we can speed up the entire boot and start more processes in parallel. So, how can we do that? Actually quite easily in Unix-like systems: we can create the listening sockets before we actually start the daemon, and then just pass the socket during exec() to it. That way, we can create all sockets for all daemons in one step in the init system, and then in a second step run all daemons at once. If a service needs another, and it is not fully started up, that's completely OK: what will happen is that the connection is queued in the providing service and the client will potentially block on that single request. But only that one client will block and only on that one request. Also, dependencies between services will no longer necessarily have to be configured to allow proper parallelized start-up: if we start all sockets at once and a service needs another it can be sure that it can connect to its socket.

Because this is at the core of what is following, let me say this again, with different words and by example: if you start syslog and and various syslog clients at the same time, what will happen in the scheme pointed out above is that the messages of the clients will be added to the /dev/log socket buffer. As long as that buffer doesn't run full, the clients will not have to wait in any way and can immediately proceed with their start-up. As soon as syslog itself finished start-up, it will dequeue all messages and process them. Another example: we start D-Bus and several clients at the same time. If a synchronous bus request is sent and hence a reply expected, what will happen is that the client will have to block, however only that one client and only until D-Bus managed to catch up and process it.

Basically, the kernel socket buffers help us to maximize parallelization, and the ordering and synchronization is done by the kernel, without any further management from userspace! And if all the sockets are available before the daemons actually start-up, dependency management also becomes redundant (or at least secondary): if a daemon needs another daemon, it will just connect to it. If the other daemon is already started, this will immediately succeed. If it isn't started but in the process of being started, the first daemon will not even have to wait for it, unless it issues a synchronous request. And even if the other daemon is not running at all, it can be auto-spawned. From the first daemon's perspective there is no difference, hence dependency management becomes mostly unnecessary or at least secondary, and all of this in optimal parallelization and optionally with on-demand loading. On top of this, this is also more robust, because the sockets stay available regardless whether the actual daemons might temporarily become unavailable (maybe due to crashing). In fact, you can easily write a daemon with this that can run, and exit (or crash), and run again and exit again (and so on), and all of that without the clients noticing or loosing any request.

It's a good time for a pause, go and refill your coffee mug, and be assured, there is more interesting stuff following.

But first, let's clear a few things up: is this kind of logic new? No, it certainly is not. The most prominent system that works like this is Apple's launchd system: on MacOS the listening of the sockets is pulled out of all daemons and done by launchd. The services themselves hence can all start up in parallel and dependencies need not to be configured for them. And that is actually a really ingenious design, and the primary reason why MacOS manages to provide the fantastic boot-up times it provides. I can highly recommend this video where the launchd folks explain what they are doing. Unfortunately this idea never really took on outside of the Apple camp.

The idea is actually even older than launchd. Prior to launchd the venerable inetd worked much like this: sockets were centrally created in a daemon that would start the actual service daemons passing the socket file descriptors during exec(). However the focus of inetd certainly wasn't local services, but Internet services (although later reimplementations supported AF_UNIX sockets, too). It also wasn't a tool to parallelize boot-up or even useful for getting implicit dependencies right.

For TCP sockets inetd was primarily used in a way that for every incoming connection a new daemon instance was spawned. That meant that for each connection a new process was spawned and initialized, which is not a recipe for high-performance servers. However, right from the beginning inetd also supported another mode, where a single daemon was spawned on the first connection, and that single instance would then go on and also accept the follow-up connections (that's what the wait/nowait option in inetd.conf was for, a particularly badly documented option, unfortunately.) Per-connection daemon starts probably gave inetd its bad reputation for being slow. But that's not entirely fair.

Parallelizing Bus Services

Modern daemons on Linux tend to provide services via D-Bus instead of plain AF_UNIX sockets. Now, the question is, for those services, can we apply the same parallelizing boot logic as for traditional socket services? Yes, we can, D-Bus already has all the right hooks for it: using bus activation a service can be started the first time it is accessed. Bus activation also gives us the minimal per-request synchronisation we need for starting up the providers and the consumers of D-Bus services at the same time: if we want to start Avahi at the same time as CUPS (side note: CUPS uses Avahi to browse for mDNS/DNS-SD printers), then we can simply run them at the same time, and if CUPS is quicker than Avahi via the bus activation logic we can get D-Bus to queue the request until Avahi manages to establish its service name.

So, in summary: the socket-based service activation and the bus-based service activation together enable us to start all daemons in parallel, without any further synchronization. Activation also allows us to do lazy-loading of services: if a service is rarely used, we can just load it the first time somebody accesses the socket or bus name, instead of starting it during boot.

And if that's not great, then I don't know what is great!

Parallelizing File System Jobs

If you look at the serialization graphs of the boot process of current distributions, there are more synchronisation points than just daemon start-ups: most prominently there are file-system related jobs: mounting, fscking, quota. Right now, on boot-up a lot of time is spent idling to wait until all devices that are listed in /etc/fstab show up in the device tree and are then fsck'ed, mounted, quota checked (if enabled). Only after that is fully finished we go on and boot the actual services.

Can we improve this? It turns out we can. Harald Hoyer came up with the idea of using the venerable autofs system for this:

Just like a connect() call shows that a service is interested in another service, an open() (or a similar call) shows that a service is interested in a specific file or file-system. So, in order to improve how much we can parallelize we can make those apps wait only if a file-system they are looking for is not yet mounted and readily available: we set up an autofs mount point, and then when our file-system finished fsck and quota due to normal boot-up we replace it by the real mount. While the file-system is not ready yet, the access will be queued by the kernel and the accessing process will block, but only that one daemon and only that one access. And this way we can begin starting our daemons even before all file systems have been fully made available -- without them missing any files, and maximizing parallelization.

Parallelizing file system jobs and service jobs does not make sense for /, after all that's where the service binaries are usually stored. However, for file-systems such as /home, that usually are bigger, even encrypted, possibly remote and seldom accessed by the usual boot-up daemons, this can improve boot time considerably. It is probably not necessary to mention this, but virtual file systems, such as procfs or sysfs should never be mounted via autofs.

I wouldn't be surprised if some readers might find integrating autofs in an init system a bit fragile and even weird, and maybe more on the "crackish" side of things. However, having played around with this extensively I can tell you that this actually feels quite right. Using autofs here simply means that we can create a mount point without having to provide the backing file system right-away. In effect it hence only delays accesses. If an application tries to access an autofs file-system and we take very long to replace it with the real file-system, it will hang in an interruptible sleep, meaning that you can safely cancel it, for example via C-c. Also note that at any point, if the mount point should not be mountable in the end (maybe because fsck failed), we can just tell autofs to return a clean error code (like ENOENT). So, I guess what I want to say is that even though integrating autofs into an init system might appear adventurous at first, our experimental code has shown that this idea works surprisingly well in practice -- if it is done for the right reasons and the right way.

Also note that these should be direct autofs mounts, meaning that from an application perspective there's little effective difference between a classic mount point and one based on autofs.

Keeping the First User PID Small

Another thing we can learn from the MacOS boot-up logic is that shell scripts are evil. Shell is fast and shell is slow. It is fast to hack, but slow in execution. The classic sysvinit boot logic is modelled around shell scripts. Whether it is /bin/bash or any other shell (that was written to make shell scripts faster), in the end the approach is doomed to be slow. On my system the scripts in /etc/init.d call grep at least 77 times. awk is called 92 times, cut 23 and sed 74. Every time those commands (and others) are called, a process is spawned, the libraries searched, some start-up stuff like i18n and so on set up and more. And then after seldom doing more than a trivial string operation the process is terminated again. Of course, that has to be incredibly slow. No other language but shell would do something like that. On top of that, shell scripts are also very fragile, and change their behaviour drastically based on environment variables and suchlike, stuff that is hard to oversee and control.

So, let's get rid of shell scripts in the boot process! Before we can do that we need to figure out what they are currently actually used for: well, the big picture is that most of the time, what they do is actually quite boring. Most of the scripting is spent on trivial setup and tear-down of services, and should be rewritten in C, either in separate executables, or moved into the daemons themselves, or simply be done in the init system.

It is not likely that we can get rid of shell scripts during system boot-up entirely anytime soon. Rewriting them in C takes time, in a few case does not really make sense, and sometimes shell scripts are just too handy to do without. But we can certainly make them less prominent.

A good metric for measuring shell script infestation of the boot process is the PID number of the first process you can start after the system is fully booted up. Boot up, log in, open a terminal, and type echo $$. Try that on your Linux system, and then compare the result with MacOS! (Hint, it's something like this: Linux PID 1823; MacOS PID 154, measured on test systems we own.)

Keeping Track of Processes

A central part of a system that starts up and maintains services should be process babysitting: it should watch services. Restart them if they shut down. If they crash it should collect information about them, and keep it around for the administrator, and cross-link that information with what is available from crash dump systems such as abrt, and in logging systems like syslog or the audit system.

It should also be capable of shutting down a service completely. That might sound easy, but is harder than you think. Traditionally on Unix a process that does double-forking can escape the supervision of its parent, and the old parent will not learn about the relation of the new process to the one it actually started. An example: currently, a misbehaving CGI script that has double-forked is not terminated when you shut down Apache. Furthermore, you will not even be able to figure out its relation to Apache, unless you know it by name and purpose.

So, how can we keep track of processes, so that they cannot escape the babysitter, and that we can control them as one unit even if they fork a gazillion times?

Different people came up with different solutions for this. I am not going into much detail here, but let's at least say that approaches based on ptrace or the netlink connector (a kernel interface which allows you to get a netlink message each time any process on the system fork()s or exit()s) that some people have investigated and implemented, have been criticised as ugly and not very scalable.

So what can we do about this? Well, since quite a while the kernel knows Control Groups (aka "cgroups"). Basically they allow the creation of a hierarchy of groups of processes. The hierarchy is directly exposed in a virtual file-system, and hence easily accessible. The group names are basically directory names in that file-system. If a process belonging to a specific cgroup fork()s, its child will become a member of the same group. Unless it is privileged and has access to the cgroup file system it cannot escape its group. Originally, cgroups have been introduced into the kernel for the purpose of containers: certain kernel subsystems can enforce limits on resources of certain groups, such as limiting CPU or memory usage. Traditional resource limits (as implemented by setrlimit()) are (mostly) per-process. cgroups on the other hand let you enforce limits on entire groups of processes. cgroups are also useful to enforce limits outside of the immediate container use case. You can use it for example to limit the total amount of memory or CPU Apache and all its children may use. Then, a misbehaving CGI script can no longer escape your setrlimit() resource control by simply forking away.

In addition to container and resource limit enforcement cgroups are very useful to keep track of daemons: cgroup membership is securely inherited by child processes, they cannot escape. There's a notification system available so that a supervisor process can be notified when a cgroup runs empty. You can find the cgroups of a process by reading /proc/$PID/cgroup. cgroups hence make a very good choice to keep track of processes for babysitting purposes.

Controlling the Process Execution Environment

A good babysitter should not only oversee and control when a daemon starts, ends or crashes, but also set up a good, minimal, and secure working environment for it.

That means setting obvious process parameters such as the setrlimit() resource limits, user/group IDs or the environment block, but does not end there. The Linux kernel gives users and administrators a lot of control over processes (some of it is rarely used, currently). For each process you can set CPU and IO scheduler controls, the capability bounding set, CPU affinity or of course cgroup environments with additional limits, and more.

As an example, ioprio_set() with IOPRIO_CLASS_IDLE is a great away to minimize the effect of locate's updatedb on system interactivity.

On top of that certain high-level controls can be very useful, such as setting up read-only file system overlays based on read-only bind mounts. That way one can run certain daemons so that all (or some) file systems appear read-only to them, so that EROFS is returned on every write request. As such this can be used to lock down what daemons can do similar in fashion to a poor man's SELinux policy system (but this certainly doesn't replace SELinux, don't get any bad ideas, please).

Finally logging is an important part of executing services: ideally every bit of output a service generates should be logged away. An init system should hence provide logging to daemons it spawns right from the beginning, and connect stdout and stderr to syslog or in some cases even /dev/kmsg which in many cases makes a very useful replacement for syslog (embedded folks, listen up!), especially in times where the kernel log buffer is configured ridiculously large out-of-the-box.

On Upstart

To begin with, let me emphasize that I actually like the code of Upstart, it is very well commented and easy to follow. It's certainly something other projects should learn from (including my own).

That being said, I can't say I agree with the general approach of Upstart. But first, a bit more about the project:

Upstart does not share code with sysvinit, and its functionality is a super-set of it, and provides compatibility to some degree with the well known SysV init scripts. It's main feature is its event-based approach: starting and stopping of processes is bound to "events" happening in the system, where an "event" can be a lot of different things, such as: a network interfaces becomes available or some other software has been started.

Upstart does service serialization via these events: if the syslog-started event is triggered this is used as an indication to start D-Bus since it can now make use of Syslog. And then, when dbus-started is triggered, NetworkManager is started, since it may now use D-Bus, and so on.

One could say that this way the actual logical dependency tree that exists and is understood by the admin or developer is translated and encoded into event and action rules: every logical "a needs b" rule that the administrator/developer is aware of becomes a "start a when b is started" plus "stop a when b is stopped". In some way this certainly is a simplification: especially for the code in Upstart itself. However I would argue that this simplification is actually detrimental. First of all, the logical dependency system does not go away, the person who is writing Upstart files must now translate the dependencies manually into these event/action rules (actually, two rules for each dependency). So, instead of letting the computer figure out what to do based on the dependencies, the user has to manually translate the dependencies into simple event/action rules. Also, because the dependency information has never been encoded it is not available at runtime, effectively meaning that an administrator who tries to figure our why something happened, i.e. why a is started when b is started, has no chance of finding that out.

Furthermore, the event logic turns around all dependencies, from the feet onto their head. Instead of minimizing the amount of work (which is something that a good init system should focus on, as pointed out in the beginning of this blog story), it actually maximizes the amount of work to do during operations. Or in other words, instead of having a clear goal and only doing the things it really needs to do to reach the goal, it does one step, and then after finishing it, it does all steps that possibly could follow it.

Or to put it simpler: the fact that the user just started D-Bus is in no way an indication that NetworkManager should be started too (but this is what Upstart would do). It's right the other way round: when the user asks for NetworkManager, that is definitely an indication that D-Bus should be started too (which is certainly what most users would expect, right?).

A good init system should start only what is needed, and that on-demand. Either lazily or parallelized and in advance. However it should not start more than necessary, particularly not everything installed that could use that service.

Finally, I fail to see the actual usefulness of the event logic. It appears to me that most events that are exposed in Upstart actually are not punctual in nature, but have duration: a service starts, is running, and stops. A device is plugged in, is available, and is plugged out again. A mount point is in the process of being mounted, is fully mounted, or is being unmounted. A power plug is plugged in, the system runs on AC, and the power plug is pulled. Only a minority of the events an init system or process supervisor should handle are actually punctual, most of them are tuples of start, condition, and stop. This information is again not available in Upstart, because it focuses in singular events, and ignores durable dependencies.

Now, I am aware that some of the issues I pointed out above are in some way mitigated by certain more recent changes in Upstart, particularly condition based syntaxes such as start on (local-filesystems and net-device-up IFACE=lo) in Upstart rule files. However, to me this appears mostly as an attempt to fix a system whose core design is flawed.

Besides that Upstart does OK for babysitting daemons, even though some choices might be questionable (see above), and there are certainly a lot of missed opportunities (see above, too).

There are other init systems besides sysvinit, Upstart and launchd. Most of them offer little substantial more than Upstart or sysvinit. The most interesting other contender is Solaris SMF, which supports proper dependencies between services. However, in many ways it is overly complex and, let's say, a bit academic with its excessive use of XML and new terminology for known things. It is also closely bound to Solaris specific features such as the contract system.

Putting it All Together: systemd

Well, this is another good time for a little pause, because after I have hopefully explained above what I think a good PID 1 should be doing and what the current most used system does, we'll now come to where the beef is. So, go and refill you coffee mug again. It's going to be worth it.

You probably guessed it: what I suggested above as requirements and features for an ideal init system is actually available now, in a (still experimental) init system called systemd, and which I hereby want to announce. Again, here's the code. And here's a quick rundown of its features, and the rationale behind them:

systemd starts up and supervises the entire system (hence the name...). It implements all of the features pointed out above and a few more. It is based around the notion of units. Units have a name and a type. Since their configuration is usually loaded directly from the file system, these unit names are actually file names. Example: a unit avahi.service is read from a configuration file by the same name, and of course could be a unit encapsulating the Avahi daemon. There are several kinds of units:

  1. service: these are the most obvious kind of unit: daemons that can be started, stopped, restarted, reloaded. For compatibility with SysV we not only support our own configuration files for services, but also are able to read classic SysV init scripts, in particular we parse the LSB header, if it exists. /etc/init.d is hence not much more than just another source of configuration.
  2. socket: this unit encapsulates a socket in the file-system or on the Internet. We currently support AF_INET, AF_INET6, AF_UNIX sockets of the types stream, datagram, and sequential packet. We also support classic FIFOs as transport. Each socket unit has a matching service unit, that is started if the first connection comes in on the socket or FIFO. Example: nscd.socket starts nscd.service on an incoming connection.
  3. device: this unit encapsulates a device in the Linux device tree. If a device is marked for this via udev rules, it will be exposed as a device unit in systemd. Properties set with udev can be used as configuration source to set dependencies for device units.
  4. mount: this unit encapsulates a mount point in the file system hierarchy. systemd monitors all mount points how they come and go, and can also be used to mount or unmount mount-points. /etc/fstab is used here as an additional configuration source for these mount points, similar to how SysV init scripts can be used as additional configuration source for service units.
  5. automount: this unit type encapsulates an automount point in the file system hierarchy. Each automount unit has a matching mount unit, which is started (i.e. mounted) as soon as the automount directory is accessed.
  6. target: this unit type is used for logical grouping of units: instead of actually doing anything by itself it simply references other units, which thereby can be controlled together. Examples for this are: multi-user.target, which is a target that basically plays the role of run-level 5 on classic SysV system, or bluetooth.target which is requested as soon as a bluetooth dongle becomes available and which simply pulls in bluetooth related services that otherwise would not need to be started: bluetoothd and obexd and suchlike.
  7. snapshot: similar to target units snapshots do not actually do anything themselves and their only purpose is to reference other units. Snapshots can be used to save/rollback the state of all services and units of the init system. Primarily it has two intended use cases: to allow the user to temporarily enter a specific state such as "Emergency Shell", terminating current services, and provide an easy way to return to the state before, pulling up all services again that got temporarily pulled down. And to ease support for system suspending: still many services cannot correctly deal with system suspend, and it is often a better idea to shut them down before suspend, and restore them afterwards.

All these units can have dependencies between each other (both positive and negative, i.e. 'Requires' and 'Conflicts'): a device can have a dependency on a service, meaning that as soon as a device becomes available a certain service is started. Mounts get an implicit dependency on the device they are mounted from. Mounts also gets implicit dependencies to mounts that are their prefixes (i.e. a mount /home/lennart implicitly gets a dependency added to the mount for /home) and so on.

A short list of other features:

  1. For each process that is spawned, you may control: the environment, resource limits, working and root directory, umask, OOM killer adjustment, nice level, IO class and priority, CPU policy and priority, CPU affinity, timer slack, user id, group id, supplementary group ids, readable/writable/inaccessible directories, shared/private/slave mount flags, capabilities/bounding set, secure bits, CPU scheduler reset of fork, private /tmp name-space, cgroup control for various subsystems. Also, you can easily connect stdin/stdout/stderr of services to syslog, /dev/kmsg, arbitrary TTYs. If connected to a TTY for input systemd will make sure a process gets exclusive access, optionally waiting or enforcing it.
  2. Every executed process gets its own cgroup (currently by default in the debug subsystem, since that subsystem is not otherwise used and does not much more than the most basic process grouping), and it is very easy to configure systemd to place services in cgroups that have been configured externally, for example via the libcgroups utilities.
  3. The native configuration files use a syntax that closely follows the well-known .desktop files. It is a simple syntax for which parsers exist already in many software frameworks. Also, this allows us to rely on existing tools for i18n for service descriptions, and similar. Administrators and developers don't need to learn a new syntax.
  4. As mentioned, we provide compatibility with SysV init scripts. We take advantages of LSB and Red Hat chkconfig headers if they are available. If they aren't we try to make the best of the otherwise available information, such as the start priorities in /etc/rc.d. These init scripts are simply considered a different source of configuration, hence an easy upgrade path to proper systemd services is available. Optionally we can read classic PID files for services to identify the main pid of a daemon. Note that we make use of the dependency information from the LSB init script headers, and translate those into native systemd dependencies. Side note: Upstart is unable to harvest and make use of that information. Boot-up on a plain Upstart system with mostly LSB SysV init scripts will hence not be parallelized, a similar system running systemd however will. In fact, for Upstart all SysV scripts together make one job that is executed, they are not treated individually, again in contrast to systemd where SysV init scripts are just another source of configuration and are all treated and controlled individually, much like any other native systemd service.
  5. Similarly, we read the existing /etc/fstab configuration file, and consider it just another source of configuration. Using the comment= fstab option you can even mark /etc/fstab entries to become systemd controlled automount points.
  6. If the same unit is configured in multiple configuration sources (e.g. /etc/systemd/system/avahi.service exists, and /etc/init.d/avahi too), then the native configuration will always take precedence, the legacy format is ignored, allowing an easy upgrade path and packages to carry both a SysV init script and a systemd service file for a while.
  7. We support a simple templating/instance mechanism. Example: instead of having six configuration files for six gettys, we only have one getty@.service file which gets instantiated to getty@tty2.service and suchlike. The interface part can even be inherited by dependency expressions, i.e. it is easy to encode that a service dhcpcd@eth0.service pulls in avahi-autoipd@eth0.service, while leaving the eth0 string wild-carded.
  8. For socket activation we support full compatibility with the traditional inetd modes, as well as a very simple mode that tries to mimic launchd socket activation and is recommended for new services. The inetd mode only allows passing one socket to the started daemon, while the native mode supports passing arbitrary numbers of file descriptors. We also support one instance per connection, as well as one instance for all connections modes. In the former mode we name the cgroup the daemon will be started in after the connection parameters, and utilize the templating logic mentioned above for this. Example: sshd.socket might spawn services sshd@192.168.0.1-4711-192.168.0.2-22.service with a cgroup of sshd@.service/192.168.0.1-4711-192.168.0.2-22 (i.e. the IP address and port numbers are used in the instance names. For AF_UNIX sockets we use PID and user id of the connecting client). This provides a nice way for the administrator to identify the various instances of a daemon and control their runtime individually. The native socket passing mode is very easily implementable in applications: if $LISTEN_FDS is set it contains the number of sockets passed and the daemon will find them sorted as listed in the .service file, starting from file descriptor 3 (a nicely written daemon could also use fstat() and getsockname() to identify the sockets in case it receives more than one). In addition we set $LISTEN_PID to the PID of the daemon that shall receive the fds, because environment variables are normally inherited by sub-processes and hence could confuse processes further down the chain. Even though this socket passing logic is very simple to implement in daemons, we will provide a BSD-licensed reference implementation that shows how to do this. We have ported a couple of existing daemons to this new scheme.
  9. We provide compatibility with /dev/initctl to a certain extent. This compatibility is in fact implemented with a FIFO-activated service, which simply translates these legacy requests to D-Bus requests. Effectively this means the old shutdown, poweroff and similar commands from Upstart and sysvinit continue to work with systemd.
  10. We also provide compatibility with utmp and wtmp. Possibly even to an extent that is far more than healthy, given how crufty utmp and wtmp are.
  11. systemd supports several kinds of dependencies between units. After/Before can be used to fix the ordering how units are activated. It is completely orthogonal to Requires and Wants, which express a positive requirement dependency, either mandatory, or optional. Then, there is Conflicts which expresses a negative requirement dependency. Finally, there are three further, less used dependency types.
  12. systemd has a minimal transaction system. Meaning: if a unit is requested to start up or shut down we will add it and all its dependencies to a temporary transaction. Then, we will verify if the transaction is consistent (i.e. whether the ordering via After/Before of all units is cycle-free). If it is not, systemd will try to fix it up, and removes non-essential jobs from the transaction that might remove the loop. Also, systemd tries to suppress non-essential jobs in the transaction that would stop a running service. Non-essential jobs are those which the original request did not directly include but which where pulled in by Wants type of dependencies. Finally we check whether the jobs of the transaction contradict jobs that have already been queued, and optionally the transaction is aborted then. If all worked out and the transaction is consistent and minimized in its impact it is merged with all already outstanding jobs and added to the run queue. Effectively this means that before executing a requested operation, we will verify that it makes sense, fixing it if possible, and only failing if it really cannot work.
  13. We record start/exit time as well as the PID and exit status of every process we spawn and supervise. This data can be used to cross-link daemons with their data in abrtd, auditd and syslog. Think of an UI that will highlight crashed daemons for you, and allows you to easily navigate to the respective UIs for syslog, abrt, and auditd that will show the data generated from and for this daemon on a specific run.
  14. We support reexecution of the init process itself at any time. The daemon state is serialized before the reexecution and deserialized afterwards. That way we provide a simple way to facilitate init system upgrades as well as handover from an initrd daemon to the final daemon. Open sockets and autofs mounts are properly serialized away, so that they stay connectible all the time, in a way that clients will not even notice that the init system reexecuted itself. Also, the fact that a big part of the service state is encoded anyway in the cgroup virtual file system would even allow us to resume execution without access to the serialization data. The reexecution code paths are actually mostly the same as the init system configuration reloading code paths, which guarantees that reexecution (which is probably more seldom triggered) gets similar testing as reloading (which is probably more common).
  15. Starting the work of removing shell scripts from the boot process we have recoded part of the basic system setup in C and moved it directly into systemd. Among that is mounting of the API file systems (i.e. virtual file systems such as /proc, /sys and /dev.) and setting of the host-name.
  16. Server state is introspectable and controllable via D-Bus. This is not complete yet but quite extensive.
  17. While we want to emphasize socket-based and bus-name-based activation, and we hence support dependencies between sockets and services, we also support traditional inter-service dependencies. We support multiple ways how such a service can signal its readiness: by forking and having the start process exit (i.e. traditional daemonize() behaviour), as well as by watching the bus until a configured service name appears.
  18. There's an interactive mode which asks for confirmation each time a process is spawned by systemd. You may enable it by passing systemd.confirm_spawn=1 on the kernel command line.
  19. With the systemd.default= kernel command line parameter you can specify which unit systemd should start on boot-up. Normally you'd specify something like multi-user.target here, but another choice could even be a single service instead of a target, for example out-of-the-box we ship a service emergency.service that is similar in its usefulness as init=/bin/bash, however has the advantage of actually running the init system, hence offering the option to boot up the full system from the emergency shell.
  20. There's a minimal UI that allows you to start/stop/introspect services. It's far from complete but useful as a debugging tool. It's written in Vala (yay!) and goes by the name of systemadm.

It should be noted that systemd uses many Linux-specific features, and does not limit itself to POSIX. That unlocks a lot of functionality a system that is designed for portability to other operating systems cannot provide.

Status

All the features listed above are already implemented. Right now systemd can already be used as a drop-in replacement for Upstart and sysvinit (at least as long as there aren't too many native upstart services yet. Thankfully most distributions don't carry too many native Upstart services yet.)

However, testing has been minimal, our version number is currently at an impressive 0. Expect breakage if you run this in its current state. That said, overall it should be quite stable and some of us already boot their normal development systems with systemd (in contrast to VMs only). YMMV, especially if you try this on distributions we developers don't use.

Where is This Going?

The feature set described above is certainly already comprehensive. However, we have a few more things on our plate. I don't really like speaking too much about big plans but here's a short overview in which direction we will be pushing this:

We want to add at least two more unit types: swap shall be used to control swap devices the same way we already control mounts, i.e. with automatic dependencies on the device tree devices they are activated from, and suchlike. timer shall provide functionality similar to cron, i.e. starts services based on time events, the focus being both monotonic clock and wall-clock/calendar events. (i.e. "start this 5h after it last ran" as well as "start this every monday 5 am")

More importantly however, it is also our plan to experiment with systemd not only for optimizing boot times, but also to make it the ideal session manager, to replace (or possibly just augment) gnome-session, kdeinit and similar daemons. The problem set of a session manager and an init system are very similar: quick start-up is essential and babysitting processes the focus. Using the same code for both uses hence suggests itself. Apple recognized that and does just that with launchd. And so should we: socket and bus based activation and parallelization is something session services and system services can benefit from equally.

I should probably note that all three of these features are already partially available in the current code base, but not complete yet. For example, already, you can run systemd just fine as a normal user, and it will detect that is run that way and support for this mode has been available since the very beginning, and is in the very core. (It is also exceptionally useful for debugging! This works fine even without having the system otherwise converted to systemd for booting.)

However, there are some things we probably should fix in the kernel and elsewhere before finishing work on this: we need swap status change notifications from the kernel similar to how we can already subscribe to mount changes; we want a notification when CLOCK_REALTIME jumps relative to CLOCK_MONOTONIC; we want to allow normal processes to get some init-like powers; we need a well-defined place where we can put user sockets. None of these issues are really essential for systemd, but they'd certainly improve things.

You Want to See This in Action?

Currently, there are no tarball releases, but it should be straightforward to check out the code from our repository. In addition, to have something to start with, here's a tarball with unit configuration files that allows an otherwise unmodified Fedora 13 system to work with systemd. We have no RPMs to offer you for now.

An easier way is to download this Fedora 13 qemu image, which has been prepared for systemd. In the grub menu you can select whether you want to boot the system with Upstart or systemd. Note that this system is minimally modified only. Service information is read exclusively from the existing SysV init scripts. Hence it will not take advantage of the full socket and bus-based parallelization pointed out above, however it will interpret the parallelization hints from the LSB headers, and hence boots faster than the Upstart system, which in Fedora does not employ any parallelization at the moment. The image is configured to output debug information on the serial console, as well as writing it to the kernel log buffer (which you may access with dmesg). You might want to run qemu configured with a virtual serial terminal. All passwords are set to systemd.

Even simpler than downloading and booting the qemu image is looking at pretty screen-shots. Since an init system usually is well hidden beneath the user interface, some shots of systemadm and ps must do:

systemadm

That's systemadm showing all loaded units, with more detailed information on one of the getty instances.

ps

That's an excerpt of the output of ps xaf -eo pid,user,args,cgroup showing how neatly the processes are sorted into the cgroup of their service. (The fourth column is the cgroup, the debug: prefix is shown because we use the debug cgroup controller for systemd, as mentioned earlier. This is only temporary.)

Note that both of these screenshots show an only minimally modified Fedora 13 Live CD installation, where services are exclusively loaded from the existing SysV init scripts. Hence, this does not use socket or bus activation for any existing service.

Sorry, no bootcharts or hard data on start-up times for the moment. We'll publish that as soon as we have fully parallelized all services from the default Fedora install. Then, we'll welcome you to benchmark the systemd approach, and provide our own benchmark data as well.

Well, presumably everybody will keep bugging me about this, so here are two numbers I'll tell you. However, they are completely unscientific as they are measured for a VM (single CPU) and by using the stop timer in my watch. Fedora 13 booting up with Upstart takes 27s, with systemd we reach 24s (from grub to gdm, same system, same settings, shorter value of two bootups, one immediately following the other). Note however that this shows nothing more than the speedup effect reached by using the LSB dependency information parsed from the init script headers for parallelization. Socket or bus based activation was not utilized for this, and hence these numbers are unsuitable to assess the ideas pointed out above. Also, systemd was set to debug verbosity levels on a serial console. So again, this benchmark data has barely any value.

Writing Daemons

An ideal daemon for use with systemd does a few things differently then things were traditionally done. Later on, we will publish a longer guide explaining and suggesting how to write a daemon for use with this systemd. Basically, things get simpler for daemon developers:

The list above is very similar to what Apple recommends for daemons compatible with launchd. It should be easy to extend daemons that already support launchd activation to support systemd activation as well.

Note that systemd supports daemons not written in this style perfectly as well, already for compatibility reasons (launchd has only limited support for that). As mentioned, this even extends to existing inetd capable daemons which can be used unmodified for socket activation by systemd.

So, yes, should systemd prove itself in our experiments and get adopted by the distributions it would make sense to port at least those services that are started by default to use socket or bus-based activation. We have written proof-of-concept patches, and the porting turned out to be very easy. Also, we can leverage the work that has already been done for launchd, to a certain extent. Moreover, adding support for socket-based activation does not make the service incompatible with non-systemd systems.

FAQs

Who's behind this?
Well, the current code-base is mostly my work, Lennart Poettering (Red Hat). However the design in all its details is result of close cooperation between Kay Sievers (Novell) and me. Other people involved are Harald Hoyer (Red Hat), Dhaval Giani (Formerly IBM), and a few others from various companies such as Intel, SUSE and Nokia.
Is this a Red Hat project?
No, this is my personal side project. Also, let me emphasize this: the opinions reflected here are my own. They are not the views of my employer, or Ronald McDonald, or anyone else.
Will this come to Fedora?
If our experiments prove that this approach works out, and discussions in the Fedora community show support for this, then yes, we'll certainly try to get this into Fedora.
Will this come to OpenSUSE?
Kay's pursuing that, so something similar as for Fedora applies here, too.
Will this come to Debian/Gentoo/Mandriva/MeeGo/Ubuntu/[insert your favourite distro here]?
That's up to them. We'd certainly welcome their interest, and help with the integration.
Why didn't you just add this to Upstart, why did you invent something new?
Well, the point of the part about Upstart above was to show that the core design of Upstart is flawed, in our opinion. Starting completely from scratch suggests itself if the existing solution appears flawed in its core. However, note that we took a lot of inspiration from Upstart's code-base otherwise.
If you love Apple launchd so much, why not adopt that?
launchd is a great invention, but I am not convinced that it would fit well into Linux, nor that it is suitable for a system like Linux with its immense scalability and flexibility to numerous purposes and uses.
Is this an NIH project?
Well, I hope that I managed to explain in the text above why we came up with something new, instead of building on Upstart or launchd. We came up with systemd due to technical reasons, not political reasons.
Don't forget that it is Upstart that includes a library called NIH (which is kind of a reimplementation of glib) -- not systemd!
Will this run on [insert non-Linux OS here]?
Unlikely. As pointed out, systemd uses many Linux specific APIs (such as epoll, signalfd, libudev, cgroups, and numerous more), a port to other operating systems appears to us as not making a lot of sense. Also, we, the people involved are unlikely to be interested in merging possible ports to other platforms and work with the constraints this introduces. That said, git supports branches and rebasing quite well, in case people really want to do a port.
Actually portability is even more limited than just to other OSes: we require a very recent Linux kernel, glibc, libcgroup and libudev. No support for less-than-current Linux systems, sorry.
If folks want to implement something similar for other operating systems, the preferred mode of cooperation is probably that we help you identify which interfaces can be shared with your system, to make life easier for daemon writers to support both systemd and your systemd counterpart. Probably, the focus should be to share interfaces, not code.
I hear [fill one in here: the Gentoo boot system, initng, Solaris SMF, runit, uxlaunch, ...] is an awesome init system and also does parallel boot-up, so why not adopt that?
Well, before we started this we actually had a very close look at the various systems, and none of them did what we had in mind for systemd (with the exception of launchd, of course). If you cannot see that, then please read again what I wrote above.

Contributions

We are very interested in patches and help. It should be common sense that every Free Software project can only benefit from the widest possible external contributions. That is particularly true for a core part of the OS, such as an init system. We value your contributions and hence do not require copyright assignment (Very much unlike Canonical/Upstart!). And also, we use git, everybody's favourite VCS, yay!

We are particularly interested in help getting systemd to work on other distributions, besides Fedora and OpenSUSE. (Hey, anybody from Debian, Gentoo, Mandriva, MeeGo looking for something to do?) But even beyond that we are keen to attract contributors on every level: we welcome C hackers, packagers, as well as folks who are interested to write documentation, or contribute a logo.

Community

At this time we only have source code repository and an IRC channel (#systemd on Freenode). There's no mailing list, web site or bug tracking system. We'll probably set something up on freedesktop.org soon. If you have any questions or want to contact us otherwise we invite you to join us on IRC!

Update: our GIT repository has moved.

posted at: 10:46 | path: /projects | permanent link to this entry | 174 comments

Tue, 20 Apr 2010

A Few Notes on Bloom Filters

For future reference (mostly for myself), here's a little summary of how to use Bloom filters in real world applications.

Most references are terse and vague on how to pick the hash functions for bloom filters, so here's some detail about that: For small filters, just use a boring and fast hash function like the djb hash function and split up the 32bit result into smaller independent chunks for each of the k hash indexes you'll need. Often those 32 bits already provide enough hash bits to get enough independent bloom filter indexes. And if they don't you basically have three options:

The size of the bloom filter and the number of hash functions you should be using depending on your application can be calculated using the formulas on the Wikipedia page:

m = -n*ln(p)/(ln(2)^2)

This will tell you the number of bits m to use for your filter, given the number n of elements in your filter and the false positive probability p you want to achieve. All that for the ideal number of hash functions k which you can calculate like this:

k = 0.7*m/n

And that's already everything you need to know to build good bloom filters. If you know the p and n for your use case the above will tell you the m and k, and how to choose the k hash functions.

Bloom filters are a really really useful tool, and given their simplicity something every developer should be aware of.

(And in case you were wondering what this all is about, Kay Sievers and I were discussing using bloom filters in the libudev netlink BSD socket filters, to allow monitoring a certain subset of devices that is orthogonal to the usual subsystem hierarchy, and all that in a way where the number of wakeups in listening clients is minimized)

posted at: 22:01 | path: /projects | permanent link to this entry | 1 comments

Sun, 04 Apr 2010

Down the Amazon II

As a followup to this blog story here are a couple of non-panorama shots from the trip:

Image 206 Image 144 Image 142 Image 113 Image 96 Image 897 Image 91

Image 751 Image 112 Image 72 Image 131 Image 233 Image 488 Image 249

Image 272 Image 356 Image 393 Image 234 Image 435 Image 450 Image 485

Image 60 Image 502 Image 753 Image 822 Image 951 Image 960 Image 85

Image 199 Image 653 Image 194 Image 164 Image 89 Image 231 Image 240 Image 263 Image 685

Image 331 Image 334 Image 337 Image 389 Image 537 Image 570 Image 582 Image 197 Image 655

Image 660 Image 108 Image 697 Image 710 Image 747 Image 705 Image 776 Image 832

posted at: 00:59 | path: /photos | permanent link to this entry | 0 comments

Sat, 03 Apr 2010

Down the Amazon

After BOSSA in Manaus/Brazil we took a very enjoyable boat trip down the Amazon, to Santarém and particularly Alter do Chão, a ridiculously amazing island paradise with glaring white sand in the middle of the jungle:

Tapajos 2

The town is located on the Tapajós River:

Tapajos 1

Tapajos 3

Up the river you find the Tapajós National Forest:

Tapajos 4

From there we went on to São Luís, a beautiful old colonial town:

Sao Luis 1

Sao Luis 3

Sao Luis 4

Sao Luis 2

A windy and wet sailing catamaran ride from São Luís you find Alcântara, another old colonial town, now partly in ruins and deserted:

Alcantara 1

Alcantara 2

Alcantara 3

posted at: 22:25 | path: /photos | permanent link to this entry | 5 comments

Tue, 23 Mar 2010

Public Service Announcement: Beware of rsvg_term()!

As a short followup on an older blog posting of mine:

So you are using librsvg's rsvg_term() in your code? If so then you are probably misusing it and triggering crashes in PulseAudio related code. The same way everybody should stop using libxml2's xmlCleanupParser() call, stop using rsvg_term()! It's really hard to use it correctly, and uneeded anyway. Also see this bug report.

posted at: 21:29 | path: /projects | permanent link to this entry | 5 comments

Tue, 09 Mar 2010

Bossa 2010/Manaus Slides

The slides for my talk about the audio infrastructure of Linux mobile devices at BOSSA 2010 in Manaus/Brazil are now available online. They are terse (as usual), and the most interesting stuff is probably in what I said, and not so much in what I wrote in those slides. But nonetheless I believe this might still be quite interesting for attendees as well as non-attendees.

The talk focuses on the audio architecture of the Nokia N900 and the Palm Pre, and of course particularly their use of PulseAudio for all things audio. I analyzed and compared their patch sets to figure out what their priorities are, what we should move into PulseAudio mainline, and what should better be left in their private patch sets.

posted at: 19:04 | path: /projects | permanent link to this entry | 2 comments

Wed, 24 Feb 2010

Measure Your Sound Card!

In recent versions PulseAudio integrates the numerous mixer elements ALSA exposes into one single powerful slider which tries to make the best of the granularity and range of the hardware and extends that in software so that we can provide an equally powerful slider on all systems. That means if your hardware only supports a limited volume range (many integrated USB speakers for example cannot be completely muted with the hardware volume slider), limited granularity (some hardware sliders only have 8 steps or so), or no per-channel volumes (many sound cards have a single slider that covers all channels), then PulseAudio tries its best to make use of the next hardware volume slider in the pipeline to compensate for that, and so on, finally falling back to software for everything that cannot be done in hardware. This is explained in more detail here.

Now this algorithm depends on that we know the actual attenuation factors (factors like that are usually written in units of dB which is why I will call this the "dB data" from now on) of the hardware volume controls. Thankfully ALSA includes that information in its driver interfaces. However for some hardware this data is not reliable. For example, one of my own cards (a Terratec Aureon 5.1 MkII USB) contains invalid dB data in its USB descriptor and ALSA passes that on to PulseAudio. The effect of that is that the PulseAudio volume control behaves very weirdly for this card, in a way that the volume "jumps" and changes in unexpected ways (or doesn't change at all in some ranges!) when you slowly move the slider, or that the volume is completely muted over large ranges of the slider where it should not be. Also this breaks the flat volume logic in PulseAudo, to the result that playing one stream (let's say a music stream) and then adding a second one (let's say an event sound) might incorrectly attenuate the first one (i.e. whenever you play an event sound the music changes in volume).

Incorrect dB data is not a new problem. However PulseAudio is the first application that actually depends on the correctness of this data. Previously the dB info was shown as auxiliary information in some volume controls, and only noticed and understood by very few, technical people. It was not used for further calculations.

Now, the reasons I am writing this blog posting are firstly to inform you about this type of bug and the results it has on the logic PulseAudio implements, and secondly (and more importantly) to point you to this little Wiki page I wrote that explains how to verify if this is indeed a problem on your card (in case you are experiencing any of the symptoms mentioned above) and secondly what to do to improve the situation, and how to get correct dB data that can be included as quirk in your driver.

Thank you for your attention.

posted at: 01:49 | path: /projects | permanent link to this entry | 0 comments

Sun, 21 Feb 2010

Horizontal Panoramas Are So 2009!

Horizontal panoramas are so 2009 -- which is why I now give you the vertical panorama:

Brussels Cathedral

Now if I wasn't too stupid to hold my camera steady shooting upwards, this could actually have been a really good picture.

posted at: 02:31 | path: /photos | permanent link to this entry | 8 comments

Speaker Setup

While tracking down some surround sound related bugs I was missing a speaker setup and testing utility. So I decided to do something about it and I present you gnome-speaker-setup:

gnome-speaker-setup

The tool should be very robust and even deal with the weirdest channel mappings. OTOH the artwork is not really good and appropriate. But I hope it still shows some resemblance to other UIs of this type. If you are an artist wand want to contribute better artwork make sure to go through the Gnome Art Requests page, and more specifically this particular request.

This (or something like it) will hopefully and eventually end up in some way or another in gnome-media. Until that day comes I'll maintain this tool independently.

To compile this you need a recent Vala and libcanberra 0.23.

posted at: 00:58 | path: /projects | permanent link to this entry | 18 comments

Tue, 19 Jan 2010

India, 360 Degrees at a Time, Part Seven

Here's the seventh and final part of my ongoing series.

One of the grandest sights in Delhi is Humayun's tomb, a predecessor of the greatest mausoleum of them all, the Taj Mahal:

Humayun's Tomb

A little bit further down a view on the garden:

Humayun's Tomb

From a different corner:

Humayun's Tomb

We'll finish with our last panorama that shows the courtyard the Jama Masjid of Old Delhi:

Jama Masjid

That's all panoramas from this trip. Thanks for your interest.

posted at: 21:43 | path: /photos | permanent link to this entry | 2 comments

Mon, 18 Jan 2010

India, 360 Degrees at a Time, Part Six

Here's the sixth part of my ongoing series.

Leaving Jodhpur we continued our journey to Jaisalmer, a sand castle of a town in the Thar desert:

Jaisalmer

In the vicinity of Jaisalmer you'll find cliche sand dunes like you'd expect from a grown-up desert:

Jaisalmer

Our next station after a long, cold and dusty train ride was Delhi. The principal mosque of Old Delhi is the Jama Masjid:

Jama Masjid

That's all for now, tomorrow I'll post the rest of my panoramas from this trip, all from Delhi.

posted at: 22:14 | path: /photos | permanent link to this entry | 1 comments

Sun, 17 Jan 2010

India, 360 Degrees at a Time, Part Five

Here's the fourth part of my ongoing series.

After Udaipur the next stop on our trip was Jodhpur, the blue city. Which is called that way due of the blue colour of many of its houses:

Jodhpur

On a hill next to Mehrangarh Fort, one of the biggest Forts in India (the big sand castle on the hill in the panorama above), you find the Jaswant Thada, a memorial of the Maharajas of Jodhpur:

Jodhpur

Inside the fort you'll find highly decorated courtyards:

Jodhpur

That's all for Jodhpur, tomorrow I'll post more panoramas, from other stops of our trip.

posted at: 18:43 | path: /photos | permanent link to this entry | 0 comments

Sat, 16 Jan 2010

India, 360 Degrees at a Time, Part Four

Here's the fourth part of my ongoing series.

After Hampi we went to Bangalore to attend foss.in. (Fantastic conference, btw. The concerts at the venue are unparalleled.) From there we flew up to Udaipur, in Rajasthan. Udaipur is (among other things) famous for being the place where the central scenes of Octopussy were filmed. Octopussy's famous white palace is on Jagniwas Island in Lake Pichola:

Udaipur

This panorama was taken from another island in the lake, Jagmandir Island, which is visible in the following shot on the left:

Udaipur

Udaipur's scenery, seen from the Maharaja's City Palace down onto Pichola Lake:

Udaipur

That's all for Udaipur, tomorrow I'll post more panoramas, from other stops of our trip.

posted at: 03:10 | path: /photos | permanent link to this entry | 5 comments

Announcing udev-browse

It's easy to get lost in /sys and not much fun typing long udevadm info command lines all the time. Today, when I had enough of that I sat down and spent an hour to write a little UI for exploring the udev/sysfs tree: udev-browse. I wrote it for my own use, but I am quite sure I am not the only one who wants a little bit simpler access to the device tree. So here you go.

And since everybody loves screenshots here you go:

udev-browse

Two usability hints: if you run udev-browse from a directory in /sys udev-browse will automatically present the device of that path on startup. And if you know the name of a device you can just type it into the device listbox (which is focussed by default). The usual Gtk+ live search will then find you the right entry right-away. It's pretty nifty.

It's written in Vala with minimal dependencies.

I want to keep the maintainership burden for this minimal. So no tarballs, no releases, and I won't reply to your emails regarding this tool, unless they include a good, clean, git formatted patch. Thank you for your understanding.

Anyone wants to package this for Fedora? I'd be very thankful if someone would pick it up.

Have fun.

posted at: 02:19 | path: /projects | permanent link to this entry | 9 comments

Thu, 14 Jan 2010

India, 360 Degrees at a Time, Part Three

Here's the third part of my ongoing series.

Still in Hampi here's another 360 from the Hills in Hampi down to the Achyutaraya Temple:

Matanga Hill

A little further down, before dawn, here's a shot from the rocky path leading up the hill:

Matanga Hill

Our last picture for today is a view down from Hemakuta Hill which is covered with old temples and other structures. In the middle you'll see the large Virupaksha Temple which is still in full use. In that temple you'll find an amazing camera obscura, a physics teacher's dream that projects the temple tower onto a wall (projection, subject, more interesting in reality. Really.)

Hemakuta Hill

That's all for Hampi, tomorrow I'll post more panoramas, from other stops of our trip.

posted at: 23:47 | path: /photos | permanent link to this entry | 3 comments

Wed, 13 Jan 2010

Public Service Announcement: Beware of xmlCleanupParser()!

Everyone and his dog seem to call libxml2's xmlCleanupParser() at inappropriate places. For example Empathy does it, and Abiword does it too. Google Code Search seems to reveal at least Inkscape and Dia do it as well.

So, please, if your project links against libxml2 verify that it calls xmlCleanupParser() only once, and right before exiting! And if it calls it more often or somewhere else, then please fix that!

For more information see my post on fedora-devel.

Thanks for your time.

posted at: 00:29 | path: /projects | permanent link to this entry | 5 comments

Tue, 12 Jan 2010

India, 360 Degrees at a Time, Part Two

Here's the second part of my ongoing series.

Climbing down the hills, on the banks of the Tungabhadra river you find people washing laundry and bathing, and coracles waiting to be used for a trip through the river.

Tungabhadra River

The greatest of the ancient temples in Hampi is the Vitthala Temple:

Vitthala Temple

Set in in lush green scenery you find the Achyutaraya Temple, which you already might have seen, from above, in yesterday's series:

Achyutaraya Temple

That's it for today, tomorrow I'll post more panoramas, both from Hampi and other stops of our trip.

posted at: 19:05 | path: /photos | permanent link to this entry | 2 comments

Mon, 11 Jan 2010

India, 360 Degrees at a Time, Part One

Yes, I won't spare you my panorama shots from my recent trip to India. After arriving in Goa Badami was our next stop. It's a very pretty little town in northern Karnataka, and here's a panorama shot from the entrance of the town's famous caves:

Badami

Next step was one of the most amazing places on earth, Hampi in central Karnataka. It is definitely one of the greatest sights I have ever seen, and I guess I can say I have seen quite a few in my life. A vast landscape of hills covered in boulders, lush mango and banana plantations, rice fields, dotted with age-old temples and impressive ruins. Locals crossing the river in coracles that look like they belong in a time centuries ago. Women washing colourful laundry in the river, pilgrims wading across the river in their black clothes. An India that delivers every bit of that promise it makes to its visitors. The ruins rival the grand sites in Greece and the landscape sometimes looks like a Crysis in-game scene.

Taken from one of the hills in Hampi this is the sunset:

Hampi Sunset

And then, the next day at dawn make your way up the hills again and you can get an even greater view on the whole scenery:

Hampi Dawn

That's it for today, tomorrow I'll post more panoramas, both from Hampi and other stops of our trip.

Also, if you haven't seen them yet, don't miss my panoramas from my India trip the year before.

posted at: 20:56 | path: /photos | permanent link to this entry | 6 comments

Thu, 31 Dec 2009

Jodhpur After Dark

Jodhpur Jodhpur Jodhpur

India is a weird and beautiful country. And I am too lazy to retouch my photos.

posted at: 16:33 | path: /photos | permanent link to this entry | 3 comments

Fri, 13 Nov 2009

On OOM

Building on what Havoc wrote two years ago about the fallacies of OOM safety (Out Of Memory) in user code I'd like to point you to this little mail I just posted to jack-devel which tries to give you the bigger picture. Should be interesting for non-audio folks, too.

Say NO to OOM safety!

posted at: 02:25 | path: /projects | permanent link to this entry | 10 comments

Fri, 06 Nov 2009

Public Service Announcement

Folks! Since quite some time now the kernel exports the DMI machine information below /sys/class/dmi/id/. You may stop now parsing the output of dmidecode thus depending on external tools and privileged code.

For example, to read your BIOS vendor string all you need to do is this:

$ read bv < /sys/class/dmi/id/bios_vendor
$ echo $bv

Which is of course much simpler, and cleaner, and safer than anything involving dmidecode.

Thank you for your time!

posted at: 11:14 | path: /projects | permanent link to this entry | 5 comments


It should be obvious but in case it isn't: the opinions reflected here are my own. They are not the views of my employer, or Ronald McDonald, or anyone else.

Please note that I take the liberty to delete any comments posted here that I deem inappropriate, off-topic, or insulting. And I excercise this liberty quite agressively. So yes, if you comment here, I might censor you. If you don't want to be censored your are welcome to comment on your own blog instead.


Lennart Poettering <mzoybt (at) 0pointer (dot) net>
Syndicated on Planet GNOME, Planet Fedora, planet.freedesktop.org, Planet Debian Upstream. feed RSS 0.91, RSS 2.0
Archives: 2005, 2006, 2007, 2008, 2009, 2010

Valid XHTML 1.0 Strict!   Valid CSS!