レナート   TBFKAYIBYNYAAYB   ﻟﻴﻨﺎﺭﺕ

Mon, 05 Nov 2007

Emulated atomic operations and real-time scheduling

Unfortunately not all CPU architectures have native support for atomic operations, or only support a very limited subset. Most prominently ARMv5 (and older) hasn't any support besides the most basic atomic swap operation[1]. Now, more and more free code is starting to use atomic operations and lock-free algorithms, one being my own project, PulseAudio. If you have ever done real-time programming you probably know that you cannot really do it without support for atomic operations. One question remains however: what to do on CPUs which support only the most basic atomic operations natively?

On the kernel side atomic ops are very easy to emulate: just disable interrupts temporarily, then do your operation non-atomically, and afterwards enable them again. That's relatively cheap and works fine (unless you are on SMP -- which fortunately you usually are not for those CPUs). The Linux kernel does it this way and it is good. But what to do in user-space, where you cannot just go and disable interrupts?

Let's see how the different userspace libraries/frameworks do it for ARMv5, a very relevant architecture that only knows an atomic swap (exchange) but no CAS or even atomic arithmetics. Let's start with an excerpt from glibc's atomic operations implementation for ARM:

/* Atomic compare and exchange.  These sequences are not actually atomic;
   there is a race if *MEM != OLDVAL and we are preempted between the two
   swaps.  However, they are very close to atomic, and are the best that a
   pre-ARMv6 implementation can do without operating system support.
   LinuxThreads has been using these sequences for many years.  */

This comment says it all. Not good. The more you make use of atomic operations the more likely you're going to to hit this race. Let's hope glibc is not a heavy user of atomic operations. PulseAudio however is, and PulseAudio happens to be my focus.

Let's have a look on how Qt4 does it:

extern Q_CORE_EXPORT char q_atomic_lock;

inline char q_atomic_swp(volatile char *ptr, char newval)
{
    register int ret;
    asm volatile("swpb %0,%1,[%2]"
                 : "=&r"(ret)
                 : "r"(newval), "r"(ptr)
                 : "cc", "memory");
    return ret;
}

inline int q_atomic_test_and_set_int(volatile int *ptr, int expected, int newval)
{
    int ret = 0;
    while (q_atomic_swp(&q_atomic_lock, ~0) != 0);
    if (*ptr == expected) {
	*ptr = newval;
	ret = 1;
    }
    q_atomic_swp(&q_atomic_lock, 0);
    return ret;
}

So, what do we have here? A slightly better version. In standard situations it actually works. But it sucks big time, too. Why? It contains a spin lock: the variable q_atomic_lock is used for locking the atomic operation. The code tries to set it to non-zero, and if that fails it tries again, until it succeeds, in the hope that the other thread -- which currently holds the lock -- gives it up. The big problem here is: it might take a while until that happens, up to 1/HZ time on Linux. Usually you want to use atomic operations to minimize the need for mutexes and thus speed things up. Now, here you got a lock, and it's the worst kind: the spinning lock. Not good. Also, if used from a real-time thread the machine simply locks up when we enter the loop in contended state, because preemption is disabled for RT threads and thus the loop will spin forever. Evil. And then, there's another problem: it's a big bottleneck, because all atomic operations are synchronized via a single variable which is q_atomic_lock. Not good either. And let's not forget that only code that has access to q_atomic_lock actually can execute this code safely. If you want to use it for lock-free IPC via shared memory this is going to break. And let's not forget that it is unusable from signal handlers (which probably doesn't matter much, though). So, in summary: this code sucks, too.

Next try, let's have a look on how glib does it:

static volatile int atomic_spin = 0;

static int atomic_spin_trylock (void)
{
  int result;

  asm volatile (
    "swp %0, %1, [%2]\n"
    : "=&r,&r" (result)
    : "r,0" (1), "r,r" (&atomic_spin)
    : "memory");
  if (result == 0)
    return 0;
  else
    return -1;
}

static void atomic_spin_lock (void)
{
  while (atomic_spin_trylock())
    sched_yield();
}

static void atomic_spin_unlock (void)
{
  atomic_spin = 0;
}

gint
g_atomic_int_exchange_and_add (volatile gint *atomic,
			       gint           val)
{
  gint result;

  atomic_spin_lock();
  result = *atomic;
  *atomic += val;
  atomic_spin_unlock();

  return result;
}

Once again, a spin loop. However, this implementation makes use of sched_yield() for asking the OS to reschedule. It's a bit better than the Qt version, since it doesn't spin just burning CPU, but instead tells the kernel to execute something else, increasing the chance that the thread currently holding the lock is scheduled. It's a bit friendlier, but it's not great either because this might still delay execution quite a bit. It's better then the Qt version. And probably one of the very few ligitimate occasions where using sched_yield() is OK. It still doesn't work for RT -- because sched_yield() in most cases is a NOP on for RT threads, so you still get a machine lockup. And it still has the one-lock-to-rule-them-all bottleneck. And it still is not compatible with shared memory.

Then, there's libatomic_ops. It's the most complex code, so I'll spare you to paste it here. Basically it uses the same spin loop. With three differences however:

  1. 16 lock variables instead of a single one are used. The variable that is used is picked via simple hashing of the pointer to the atomic variable that shall be modified. This removes the one-lock-to-rule-them-all bottleneck.
  2. Instead of pthread_yield() it uses select() with a small timeval parameter to give the current holder of the lock some time to give it up. To make sure that the select() is not optimized away by the kernel and the thread thus never is preempted the sleep time is increased on every loop iteration.
  3. It explicitly disables signals before doing the atomic operation.

It's certainly the best implementation of the ones discussed here: It doesn't suffer by the one-lock-to-rule-them-all bottleneck. It's (supposedly) signal handler safe (which however comes at the cost of doing two syscalls on every atomic operation -- probably a very high price). It actually works on RT, due to sleeping for an explicit time. However it still doesn't deal with priority inversion problems -- which is a big issue for real-time programming. Also, the time slept in the select() call might be relatively long, since at least on Linux the time passed to select() is rounded up to 1/HZ -- not good for RT either. And then, it still doesn't work for shared memory IPC.

So, what do we learn from this? At least one thing: better don't do real-time programming with ARMv5[2]. But more practically, how could a good emulation for atomic ops, solely based on atomic swap look like? Here are a few ideas:

Yepp, that's as good as it gets. Unfortunately I cannot serve you the optimal solution on a silver platter. I never actually did development for ARMv5, this blog story just sums up my thoughts on all the code I saw which emulates atomic ops on ARMv5. But maybe someone who actually cares about atomic operations on ARM finds this interesting and maybe invests some time to prepare patches for Qt, glib, glibc -- and PulseAudio.

Update: I added two more ideas to the list above.

Update 2: Andrew Haley just posted something like the optimal solution for the problem. It would be great if people would start using this.

Footnotes

[1] The Nokia 770 has an ARMv5 chip, N800 has ARMv6. The OpenMoko phone apparently uses ARMv5.

[2] And let's not even think about CPUs which don't even have an atomic swap!

[3] Which however you probably won't, given that they're only available on x86 on stable Linux kernels for now -- but still, it's cleaner.

posted at: 00:17 | path: /projects | permanent link to this entry | 17 comments


Posted by Andreas at Mon Nov 5 01:06:23 2007
Interesting stuff.
My guess about the Qt version is that it was implemented it to be correct first and it was intended to improve performance later if/when people need it.
By the way, which devices/use cases are you targeting with the ARM5 version of PulseAudio?

Posted by Lennart at Mon Nov 5 01:53:28 2007
Andreas: Atomic operations are nothing more than an optimization anyway. So why have them if they themselves are not optimized?

I am not working on an ARMv5 version PA. As mentioned in the last paragraph I never coded for ARMv5 also myself don't care too much about it. These are just my thoughts.

Posted by Andreas at Mon Nov 5 02:40:18 2007
Lennart, the thing is that first of all code that uses a multiplatform toolkit should compile on any platform so you need an implementation of everything. The speed of that implementation is not as important as being compatible.

Posted by Paul Betts at Mon Nov 5 06:30:42 2007
Look at how the Linux Kernel does it - I seem to remember they've solved this problem for at least one arch. Specifically, I seem to remember the SPARC implementation did some clever things to pretend to have atomic ops.

Posted by Helge at Mon Nov 5 09:23:16 2007
There is an even better way to solve this problem, but it requires some kernel support; suppose you want to atomically add 1 to a variable v, let the compiler generate asm code for the following:

label0: int tmp=v;
label1: tmp=tmp+1;
label2: v=tmp;

now you teach the kernel about "restart points" for "preemption sections": if the thread is preempted at either address label1 or label2, do not resume from the same address, but jump back to the restart point label0.

ARMv5 does not allow SMP, so this is in fact safe and BTW considerably more efficient than all other approaches; I have seen this implemented, but I don't think Linux does this currently.

Regards

Posted by Koen at Mon Nov 5 09:32:01 2007
Both the fic-gta01 adn fic-gta02 versions for the neo1973 (openmoko is a company, not a device) are usig the arm920t, which is armv4t, not armv5. BTW, libgcc seems to contain atomic operations as well, since alsa-lib doesn't compile if you gcc lacks it (e.g AVR32 and gcc 4.1.x)

Posted by tinus at Mon Nov 5 09:47:59 2007
Have you checked how often these locks really hit? Probably in real life, it's not as often as you think, so it's not that important if resolution is a bit slow.

Posted by Lennart at Mon Nov 5 15:11:31 2007
Paul: As I wrote in the blog story, the Linux kernel disables interrupts temporarily to implement atomic operations on ARM. This is not doable from user space.

Koen: gcc has __sync style atomic ops. But they are not available on all archs. ARM being one of them

tinus: The thing is that the lock is taken on every single atomic operation. The more threads you have, the more atomic operations you use, the more often you hit the lock.

Posted by Andrew Haley at Mon Nov 5 15:39:00 2007
The user space way to do this is much easier than you suggest:

typedef int (__kernel_cmpxchg_t)(int oldval, int newval, int *ptr);

#define __kernel_cmpxchg (*(__kernel_cmpxchg_t *)0xffff0fc0)

That's it: the kernel places the appropriate string of instructions at 0xffff0fc0.  If you have an ARM that can properly support atomic compare and exchange, the instructions will be there; it it can't there will be a syscall.  All you have to do is call it.

Posted by Lennart at Mon Nov 5 15:42:11 2007
Andrew: that's awesome! Great stuff! It's a pity it isn't used more. Is this documented anywhere?

Posted by Andrew Haley at Mon Nov 5 17:59:35 2007
I haven't found any documentation outside the kernel source.

It is quite well-documented there, though: see arch/arm/kernel/entry-armv.S

http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob_plain;f=arch/arm/kernel/entry-armv.S;hb=HEAD

Posted by Federico at Wed Feb 6 11:28:41 2008
I'm not an expert but I've read that the ARM processors have an "undefined instruction" interrupt, that is designed to emulate by software undefined instructions. If these atomic instructions are so important, why not emulating them?

Posted by Lennart at Tue Feb 12 14:34:03 2008
Federico: Good question. Maybe traps are just very slow? But honestly, I have no idea.

Posted by Federico at Sat Feb 16 20:58:33 2008
I don't think that undefined instruction interrups could be slow. What I think is, since some processors have those atomic instruction, that it could be possible to use THE VERY SAME OPCODES, and this will allow to compile userspace packages in only ONE version, for all ARM processors. If the code runs on a processor where those instructions exist, it will run smoothly. If instead it is run on a processor that does not have them, then this will cause undefined instruction interrups, and the kernel will emulate them.

Posted by Dan Leinir Turthra Jensen at Tue Jul 22 13:47:05 2008
The OpenMoko Freerunner uses ARMv4, as mentioned on this page: http://wiki.openmoko.org/wiki/Neo_FreeRunner_GTA02_Hardware

Posted by mat at Sun Jan 11 11:38:56 2009
First recent glibc (nptl code not linuxthread) use the kernel helper stuff : __kernel_cmpxchg

Second not everybody can use __kernel_cmpxchg, it is not really portable :
- only works on Linux 2.6
- only work with mmu

Third instruction emulation with trap is very slow : the kernel helper is less than 10 instructions, the kernel trap is lot's of more instruction.

Posted by David Holmes at Mon Oct 26 17:46:59 2009
Note that the qt version using q_atomic_lock has additional problems. Many lock-free algorithms use CAS in places but naked stores elsewhere (eg to release a 'lock'), and some such algorithms will fail unless the naked store is converted to acquire the q_atomic_lock as well.

Leave a Comment:

Your Name:


Your E-mail (optional):


Comment:


As a protection against comment spam, please type the following number into the field on the right:
Secret Number Image

Please note that this is neither a support forum nor a bug tracker! Support questions or bug reports posted here will be ignored and not responded to!


It should be obvious but in case it isn't: the opinions reflected here are my own. They are not the views of my employer, or Ronald McDonald, or anyone else.

Please note that I take the liberty to delete any comments posted here that I deem inappropriate, off-topic, or insulting. And I excercise this liberty quite agressively. So yes, if you comment here, I might censor you. If you don't want to be censored your are welcome to comment on your own blog instead.


Lennart Poettering <mzoybt (at) 0pointer (dot) net>
Syndicated on Planet GNOME, Planet Fedora, planet.freedesktop.org, Planet Debian Upstream. feed RSS 0.91, RSS 2.0
Archives: 2005, 2006, 2007, 2008, 2009, 2010

Valid XHTML 1.0 Strict!   Valid CSS!