Sunday, November 21, 2010

Linux threading primitives: the futex

Linux and glibc support POSIX threads. Unsurprisingly, POSIX concepts like mutexs, condition variables, etc. are not baked into the kernel. How then does glibc's implementation of pthreads for Linux work? Well, for mutual exclusion and signaling (that is mutexes, condition variables, semphores, etc.) pthreads on Linux generally uses the "futex", which is an abbreviation for "fast user-space mutex". This is a somewhat deceptive name. The actual futex system call's capabilities bring more to mind a "fast user-space semaphore" but I guess "fuphore" isn't quite as catchy. We'll assume the name is descriptive of usage rather than the mechanism.

The futex syscall itself has 6 parameters.

int futex(int *uaddr, int op, int val, const struct timespec *timeout,
int *uaddr2, int val3);

The first is always the address of the futex. A futex is just an int (this brings to mind, at least to me, NT's keyed events), generally representing the count of the semaphore. The identity of that int (that is to say, not the _value_ of its address, because that can vary across processes) is how the kernel distinguishes futexes. This is a neat trick, avoiding the requirement of a global namespace (such as with named events, mutexes etc. in Windows), while allowing for it by leveraging existing mechanisms (mmap, SysV shared memory, etc.). Of course, by typing this to the virtual address space of the process, this can't serve as cluster wide mechanism (mmaping a file shared over NFS is not going to be sufficient for example).

The fourth parameter to the call is an optional timeout, which applies when using some operations. The third, fifth and sixth parameters meaning vary based on the operation.

The operation is of course the second parameter. I'm only going to talk about two of the most basic operations, FUTEX_WAIT and FUTEX_WAKE.

For FUTEX_WAIT, the val parameter is the current value of *uaddr. When you call futex with the wait operation, it checks atomically that *uaddr == val. If not, you return from futex right away. Otherwise, you FUTEX_WAIT until someone FUTEX_WAKEs you up.

FUTEX_WAKE wakes val processes (for the purposes of this discussion thread and process are synonymous) waiting on the futex uaddr.

These two operations, in conjunction with atomic operations on int values give us everything we need to implement a semaphore. We can then use that semaphore to implement various other synchronization primitives (a mutex can be modelled as a binary semaphore for example).

The "fast" and the "user-space" parts both refer to the idea that you don't actually call the system call save in the case of contention. Rather you platform specific code to increment and decrement *uaddr. If there is no contention, we need never go to the kernel to arbitrate.

To demonstrate this, I'm going to show a small library for working with futexes. Note that I'm not using them entirely correctly, which you will discover if you check out the "further reading" section.

First, we'll create an API for dealing with semaphores implemented with futexes. it is written in C, and I have named the header simple_futex.h

#ifdef __cplusplus
extern "C" {
#endif
struct simplefu_semaphore {
int avail;
int waiters;
};

typedef struct simplefu_semaphore *simplefu;

void simplefu_down(simplefu who);
void simplefu_up(simplefu who);


#ifdef __cplusplus
}
#endif


Our sempahore consists of two operations, simplefu_down and simplefu_up, which perform as one expects a semaphore to. The avail member of the struct represents the count of the semaphore, the waiters member is used for tracking if we need to issue a wake operation.

Here, in simple_futex.c is the implementation:






#include "simple_futex.h"
#include <linux/futex.h>
#include <sys/syscall.h>

void simplefu_down(simplefu who)
{
int val;
do {
val = who->avail;
if( val > 0 && __sync_bool_compare_and_swap(&who->avail, val, val - 1) )
return;
__sync_fetch_and_add(&who->waiters, 1);
syscall(__NR_futex, &who->avail, FUTEX_WAIT, val, NULL, 0, 0);
__sync_fetch_and_sub(&who->waiters, 1);
} while(1);
}

void simplefu_up(simplefu who)
{
int nval = __sync_add_and_fetch(&who->avail, 1);
if( who->waiters > 0 )
{
syscall(__NR_futex, &who->avail, FUTEX_WAKE, nval, NULL, 0, 0);
}
}
A few things. This code is gcc specific, I'm using its atomic built-ins. I am using an atomic compare and swap to decrement the count only if doing so would not reduce the count to below zero. I am using an additional count to track if the semaphore ahs been contended. The futex documentation says you should use atomic increments/decrements and treat -1 as being contended. Since this was for my own edification I decided to use a mechanism that I felt was more clear. I would not attempt to use this semaphore in the real world.

The basics of the down operation are "Try to atomically decrement the avail member, if that is not possible, note that we are going to wait by incrementing waiters and then wait on the futex. Then decrement waiters and try again." The basics of the up operation are "increment the available count. Additionally if there are waiters, wake as many of them that could proceed given the current count.".

Now that we have a semaphore implementation we can endeavorer to come up with a use for it.

In this program, we start a child process and wait for it to perform some initialization before using it further. In this example, the initialization is represented by a sleep. main.c:

#include <stdio.h>

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <assert.h>

#include "simple_futex.h"

int main()
{
int lockfile = open("ipc_lock", O_RDWR);
assert(lockfile != -1);
simplefu sema = mmap(NULL, sizeof(*sema), PROT_READ|PROT_WRITE,
MAP_SHARED, lockfile, 0);
assert(sema != MAP_FAILED);

int pid = fork();
assert(pid != -1);
if( pid == 0 ) { // child
printf("Initializing...\n");
sleep(10);
printf("Done initializing\n");
simplefu_up(sema);
exit(0);
}
printf("Waiting for child...\n");
simplefu_down(sema);
printf("Child done initializing\n");

return 0;
}

ipc_lock is simply an 8 byte file (the platform specific size of *sema), filled with zeros. You can make your own with dd if=/dev/zero of=ipc_lock bs=8 count=1.

If we run this guy we get:
% ./a.out
Waiting for child...
Initializing...
Done initializing
Child done initializing

Now, coming back to the "fast" and "user-space" part of this, if there's no contention, there's no need to make a syscall. We can demonstrate this by using a slightly different program, main2.c The only difference between main2.c and min.c is as follows:







% diff -du main.c main2.c
--- main.c 2010-11-21 16:57:50.000000000 +0000
+++ main2.c 2010-11-21 20:12:24.000000000 +0000
@@ -25,6 +25,7 @@
simplefu_up(sema);
exit(0);
}
+ sleep(15);
printf("Waiting for child...\n");
simplefu_down(sema);
printf("Child done initializing\n");

That is, the additional of a sleep(15) call before we start waiting for the child. This represents some other, real work that the other process may be doing at this time instead of waiting for its child to be ready. First, let's confirm that futex is called in the first program:





% strace -o traces ./a.out && grep futex traces
Initializing...
Waiting for child...
Done initializing
Child done initializing
futex(0xb7861000, FUTEX_WAIT, 0, NULL) = 0
Now, let's compare to the second program, where we're off busy doing something instead of immediately waiting for the child:





% strace -o traces ./a.out && grep futex traces
Initializing...
Done initializing
Waiting for child...
Child done initializing
As you can see, we did not have occasion to actually go to kernel land for this version.

If you are interested in futexes you find out more in the man pages, http://www.kernel.org/doc/man-pages/online/pages/man2/futex.2.html and http://www.kernel.org/doc/man-pages/online/pages/man7/futex.7.html as well as the original paper http://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf

2 comments:

Drawknob said...
This comment has been removed by the author.
Drawknob said...

Any idea why glibc only wakes one instead of the value of the counter in its semaphore implementation?

http://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/sysdeps/unix/sysv/linux/sem_post.c