Ticket spinlocks! Fancy name! What are they, and why should you know about them?
The ticket lock is a synchronisation mechanism introduced by Mellor-Crummey and Scott in Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, and is meant to guarantee fairness in lock acquisition. It achieves this goal by granting requests in FIFO order. As its name suggests, it employs “tickets” for requesters, just as if they were waiting in a queue. On arrival, customers draw a ticket from a ticket dispenser, which hands out tickets with increasing numbers. A screen displays the number served next, and the customer holding the ticket with that number is served next.
In NUMA systems, the usual spinlock implementations like test-and-set provide no fairness guarantees. This means there is a chance of starvation, and thus an inability to provide latency guarantees. If we can provide fairness, then we can bound the time a thread spends waiting for a spinlock, and thus bound the total latency of whatever work it’s doing.
The content of this post is a summary of section 8.1 of Yan Solihin’s Fundamentals of Parallel Multicore Architecture. The book is an excellent overview, but for the meat of this topic you should read the related papers.
Implementation
We’ll sketch a C++ implementation below.
To implement the ticket number currently served and the ticket number handed out to the next arriving thread, we’ll use two std::atomic_size_t variables. The implementation is targeted at x86, and I’ll point out exactly where this makes a difference. Do not consider this to be optimized - I haven’t profiled any of it. We will refer to the particulars as we go along. Let’s begin:
```cpp
struct TicketSpinLock {
    /**
     * Attempt to grab the lock:
     * 1. Get a ticket number
     * 2. Wait for it
     */
    void enter() {
        /* We don't care about a specific ordering with other threads,
         * as long as the increment of the `next_ticket` counter happens
         * atomically. Therefore, std::memory_order_relaxed.
         */
        const auto ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);
        while (now_serving.load(std::memory_order_acquire) != ticket) {
            spin_wait();
        }
    }

    /**
     * Since we're in the critical section, no one but this thread can
     * modify `now_serving`. We just want the update to be atomic.
     * Therefore we can use a simple store instead of `now_serving.fetch_add()`:
     */
    void leave() {
        const auto successor = now_serving.load(std::memory_order_relaxed) + 1;
        now_serving.store(successor, std::memory_order_release);
    }

    /* These are aligned on a cache-line boundary in order to avoid false
     * sharing. CACHELINE_SIZE is assumed to be defined elsewhere
     * (64 bytes on most x86 parts): */
    alignas(CACHELINE_SIZE) std::atomic_size_t now_serving = {0};
    alignas(CACHELINE_SIZE) std::atomic_size_t next_ticket = {0};
};

static_assert(sizeof(TicketSpinLock) == 2 * CACHELINE_SIZE,
              "TicketSpinLock members may not be aligned on a cache-line boundary");
```
As per the acquire/release semantics (for an overview, have a look here), independent threads need to synchronise on the now_serving counter: many customers are standing in front of a desk clerk, looking at the counter, waiting their turn by continuously reading the displayed value. In the meantime, they are free to perform other operations in any order, relaxing the synchronisation required. This way, requester threads don’t pay any cost in shared-memory operations that do not involve the lock.
What happens when the ticket dispenser overflows? With a bit of napkin sketching, we can quickly see that overflow is catastrophic only in the case where the number of threads waiting on the lock is strictly greater than the number of values representable by the counter’s underlying type (otherwise, there is a coding error, and at least one thread is attempting to grab the same lock twice¹). Assume a 3-bit counter, and 8 threads competing for the lock. The condition now_serving != ticket is false for exactly one waiting thread - the next in line. Only if we were to add one more thread could the next_ticket counter reach a value now_serving still has outstanding. This is very easy to see on a piece of paper:
```
    now_serving     next_ticket
    |               |
    V               V
... 0 1 2 3 4 5 6 7 0 1 2 ...
    \_____________/
       8 threads
 holding one ticket each
```
This is a useful observation in that, when memory is at a premium, we can use narrower counters as long as we can guarantee an upper bound on the number of threads. This is not a typical scenario, however - more of an interesting factoid. When would the number of ticket locks in an application be so great as to make this saving significant?
spin wait what?
We haven’t said anything about the spin_wait function. Ideally, a thread attempting to enter the critical section spins for a threshold of attempts, and while it is doing so it can incur some pretty heavy performance penalties. To illustrate this point on an x86 CPU, let’s have a look at the Intel instruction reference, which says:
When executing a “spin-wait loop,” a Pentium 4 or Intel Xeon processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops.
Therefore, that’s exactly what we’ll do inside spin_wait:
```cpp
static inline void spin_wait(void) {
#if (COMPILER == GCC || COMPILER == LLVM)
    /* volatile here prevents the asm block from being moved by the optimiser: */
    asm volatile("pause" ::: "memory");
#elif (COMPILER == MSVC)
    /* Declared in <intrin.h>: */
    _mm_pause();
#endif
}
```
Fantastic. It’s good to note that rep; nop is a synonym for pause, and it appears in some implementations, e.g. Linux. Note that rep here is not a prefix of nop.
Handling contention
As it is, the lock is implemented to ignore contention. This allows the API to be simple, the common case (lack of contention) to be fast, and forces us to handle contention at the level of the data structure. This last point is useful because it allows us to provide a generic lock, decoupled from application or data structure logic.
Since we intend for contention to be rare, however, we can add functionality in the spinlock’s slow path to handle the case of high contention.
A common approach is exponential back-off (see 8.1.3). This is the case with the simple test-and-set lock in the aforementioned paper. This approach kills the ticket spinlock. Observe:
- Let A be the thread with my_ticket == now_serving + 3
- Let B be the thread with my_ticket == now_serving + 2
- Let C be the thread with my_ticket == now_serving + 1
- A must always wait at least as long as B
- B must always wait at least as long as C
Therefore, as delays accumulate, most of the thread execution time is spent waiting for the lock. This is undesirable re: fairness guarantees. The ticket lock provides us with some extra information that the test-and-set lock doesn’t, and it helps us make a decision: the number of processors already waiting on the lock.
The number of threads waiting on the lock can be obtained by simply calculating the difference my_ticket - now_serving. As Mellor-Crummey and Scott observe, we also need an estimate of how long it will take each processor to release the lock. If this time is accurately known, great, but that’s unlikely in practice. A common technique for estimating it is called “proportional back-off”, and it uses the expected average, which is a risky choice: if the processors in line for the lock average less than the expected amount, the waiting processor will delay too long, and thus slow down all threads. The very nature of the mean implies this scenario happens about half the time, which makes it quite undesirable.
The better choice is the minimum time that a processor may hold the lock:
```cpp
void TicketSpinLock::enter() {
    const auto ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);
    while (true) {
        const auto currently_serving = now_serving.load(std::memory_order_acquire);
        if (currently_serving == ticket) {
            break;
        }
        /* Back off proportionally to the number of waiters ahead of us.
         * BACKOFF_MIN is the minimum time a processor may hold the lock,
         * expressed in pause slots: */
        const size_t waiters_ahead = ticket - currently_serving;
        size_t delay_slots = BACKOFF_MIN * waiters_ahead;
        while (delay_slots--) {
            spin_wait();
        }
    }
}
```
In Synchronization without Contention, Mellor-Crummey and Scott further present and explore scalable synchronisation mechanisms, with empirical data on the related performance aspects.
Concluding
What are the tradeoffs of the ticket lock?
- [+] Guaranteed FIFO ordering; no starvation
- [+] Relatively low latency hit
- [+] Low network traffic
- [-] Traffic per lock acquisition is not constant but linear in the number of waiters: every release invalidates each spinner’s cached copy of now_serving
The last point is addressed by MCS locks (see the linked papers), which sound even more exciting! Let’s have a look at them next.
1. There are multiple ways to find such problems with locks. For example, clang provides the Thread Safety Analysis extension. ↩