The MCS lock

2017 Sep 26

In a previous post, we saw what ticket locks are, and how they are a useful building block for systems with latency and/or fairness guarantees. We saw that the ticket lock guarantees FIFO ordering of incoming threads and prevents starvation, with a relatively low latency impact. However, the traffic generated per lock acquisition grows linearly with the number of threads competing for the lock. In this post, we will see how the MCS lock addresses this last point.

The problem

Let’s start by observing the problem. Here’s the code attempting to grab a ticket lock, as we saw previously:

void enter() {
    const auto ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);

    while (now_serving.load(std::memory_order_acquire) != ticket) {
        spin_wait();
    }
}

Suppose multiple CPUs are running this code, all attempting to grab the lock. What happens here? Each CPU spins on the same variable, the now_serving counter. The CPU holding the lock eventually releases it and increments now_serving. That write invalidates the cache line in every other CPU's cache, causing each of them to issue a new read for the counter's cache line. In the worst case, where each request for the cache line is serviced sequentially, the time to service these reads is linear in the number of waiting CPUs.
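For reference, the release side looks roughly like this (a sketch in the spirit of the previous post, not necessarily identical to it); the single store below is what triggers the invalidation storm:

void leave() {
    /* Only the lock holder writes now_serving, so a relaxed load suffices here.
     * Publishing the next ticket invalidates the now_serving cache line in
     * every waiting CPU's cache: */
    const auto next = now_serving.load(std::memory_order_relaxed) + 1;
    now_serving.store(next, std::memory_order_release);
}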

We observed this problem with ticket locks previously, even with a specific back-off strategy. Can we improve the time it takes to acquire the lock?

The solution

It turns out that yes, we can make the time to acquire the lock constant in the number of waiting CPUs. The idea is to let each CPU spin on a local variable, i.e. one not shared with the other CPUs.

Let there be a queue, which contains a node for each CPU waiting on the lock. Every CPU that wants to wait on the lock allocates a queue node, containing a link (to the next node in the queue) and a boolean flag:

struct mcs_lock {
    struct mcs_node {
        struct mcs_node *next;  /* successor in the wait queue */
        bool locked;            /* true while this CPU must keep spinning */
    };

    /* Tail of the wait queue; nullptr when the lock is free: */
    std::atomic<struct mcs_node *> tail;

    /* Each thread brings its own queue node: */
    thread_local static struct mcs_node qnode;
};

Of course, we can't forgo synchronisation altogether. Spinning on a local variable implies the hand-off is synchronised via some other means, and that means is the queue itself. A CPU attempting to obtain the lock does the following:

static inline mcs_lock::enter() {
    /* Make sure our node carries no stale link from a previous acquisition: */
    qnode.next = nullptr;

    /* Atomically place ourselves at the end of the queue: */
    const auto predecessor = tail.exchange(&this.qnode);

    /**
     * If tail was nullptr, predecessor is nullptr, thus nobody has been waiting,
     * and we've acquired the lock.
     * Otherwise, we need to link ourselves into the queue, and spin:
     */
    if (predecessor != nullptr) {
        /**
         * The lock is taken, and there are two cases:
         * 1. Nobody else is waiting, and our node is now the whole queue (tail
         *    points to it; more on this later)
         * 2. One or more CPUs are already waiting, and our node has just become
         *    the new tail of the queue
         * Either way, we mark the lock as taken:
         */
        qnode.locked = true;

        /* Link ourselves behind our predecessor, the old tail of the queue: */
        predecessor->next = &this.qnode;

        /* Now we can spin on a local variable: */
        while (qnode.locked) {
            spin_wait();
        }
    }
}

It is pretty simple. FIFO ordering is guaranteed by the queue, and each CPU can spin on a variable that is thread-local. If the queue nodes lived in shared memory, we could, again, align them on a cache line boundary to prevent false sharing, like we did with the ticket lock; see the sketch below. We just have to be mindful of these limitations.
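For instance, a minimal sketch of such a padded node, assuming 64-byte cache lines (std::hardware_destructive_interference_size could be used instead where available):

/* Sketch: align each queue node to a cache line so that two nodes never
 * share one, assuming 64-byte cache lines: */
struct alignas(64) mcs_node {
    struct mcs_node *next;  /* successor in the wait queue */
    bool locked;            /* true while this CPU must keep spinning */
};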

The release of the lock is similarly simple. The only complication arises when we can't atomically change the queue, which means another CPU is racing against us, trying to acquire the lock. In that case we let that CPU obtain the lock; we need only spin, again, on our own queue node until it has linked itself in:

static inline mcs_lock::leave() {
    /* We are holding the lock, therefore qnode.next, if set, is our successor: */
    if (qnode.next == nullptr) {
        /**
         * Nobody appears to be waiting. If tail still points to us, swing it
         * back to nullptr and we are done; no CPUs were waiting for the lock:
         */
        if (tail.compare_exchange_strong(&this.qnode, nullptr)) {
            return;
        }

        /**
         * We could not reset tail, therefore qnode.next is out of sync with tail:
         * another CPU is in the middle of `enter`, after the exchange but prior to
         * linking themselves into the queue. We wait for that to happen:
         */
        while (qnode.next == nullptr) {
            spin_wait();
        }
    }

    /* The successor has linked themselves; all we need to do is wake it up as the next-in-line: */
    qnode.next->locked = false;
}

See here for a step-by-step illustration of how enter() and leave() (acquire and release) work.
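To make the interface concrete, here is a hypothetical caller (a sketch only; the lock object, counter, and increment are illustrative names, not part of the implementation above):

mcs_lock lock;
unsigned long counter = 0;

void increment() {
    lock.enter();   /* enqueue ourselves and spin until we own the lock */
    ++counter;      /* critical section */
    lock.leave();   /* hand the lock over to our successor, if any */
}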

Fairness

It is pretty straightforward to prove that the lock is granted to CPUs in FIFO order. Note that this strict hand-off has a cost: if the next in line has been preempted, waking it up means the scheduler has to be invoked, the thread woken up and scheduled, and so forth, while everyone else keeps waiting behind it. In other words, these locks trade off performance under contention for fairness. This is a useful trade-off to keep in mind when implementing, measuring, and even devising locks.

Performance

Try to measure the relative performance of an MCS lock and a ticket lock in various scenarios and on various machines. You may find that the bus latency we noted ticket locks suffer from isn't a problem everywhere, so there are cases where ticket locks are indeed more performant than the above. Always measure!
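A minimal harness for such a measurement might look like the following; it assumes a lock type exposing enter() and leave(), and the thread and iteration counts are arbitrary:

#include <chrono>
#include <thread>
#include <vector>

/* Time how long it takes `nthreads` threads to each take the lock `iters` times: */
template <typename Lock>
double contend(unsigned nthreads, unsigned iters) {
    Lock lock;
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; ++t) {
        threads.emplace_back([&lock, iters] {
            for (unsigned i = 0; i < iters; ++i) {
                lock.enter();
                /* Empty critical section: we only measure the lock hand-off. */
                lock.leave();
            }
        });
    }
    for (auto &thread : threads) {
        thread.join();
    }

    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}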

The interface

You may have noticed the above code excerpts aren't actually valid C++. Additionally, if the interface of the implementation above leaves you wanting, you are not alone. The implementation of an MCS lock is complicated by the fact that each thread requires shared access to other threads' queue nodes in order to wake them up.

Luckily, we don't have to worry about this very much. MCS locks are nowadays mostly a teaching tool, and have largely been superseded by:

  • CLH locks: Craig, Landin, and Hagersten locks replace the explicit queue with a logical one
  • K42 locks: On-stack information is used instead of keeping a thread-local queue node around
  • A similar idea is used by the stack-lock algorithm

In the next few posts, we will discuss those locks.
