Mechanical Sympathy: Lock-Based vs Lock-Free Concurrent Algorithms

Monday 26 August 2013

Lock-Based vs Lock-Free Concurrent Algorithms

Last week I attended a review session of the new JSR166 StampedLock run by Heinz Kabutz at the excellent JCrete unconference. StampedLock is an attempt to address the contention issues that arise in a system when multiple readers concurrently access shared state. StampedLock is designed to perform better than ReentrantReadWriteLock by taking an optimistic read approach.

While attending the session a couple of things occurred to me. Firstly, I thought it was about time I reviewed the current status of Java lock implementations. Secondly, although StampedLock looks like a good addition to the JDK, it seems to miss the fact that lock-free algorithms are often a better solution to the multiple reader case.

Test Case

To compare implementations I needed an API test case that would not favour a particular approach. For example, the API should be garbage free and allow the methods to be atomic. A simple test case is to design a spaceship that can be moved around a 2-dimensional space with the coordinates of its position available to be read atomically. At least 2 fields need to be read, or written, per transaction to make the concurrency interesting.

/**
 * Interface to a concurrent representation of a ship that can move around
 * a 2 dimensional space with updates and reads performed concurrently.
 */
public interface Spaceship
{
    /**
     * Read the position of the spaceship into the array of coordinates provided.
     *
     * @param coordinates into which the x and y coordinates should be read.
     * @return the number of attempts made to read the current state.
     */
    int readPosition(final int[] coordinates);

    /**
     * Move the position of the spaceship by a delta to the x and y coordinates.
     *
     * @param xDelta delta by which the spaceship should be moved in the x-axis.
     * @param yDelta delta by which the spaceship should be moved in the y-axis.
     * @return the number of attempts made to write the new coordinates.
     */
    int move(final int xDelta, final int yDelta);
}

The above API would be cleaner if it factored out an immutable Position object, but I want to keep it garbage free and create the need to update multiple internal fields with minimal indirection. This API could easily be extended to a 3-dimensional space while still requiring the implementations to be atomic.

Multiple implementations of the spaceship are built and exercised by a test harness. All the code and results for this blog can be found here.
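As a point of reference, the simplest lock-based implementation guards both fields with the object's intrinsic lock. The following is only a minimal sketch of that idea, not necessarily the code in the repository: the class name SynchronizedSpaceship is an assumption, and the interface is repeated without its Javadoc so the snippet compiles on its own.

```java
/** Same contract as the Spaceship interface above, repeated so the sketch is self-contained. */
interface Spaceship
{
    int readPosition(final int[] coordinates);

    int move(final int xDelta, final int yDelta);
}

/** Hypothetical baseline: x and y are only ever touched while holding the monitor. */
class SynchronizedSpaceship implements Spaceship
{
    private int x;
    private int y;

    @Override
    public synchronized int readPosition(final int[] coordinates)
    {
        coordinates[0] = x;
        coordinates[1] = y;

        return 1; // a blocking lock never retries, so this is always a single attempt
    }

    @Override
    public synchronized int move(final int xDelta, final int yDelta)
    {
        x += xDelta;
        y += yDelta;

        return 1;
    }
}
```

Because readers and writers share one monitor, every call is mutually exclusive, which is why the return value is always 1.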

The test harness will run each of the implementations in turn by using a megamorphic dispatch pattern to try and prevent inlining, lock-coarsening, and loop unrolling when accessing the concurrent methods.

Each implementation is subjected to 4 distinct threading scenarios that result in different contention profiles:

1 reader - 1 writer
2 readers - 1 writer
3 readers - 1 writer
2 readers - 2 writers

All tests are run with 64-bit Java 1.7.0_25, Linux 3.6.30, and a quad core 2.2GHz Ivy Bridge i7-3632QM. Throughput is measured over 5 second periods for each implementation with the tests repeated 5 times to ensure sufficient warm up. The results below are throughputs averaged per second over 5 runs. To approximate a typical Java deployment, no thread affinity or core isolation has been employed which would have reduced variance significantly.

Note: Other CPUs and operating systems can produce very different results.

Results

Figure 1.

Figure 2.

Figure 3.

Figure 4.

The raw data for the above charts can be found here.

Analysis

The real surprise for me from the results is the performance of ReentrantReadWriteLock. I cannot see a use for this implementation beyond a case where reads hugely outnumber writes. My main takeaways are:

StampedLock is a major improvement over existing lock implementations especially with increasing numbers of reader threads.
StampedLock has a complex API. It is very easy to mistakenly call the wrong method for locking actions.
Synchronised is a good general purpose lock implementation when contention is from only 2 threads.
ReentrantLock is a good general purpose lock implementation when thread counts grow, as previously discovered.
Choosing to use ReentrantReadWriteLock should be based on careful and appropriate measurement. As with all major decisions, measure and make decisions based on data.
Lock-free implementations can offer significant throughput advantages over lock-based algorithms.
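
To make the lock-free takeaway concrete, here is a minimal sketch of one common lock-free approach: publish an immutable position through an AtomicReference and retry a compare-and-set on every move. The class names AtomicReferenceSpaceship and Position are assumptions for illustration, and the interface is repeated so the snippet stands alone. Note that this variant allocates a new object on each move, so it is lock-free but not garbage-free.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Same contract as the Spaceship interface in the post, repeated so the sketch is self-contained. */
interface Spaceship
{
    int readPosition(final int[] coordinates);

    int move(final int xDelta, final int yDelta);
}

/** Hypothetical lock-free variant: the whole position is swapped atomically as one reference. */
class AtomicReferenceSpaceship implements Spaceship
{
    /** Immutable snapshot of both coordinates, so a reader can never see a half-applied move. */
    private static final class Position
    {
        final int x;
        final int y;

        Position(final int x, final int y)
        {
            this.x = x;
            this.y = y;
        }
    }

    private final AtomicReference<Position> position = new AtomicReference<>(new Position(0, 0));

    @Override
    public int readPosition(final int[] coordinates)
    {
        final Position current = position.get(); // single volatile read of a consistent snapshot
        coordinates[0] = current.x;
        coordinates[1] = current.y;

        return 1;
    }

    @Override
    public int move(final int xDelta, final int yDelta)
    {
        int tries = 0;
        Position current;
        Position next;
        do
        {
            ++tries;
            current = position.get();
            next = new Position(current.x + xDelta, current.y + yDelta); // allocation per attempt
        }
        while (!position.compareAndSet(current, next)); // retry if another writer won the race

        return tries;
    }
}
```

Readers never retry in this design; only writers loop, and only when another writer has replaced the reference in between.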

Conclusion

It is nice seeing the influence of lock-free techniques appearing in lock-based algorithms. The optimistic strategy employed on read is effectively a lock-free algorithm at the times when a writer is not updating.
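
The optimistic read can be sketched as follows. This is an illustrative sketch only, assuming the StampedLock API as it shipped in java.util.concurrent.locks with Java 8; the class name StampedLockSpaceship is an assumption, and the fallback to a full read lock after one failed optimistic attempt is one possible policy rather than the one necessarily used in the tests.

```java
import java.util.concurrent.locks.StampedLock;

/** Same contract as the Spaceship interface in the post, repeated so the sketch is self-contained. */
interface Spaceship
{
    int readPosition(final int[] coordinates);

    int move(final int xDelta, final int yDelta);
}

/** Hypothetical StampedLock variant: try an optimistic read, then fall back to a read lock. */
class StampedLockSpaceship implements Spaceship
{
    private final StampedLock lock = new StampedLock();
    private int x;
    private int y;

    @Override
    public int readPosition(final int[] coordinates)
    {
        int tries = 1;
        long stamp = lock.tryOptimisticRead(); // returns 0 if a writer currently holds the lock
        coordinates[0] = x;
        coordinates[1] = y;

        if (!lock.validate(stamp)) // a write intervened, so the values read may be inconsistent
        {
            ++tries;
            stamp = lock.readLock(); // pessimistic fallback blocks until no writer is active
            try
            {
                coordinates[0] = x;
                coordinates[1] = y;
            }
            finally
            {
                lock.unlockRead(stamp);
            }
        }

        return tries;
    }

    @Override
    public int move(final int xDelta, final int yDelta)
    {
        final long stamp = lock.writeLock();
        try
        {
            x += xDelta;
            y += yDelta;
        }
        finally
        {
            lock.unlockWrite(stamp);
        }

        return 1;
    }
}
```

When no writer is active, the read path touches no lock state beyond reading and validating the stamp, which is the sense in which it behaves like a lock-free read.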

In my experience of teaching and developing lock-free algorithms, not only do they provide significant throughput advantages as evidenced here, they also provide much lower latency with less variance.

50 comments:

Unknown, 27 August 2013 at 08:42
Hi Martin
Thanks for sharing this excellent benchmark. After taking a peek at the code, I have a couple of questions for you: if I am not mistaken, the lock-free implementation relies on an (internal) immutable position class. Don't you think it introduces some (favourable) bias? Likewise, it should introduce some garbage due to a new allocation for each movement.
What do you think, and where am I wrong?

Unknown, 27 August 2013 at 15:00
It would be interesting to have the number of retries per operation.

PeLe, 27 August 2013 at 19:57
Hi Martin,

LockFreeSpaceship beats all your other spaceships, but LockAndGcFreeSpaceship is even faster:

import java.util.concurrent.atomic.AtomicLong;

public class LockAndGcFreeSpaceship implements Spaceship {

    // x is packed into the low 32 bits and y into the high 32 bits of a single long
    private final AtomicLong position = new AtomicLong();

    @Override
    public int move(int xDelta, int yDelta) {
        int tries = 0;

        long pos, newPos;
        do {
            ++tries;
            pos = position.get();
            // unpack both coordinates, apply the deltas, then repack
            int x = (int) (pos & 0xFFFFFFFFL) + xDelta;
            int y = (int) (pos >>> 32) + yDelta;
            newPos = (((long) x) & 0xFFFFFFFFL) | (((long) y) << 32);
        } while (!position.compareAndSet(pos, newPos)); // retry if another writer got in first

        return tries;
    }

    @Override
    public int readPosition(int[] coordinates) {
        // one atomic read yields both coordinates, so no retry is ever needed
        long pos = position.get();
        coordinates[0] = (int) (pos & 0xFFFFFFFFL);
        coordinates[1] = (int) (pos >>> 32);
        return 1;
    }
}

PeLe

ymo, 27 August 2013 at 19:58
Hi Martin.

The test requirement is so simple I wonder if the lock-free implementation could be done with Unsafe. You would have a set of already available X and Y positions and basically in each move you would (try to) increment the sequence within that array. Kinda like disruptor (wink wink). The main benefits of this approach would be to show the *effects* of false sharing from multiple threads in comparison with the AtomicReference implementation as opposed to Unsafe. The introduction of garbage would also be minimal.

Would this be a good scenario ?

Regards

alex212, 27 August 2013 at 22:59
I've recently discovered your blog and I just want to say thank you for sharing your deep insight into concurrency. While reading this post I remembered that you wrote about a lock-free Executor you implemented a while ago, but you never actually explained how it was done. Wouldn't it be a great example of a lock-free structure?

Nick, 28 August 2013 at 03:06
I changed input parameters for your tests and results are not quite the same. My thinking is that 1 or 2 threads are hardly providing enough contention for the hardware you have.

I was running your tests with 200 readers and 200 writers. I also changed TEST_DURATION_MS to 50 seconds.
What I noticed is (however these results may only be specific to my old laptop):
1. The Lock Free test is running ~15-20% longer than the other tests and this may contribute to the "exceptional" performance of the Lock Free implementation.
I saw that the Lock Free test is actually running for 60-65 seconds instead of the allotted 50.
2. Factoring in #1, I can say that with 200 readers/200 writers the read performance of the StampedLock is on par with the Lock Free implementation, if not better.

I am curious if you would see similar results if you change these input parameters and maybe even plot new results on the new set of charts.

Thanks
-Nick

Unknown, 28 August 2013 at 11:51
Great article, it's nice to see the code for all of the synchronization mechanisms in one place. It would be nice if you added javadoc to your Spaceship interface. It took me a while to figure out that the int being returned was the number of attempts.

Cd-MaN, 29 August 2013 at 07:37
Hi!

Thanks for the blog post and all the others you've written in time. They are very informative.

Unfortunately you just tripped my "why the hell do we call them algorithms lock free?" wire. Yes, I realize that part of the answer is "because it is the convention", but I still need to rant, sorry.

There is no such thing as "free lunch" and these algorithms are not "lock free". It's just that we moved the locks to a different level. What I mean:

- Classic locks (synch / explicit locks): they call the kernel which either obtains a lock or puts the thread in a wait state. The funny thing is that the kernel level locks actually use the atomic CAS operations provided by the CPU to implement locks :-)

- "Lock free" / optimistic concurrency: use the same CAS operations without entering the kernel.

Going even further down to the level of the CPU hardware, CAS operations do generate locks (that's what the LOCK prefix in LOCK CMPXCHG stands for :-)). In the first multicore x86 implementations there was an explicit LOCK signal which was raised, now we have more sophisticated mechanisms through the MESI protocol, but there are still locks in the sense of "some electrical signal which is asserted by only one party at a time and prevents others from doing an action".

Trying to sum up: optimistic concurrency is a better way than "lock free". There are always locks when we need to establish a causal chain of events (A happened before B). Synchronized uses the thread scheduler (which in turn uses CAS) to wait in an efficient manner (as in: don't burn CPU). Using just CAS might give you lower latency (unless you have 200 threads on 4 cores :-)).

As always, it depends :-).

Yogesh, 13 September 2013 at 05:35
Hi Martin,
Thanks for posting such a nice topic. I have one question: is it possible to apply a lock-free algorithm to operations that are inherently sequential, e.g. generating sequence numbers in Java? If yes, can you give hints on how it can be done?

Travis, 6 November 2013 at 05:59
Martin - thanks for another great and in-depth article, as usual.

For what it's worth, RRWL has a poor showing here because the benchmark is pretty much a worst case scenario for that lock, even under heavy read loads, because the critical section in your benchmark is so short (nanoseconds).

Internally, both the RRWL lock() and unlock() operations are themselves getting a lock (a Sync object), manipulating the internal state of the lock (e.g., incrementing the number of readers, or setting the lock to "writer in progress"), then unlocking the internal state. So in fact a lock()/unlock() pair for RRWL involves four atomic operations and two (hidden) exclusive critical sections, while a normal synchronized block or ReentrantLock involves two and one respectively - with the difference that the exclusive critical section extends from the lock() to the unlock(), while in the RRWL case the exclusive sections are very short and only exist inside the lock() and unlock() implementations.

When the section of code protected by the lock is reasonably long (say 1 us or longer), the benefit of allowing multiple readers into the protected code is great and will favor RRWL. However, when the protected section takes negligible time, as in the benchmark, the internal locks obtained by the RRWL will dominate the execution time and greatly increase contention, making the RRWL perform very poorly. In effect, it will generally behave at least twice as poorly as a simple lock (the writer preference built in to the lock adds another twist and may result in additional context switching or lock convoys).

I think if the experiment was repeated with a longer critical section (e.g., if the protected data was a 10k element array), then RRWL would prove more useful.

It is possible to write a "lock free" RW lock, where "lock free" means that internally the lock() and unlock() methods don't take any locks while manipulating state - they just CAS some shared state. This means that with readers only flowing in and out of the lock, there are no exclusive sections at all, and the only cost is two atomic operations per reader (one in, one out), plus the occasional retry. Implementing re-entrancy is possible and doesn't change the core performance, but requires some more book keeping.

Final bit of trivia: the way RRWL works, as described above, is responsible for how they show up (and fail to show) in jstack and other tools or APIs that report held/waiting locks. These tools won't show that any thread is inside a RRWL (holds the lock) as either a reader or a writer, because there is no JVM-understood lock held at that point (which are object monitors or things that extend AQS or things like that). The tools will show threads that are *waiting* to enter the lock (and also including, if the timing is exactly right, threads that aren't waiting and won't wait, but are in the middle of the lock() or unlock() methods) because at that point they are blocking on the AQS Sync object internal to the lock. So for RRWL you'll always see zero or more threads waiting for the lock, but even if some are waiting, you'll almost never see any thread owning the Sync object, unless you catch them at the exact moment they are passing through the critical section of the lock() or unlock() methods - which might not even be possible for jstack if there is no safepoint in that tiny critical section.

ReentrantLock has no such issue since its Sync object (derived from AQS) is held for the duration of the lock, so it shows up properly in jstack.

Marvin Hansen, 23 January 2014 at 13:55
Thank you for sharing this well written article.

While reading, I was just wondering
how Memory Channel Storage (MCS) Architecture[1]
will affect concurrent programming in a couple of years?

If you have a near constant storage latency of ~ 5 to 10ms,
how would you program your limited CPU cache then?

MCS still is in its infancy but considering how it already simplifies system design,
there is a high chance that it might take off within 2 or 3 years. One generation further,
MCS over DDR4 may reach 2 or 3 GB/s read/write speed, which eventually minimizes the storage performance penalty.

Getting the most out of such a "flat" memory/storage hierarchy could need a different thinking about concurrency. Assuming a very low cache miss rate and fast fetching from MCS-storage, what would you pay the most attention to?

Thread contention?

What would you say?

Thank you
[1]
http://electronicdesign.com/memory/memory-channel-storage-puts-ssd-next-cpu
http://www.diablo-technologies.com/files/AMPSMCSInfoSheet-HQ%20ReExport.pdf
http://www.storagesearch.com/ibm-jim-jan2014.html

Unknown, 11 February 2014 at 07:44
I just ran the test code, which shows different results from yours.
Environment: Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz - Linux 3.8.0 - Java 1.7.0_21 x64

3 readers - 3 writers

Implementation   Read           Write
StampedLock      327,258,291    15,279,037
LockFree          47,777,941    12,028,039

bleble222, 6 March 2014 at 13:34
Really interesting talk on QCon!
But I found your statement about static helpers a bit controversial. Don’t you think that behaviour should be as close as possible to the data it operates on? Static helpers make that harder to achieve; they seem a bit procedural to me.
Sure, it all depends on the use case. I think that there are situations when it is very appropriate to use static function helpers – like matchers and assertions in unit tests.

Yi DENG, 19 March 2014 at 00:19
In the implementation of LockFree version, a new object is created for each write. But efficiency doesn't count the GC time. So, kinda unfair.

Nikolay Tsankov, 25 March 2014 at 15:38
Hi Martin,

Checking the code on GitHub, I saw you had an Unsafe spaceship implementation that you later removed...
I was wondering what the performance results for it were and why it got removed.

Cheers,
Nikolay

Erh-Wen,Kuo, 13 September 2015 at 12:55
Another version, AtomicBufferSpaceship, which uses the Agrona library (UnsafeBuffer) to implement "Spaceship". The performance figures are very similar to "LockAndGcFreeSpaceship".

import java.nio.ByteBuffer;

import uk.co.real_logic.agrona.concurrent.AtomicBuffer;
import uk.co.real_logic.agrona.concurrent.UnsafeBuffer;

public class AtomicBufferSpaceship implements Spaceship {
    // x is packed into the low 32 bits and y into the high 32 bits of the long at offset 0
    private final AtomicBuffer buffer = new UnsafeBuffer(ByteBuffer.allocateDirect(8));

    @Override
    public int readPosition(int[] coordinates) {
        // plain (non-volatile) read; AtomicBuffer also offers getLongVolatile(index)
        // if the same visibility as AtomicLong.get() is required
        long pos = buffer.getLong(0);
        coordinates[0] = (int) (pos & 0xFFFFFFFFL);
        coordinates[1] = (int) (pos >>> 32);
        return 1;
    }

    @Override
    public int move(int xDelta, int yDelta) {
        int tries = 0;
        long pos, newPos;
        do {
            tries++;
            pos = buffer.getLong(0);
            int x = (int) (pos & 0xFFFFFFFFL) + xDelta;
            int y = (int) (pos >>> 32) + yDelta;
            newPos = (((long) x) & 0xFFFFFFFFL) | (((long) y) << 32);
        } while (!buffer.compareAndSetLong(0, pos, newPos)); // retry if another writer got in first
        return tries;
    }
}

Unknown, 10 June 2018 at 12:30
Could you clarify what you mean with “word packing“ and “cycling memory structures”?

Does word packing refer to the technique used in the garbage and lock-free implementation in PeLe’s comment, which packs two integers into one AtomicLong? Are there approaches to make this work for more than two ints and without using Unsafe?

With cycling memory structures, do you mean object pooling? What I mean by that is manually managing the life cycle of the Points object. Instead of creating a new object, get it from a pool of objects, and instead of making it eligible for GC, return it to the pool. This in itself may be implemented with a data structure requiring locks, which would defeat the whole purpose of avoiding locks. However, an object pool can also be implemented on top of a lock-free data structure such as a ring buffer.