You can find the Disruptor project here, plus we have a wiki with links to detailed blogs describing how things work.
Naming & API
Over the lifetime of the Disruptor, naming has been a challenge. The funny thing is that with the 2.0 release we have come almost full circle. Originally we considered the Disruptor an event processing framework that often got used as a queue replacement. To make it understandable to queue users we adopted the nomenclature of producers and consumers. However, the consumers are not true consumers. With this release the consensus is to return to the event processing roots and adopt the following naming changes.
Producer -> Publisher
Events are claimed in strict sequence and published to the RingBuffer.
Entry -> Event
Events represent the currency of data exchange through the dependency graph of EventProcessors.
Consumer -> EventProcessor
Events are processed by EventProcessors. The processing of an event can be read only, but can also involve mutations on which other EventProcessors depend.
ConsumerBarrier -> DependencyBarrier
Complex graphs of dependent EventProcessors can be constructed for the processing of an Event. The DependencyBarriers are assembled to represent the dependency graph. This topic is the real value of the Disruptor and is often misunderstood. A fun example can be seen playing FizzBuzz in our performance tests.
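To make the idea of a dependency graph concrete, here is a minimal toy sketch of a FizzBuzz-style diamond: two processors depend only on the publisher, and a final stage depends on both of them. This is not the Disruptor API. The names `FizzBuzzGraphSketch`, `Handler`, `awaitAll` and `startProcessor` are hypothetical, and the array is sized so it never wraps, which a real ring buffer would have to handle.

```java
import java.util.concurrent.atomic.AtomicLong;

final class FizzBuzzGraphSketch {
    /** Preallocated event; each field is written by exactly one stage. */
    static final class Event {
        long value;    // written by the publisher
        boolean fizz;  // written by the fizz processor
        boolean buzz;  // written by the buzz processor
    }

    /** Hypothetical callback type, standing in for an event handler. */
    interface Handler {
        void onEvent(Event event);
    }

    /** Toy dependency barrier: wait until every upstream sequence has reached 'sequence'. */
    static void awaitAll(long sequence, AtomicLong... upstream) {
        for (AtomicLong s : upstream) {
            while (s.get() < sequence) {
                Thread.yield();
            }
        }
    }

    /** Starts a thread that processes every slot in order, gated on one upstream sequence. */
    static void startProcessor(Event[] ring, AtomicLong upstream, AtomicLong own, Handler handler) {
        new Thread(() -> {
            for (int seq = 0; seq < ring.length; seq++) {
                awaitAll(seq, upstream);     // do not run ahead of the stage we depend on
                handler.onEvent(ring[seq]);  // mutate the event in place
                own.set(seq);                // signal downstream stages that this slot is done
            }
        }).start();
    }

    /** Publishes 1..count and returns how many values are multiples of both 3 and 5. */
    static int countFizzBuzz(int count) {
        Event[] ring = new Event[count];     // assumption: large enough that it never wraps
        for (int i = 0; i < count; i++) {
            ring[i] = new Event();
        }
        AtomicLong cursor = new AtomicLong(-1);
        AtomicLong fizzSeq = new AtomicLong(-1);
        AtomicLong buzzSeq = new AtomicLong(-1);

        // Both processors depend only on the publisher's cursor.
        startProcessor(ring, cursor, fizzSeq, e -> e.fizz = (e.value % 3 == 0));
        startProcessor(ring, cursor, buzzSeq, e -> e.buzz = (e.value % 5 == 0));

        // Publisher: fill the slot, then advance the cursor.
        for (int seq = 0; seq < count; seq++) {
            ring[seq].value = seq + 1;
            cursor.set(seq);
        }

        // Final stage: depends on BOTH upstream processors.
        int both = 0;
        for (int seq = 0; seq < count; seq++) {
            awaitAll(seq, fizzSeq, buzzSeq);
            if (ring[seq].fizz && ring[seq].buzz) {
                both++;
            }
        }
        return both;
    }
}
```

The final stage never looks at the publisher's cursor directly. It only needs the slowest of the processors it depends on, which is the essence of a barrier in this kind of graph.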
The ProducerBarrier always had a one-to-one relationship with the RingBuffer, so for ease of use its behaviour has been merged into the RingBuffer. This allows direct publishing into the RingBuffer.
The most complex part of using the Disruptor is the setting up of the dependency graph of EventProcessors. To simplify this for the most common cases we have integrated the DisruptorWizard project which provides a DSL as a fluent API for assembling the graph and assigning threads.
Significant performance tuning effort has gone into this release, resulting in a ~2-3X improvement in throughput depending on CPU architecture. For most use cases it is now an order of magnitude better than queue-based approaches. On Sandy Bridge processors I've seen over 50 million events processed per second.
Sequence tracking has been completely rewritten to reduce the usage of hardware memory barriers, indirection layers, and megamorphic method calls, resulting in a design that is much friendlier to the data and instruction caches. New techniques have been employed to prevent false sharing because the previous ones got optimised out by the Oracle Java 7 JVM.
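As an illustration only, one way to pad a counter against false sharing, and to give the padding an observable use so that it is less likely to be treated as dead, might look like the sketch below. The class name `PaddedCounterSketch` and the 64-byte cache line size are assumptions, and the post does not say which technique the release actually uses. Field layout is ultimately up to the JVM, so this is a hint rather than a guarantee.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical padded counter; the JVM decides the real field layout. */
class PaddedCounterSketch extends AtomicLong {
    // 48 bytes of padding after AtomicLong's own value. Together with the object
    // header this makes each instance larger than an assumed 64-byte cache line,
    // aiming to keep the values of neighbouring counters on different lines.
    public volatile long p1 = 1L, p2 = 1L, p3 = 1L, p4 = 1L, p5 = 1L, p6 = 2L;

    PaddedCounterSketch(long initialValue) {
        super(initialValue);
    }

    /** Reads every padding field so the fields are actually used by the program. */
    public long sumPadding() {
        return p1 + p2 + p3 + p4 + p5 + p6;
    }
}
```

The counter still behaves like a plain `AtomicLong`; only its footprint changes.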
The one area not seeing a significant performance increase is the sequencer pattern. The Disruptor is still much faster than queue-based approaches for this pattern, but a limitation of Java hits us hard here. Java on x86/x64 uses LOCK CMPXCHG for CAS operations to implement the AtomicLong incrementAndGet() method which, based on my measurements, is ~2-10X slower than LOCK XADD as contention increases. Hopefully Oracle will see the error of Sun's ways on this and embrace x86/x64 to take advantage of such instructions. Dave Dice at Oracle has blogged on the subject, so I live in hope.
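The cost of the CAS approach is easier to see when the increment is written out by hand. The sketch below is an illustrative retry loop built on `AtomicLong.compareAndSet()`, not the JDK's own source; `CasIncrementSketch`, `casIncrement` and `contendedCount` are hypothetical names. Every failed compare-and-set means another trip around the loop, whereas a single fetch-and-add instruction would never need to retry.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CasIncrementSketch {
    /** Increment by retrying compare-and-set until no other thread got in between. */
    static long casIncrement(AtomicLong counter) {
        for (;;) {
            long current = counter.get();
            long next = current + 1;
            if (counter.compareAndSet(current, next)) {
                return next;
            }
            // Another thread changed the value first: loop and try again.
            // Under contention these retries are where the time goes.
        }
    }

    /** Runs 'threads' workers that each increment 'perThread' times; returns the final count. */
    static long contendedCount(int threads, int perThread) {
        AtomicLong counter = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    casIncrement(counter);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }
}
```

The loop is correct under any amount of contention, since no increment is ever lost, but the number of retries grows as more threads compete for the same counter.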
Of special note for this release is the elimination of hardware memory barriers on x86/x64 for Sequence tracking. The beauty in the Disruptor design is that on CPU architectures that have a memory model whereby:
- “loads are not reordered with older loads”, and
- “stores are not reordered with older stores”;
it is then possible to take advantage of the semantics provided by AtomicLong to avoid the use of the Java volatile keyword, and thus hardware fences on x86/x64. The one sticky rule for concurrent algorithms, such as Dekker and Peterson locks, on x86/x64 is “loads can be reordered with older stores”. This is not an issue given the design of the Disruptor. The issue relates to the snooping of CPU local store buffers for older writes. I’m likely to blog in more detail about why this is the case at a later date. The code should be safe on other CPU architectures if the JVM implementers get the semantics of AtomicLong and Unsafe correct; however, your mileage may vary for performance on other architectures compared to x64.
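A minimal sketch of the idea, assuming the AtomicLong semantics in question are those of `lazySet()` (an ordered store that does not need a full fence on x86/x64). The post does not name the method, so that is my reading rather than a statement about the actual code. `LazySetSequenceSketch` and its methods are hypothetical names, there is a single publisher, and the buffer is sized so it never wraps.

```java
import java.util.concurrent.atomic.AtomicLong;

final class LazySetSequenceSketch {
    private final long[] slots;
    private final AtomicLong published = new AtomicLong(-1);

    LazySetSequenceSketch(int capacity) {
        slots = new long[capacity];  // assumption: never wraps, so no back-pressure needed
    }

    /** Single publisher only: write the payload, then make the sequence visible. */
    void publish(int sequence, long value) {
        slots[sequence] = value;      // plain store of the payload
        published.lazySet(sequence);  // ordered store: cannot be reordered before the payload store
    }

    /** Waits until 'sequence' has been published, then reads the payload. */
    long read(int sequence) {
        while (published.get() < sequence) {
            Thread.yield();
        }
        return slots[sequence];
    }

    /** One publisher thread sends 0..count-1; the calling thread sums what it reads. */
    static long transfer(int count) {
        LazySetSequenceSketch sequence = new LazySetSequenceSketch(count);
        Thread publisher = new Thread(() -> {
            for (int i = 0; i < count; i++) {
                sequence.publish(i, i);
            }
        });
        publisher.start();
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += sequence.read(i);
        }
        return sum;
    }
}
```

Because only one thread ever writes the sequence, and the reader only needs to see the payload once it has seen the sequence, store-store ordering on the writer side is enough. Nothing here relies on a store being ordered before a later load, which is the reordering x86/x64 permits.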
With this latest release it is becoming increasingly obvious how sensitive some CPU architectures are to processor affinity for threads. When an EventProcessor gets rescheduled on a different core, after its time-slice is exhausted or it yields, the resulting cache pollution really hits performance. For those who require more extreme and predictable performance I plan to release an Executor service with the Disruptor to allow the pinning of threads to CPU cores.
I'm also thinking of adding a progressive back-off strategy for waiting EventProcessors as a WaitStrategy. This strategy would first busy spin, then yield, then eventually sleep in millisecond periods to conserve CPU resources for those applications that burst for a while then go quiet.
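A rough sketch of how such a progressive back-off might be structured. This is not the Disruptor's WaitStrategy interface; the class name, the `Phase` enum and the thresholds (100 spins, 100 yields, 1 ms sleeps) are all hypothetical values chosen for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

final class ProgressiveBackoffSketch {
    static final int SPIN_TRIES = 100;           // assumed threshold
    static final int YIELD_TRIES = 100;          // assumed threshold
    static final long SLEEP_NANOS = 1_000_000L;  // one millisecond per sleep

    enum Phase { SPIN, YIELD, SLEEP }

    /** Chooses how to wait based on how many attempts have already failed. */
    static Phase phaseFor(int attempt) {
        if (attempt < SPIN_TRIES) {
            return Phase.SPIN;
        }
        if (attempt < SPIN_TRIES + YIELD_TRIES) {
            return Phase.YIELD;
        }
        return Phase.SLEEP;
    }

    /** Waits until 'cursor' reaches 'sequence', backing off progressively; returns the cursor value seen. */
    static long waitFor(long sequence, AtomicLong cursor) {
        int attempt = 0;
        long available;
        while ((available = cursor.get()) < sequence) {
            switch (phaseFor(attempt)) {
                case SPIN:
                    break;                               // busy spin: just re-check
                case YIELD:
                    Thread.yield();                      // give other threads a chance
                    break;
                case SLEEP:
                    LockSupport.parkNanos(SLEEP_NANOS);  // go quiet to conserve CPU
                    break;
            }
            if (attempt < Integer.MAX_VALUE) {
                attempt++;                               // saturate rather than overflow back to SPIN
            }
        }
        return available;
    }
}
```

The attempt counter starts from zero on every call, so a processor that is kept busy stays in the low-latency spinning phase and only drifts towards sleeping when the publisher goes quiet.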
- Memory Model: See Section 8.2 of http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html
- Dekker algorithm: http://en.wikipedia.org/wiki/Dekker%27s_algorithm
- Peterson algorithm: http://en.wikipedia.org/wiki/Peterson%27s_algorithm