Mechanical Sympathy: Processor Affinity

Tuesday, 19 July 2011

Processor Affinity - Part 1

In a series of articles I’ll aim to show the performance impact of processor affinity in a range of use cases.

Background

A thread of execution will typically run until it has used up its quantum (aka time slice), at which point it joins the back of the run queue, waiting to be re-scheduled as soon as a processor core becomes available. While running, the thread will have accumulated a significant amount of state in the processor, including instructions and data in the cache. If the thread can be re-scheduled to run on the same core as last time, it can benefit from all that accumulated state. A thread may equally not run to the end of its quantum because it has been pre-empted, or has blocked on IO or a lock. The same holds true when it is ready to run again.

There are numerous techniques available for pinning threads to a particular core. In this article I’ll illustrate the use of the taskset command on two threads exchanging IP multicast messages via a dummy interface. I’ve chosen this as the first example because in a low-latency environment multicast is the preferred IP protocol. For simplicity, I’ve also chosen not to involve the physical network while introducing the concepts. In the next article I’ll expand on this example and cover the issues that arise with a real network.

1. Create the dummy interface

$ su -
# modprobe dummy
# ifconfig dummy0 172.16.1.1 netmask 255.255.255.0
# ifconfig dummy0 multicast
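
Before moving on, it can be worth confirming that the JVM sees the new interface and that it is flagged for multicast. The following is a minimal sketch and not part of the original test code: the class name `InterfaceCheck` and its `report` method are made up for illustration, and it uses only the standard `java.net.NetworkInterface` API.

```java
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Collections;

// Hypothetical helper, not from the article: summarises how the JVM sees an interface.
public class InterfaceCheck {

    // Builds a one-line summary of the named interface, or says it was not found.
    static String report(String name) throws SocketException {
        NetworkInterface nif = NetworkInterface.getByName(name);
        if (nif == null) {
            return name + ": not found";
        }
        return name + ": up=" + nif.isUp()
                + " multicast=" + nif.supportsMulticast()
                + " addresses=" + Collections.list(nif.getInetAddresses());
    }

    public static void main(String[] args) throws SocketException {
        // Defaults to the interface name used in step 1.
        String name = args.length > 0 ? args[0] : "dummy0";
        System.out.println(report(name));
    }
}
```

Running `java InterfaceCheck dummy0` after step 1 should report the interface as up and multicast-capable, along with the 172.16.1.1 address. If it prints "not found", the later steps are unlikely to work.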

2. Get the Java files (Sender and Receiver) and compile them

$ javac *.java
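
The Sender and Receiver sources are not included in this text, so the following is a stand-in sketch rather than the author's code. It assumes a fixed UDP port (4446), an 8-byte payload carrying a sequence number, and that the sender's third argument is the number of messages to send; none of these details are stated in the article. The class name `MulticastSketch` and its `send`/`receive` modes are likewise hypothetical.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.NetworkInterface;
import java.nio.ByteBuffer;

// Hypothetical stand-in for the article's sender and receiver programs.
public class MulticastSketch {
    static final int PORT = 4446;       // assumed; the article does not state a port
    static final int MESSAGE_SIZE = 8;  // assumed payload: one long sequence number

    static byte[] encode(long sequence) {
        return ByteBuffer.allocate(MESSAGE_SIZE).putLong(sequence).array();
    }

    static long decode(byte[] payload) {
        return ByteBuffer.wrap(payload).getLong();
    }

    static NetworkInterface lookup(String name) throws IOException {
        NetworkInterface nif = NetworkInterface.getByName(name);
        if (nif == null) {
            throw new IOException("No such interface: " + name);
        }
        return nif;
    }

    // Sends `count` sequence-numbered datagrams to the group via the named interface.
    static void send(String group, String ifName, long count) throws IOException {
        InetAddress address = InetAddress.getByName(group);
        try (MulticastSocket socket = new MulticastSocket()) {
            socket.setNetworkInterface(lookup(ifName));
            for (long i = 0; i < count; i++) {
                byte[] payload = encode(i);
                socket.send(new DatagramPacket(payload, payload.length, address, PORT));
            }
        }
    }

    // Joins the group on the named interface and prints a message count roughly
    // once per second while traffic is flowing. Runs until the process is killed.
    static void receive(String group, String ifName) throws IOException {
        InetSocketAddress groupAddress =
                new InetSocketAddress(InetAddress.getByName(group), PORT);
        try (MulticastSocket socket = new MulticastSocket(PORT)) {
            socket.joinGroup(groupAddress, lookup(ifName));
            DatagramPacket packet = new DatagramPacket(new byte[MESSAGE_SIZE], MESSAGE_SIZE);
            long received = 0;
            long nextReport = System.nanoTime() + 1_000_000_000L;
            while (true) {
                socket.receive(packet);
                received++;
                long now = System.nanoTime();
                if (now - nextReport >= 0) {
                    System.out.println(received + " messages received, last sequence "
                            + decode(packet.getData()));
                    received = 0;
                    nextReport = now + 1_000_000_000L;
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 4 && args[0].equals("send")) {
            send(args[1], args[2], Long.parseLong(args[3]));
        } else if (args.length == 3 && args[0].equals("receive")) {
            receive(args[1], args[2]);
        } else {
            System.err.println("usage: MulticastSketch send <group> <interface> <count>"
                    + " | receive <group> <interface>");
        }
    }
}
```

With this sketch the equivalent invocations would be `java MulticastSketch receive 230.0.0.1 dummy0` in one window and `java MulticastSketch send 230.0.0.1 dummy0 20000000` in the other. Calling `System.nanoTime()` on every message adds some overhead, so absolute throughput figures from it should not be compared with the article's results.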

3. Run the tests without CPU pinning

Window 1:
$ java MultiCastReceiver 230.0.0.1 dummy0

Window 2:
$ java MultiCastSender 230.0.0.1 dummy0 20000000

4. Run the tests with CPU pinning

Window 1:
$ taskset -c 2 java MultiCastReceiver 230.0.0.1 dummy0

Window 2:
$ taskset -c 4 java MultiCastSender 230.0.0.1 dummy0 20000000
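
To confirm that the affinity set by taskset has actually reached the JVM, a process can read its own allowed CPU list from `/proc/self/status`. This is a Linux-only sketch under that assumption; the class name `AffinityCheck` and the method `allowedCpus` are made up for illustration and are not part of the original tests.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not from the article: reports the CPUs this process may run on.
public class AffinityCheck {

    // Extracts the value of the "Cpus_allowed_list" field from the lines of a
    // Linux /proc/<pid>/status file, or returns null if the field is absent.
    static String allowedCpus(List<String> statusLines) {
        String key = "Cpus_allowed_list:";
        for (String line : statusLines) {
            if (line.startsWith(key)) {
                return line.substring(key.length()).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/proc/self/status"));
        System.out.println("Allowed CPUs: " + allowedCpus(lines));
    }
}
```

Launched as `taskset -c 2 java AffinityCheck`, this would be expected to report core 2 only, whereas a plain `java AffinityCheck` would normally list every core the process is permitted to use. Because taskset sets the mask before the JVM starts, threads the JVM creates afterwards inherit the same mask.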

Results

Once per second, the tests output the number of messages they have sent and received. A typical example run is charted in Figure 1 below.

Figure 1.

The interesting thing I've observed is that throughput in the unpinned test follows an unpredictable step function. Across many runs I've seen different patterns, but all share this step-like nature. The pinned tests give consistent throughput with no step pattern, and always the highest throughput.

This test is not particularly CPU intensive, nor does it access the physical network device, yet it shows how critical processor affinity is, not just to high performance but also to predictable performance. In the next article of this series I'll introduce a network hop and the issues arising from interrupt handling.

19 comments:

billywhizz, 19 July 2011 at 23:24
i see the same results here. also, if you make sure to pin the two processes to cores that share the same L2 cache you get double the throughput over two cores on different L2 caches. I presume this is the overhead of the cache interconnect?

Stephen Souness, 20 July 2011 at 00:26
Hi Martin.

No doubt you will already have this in mind for a future post, but I am curious about what sort of constraints you may have in place for ensuring that other threads are not utilising the resources of the CPUs that the sender and receiver processes (obviously single-threaded) have affinity to.

Martin Thompson, 20 July 2011 at 07:24
When sharing the same L2 cache I'm assuming you are using a pre-Nehalem Intel processor such as Penryn? If so, you are seeing the benefits of exchanging data via the L2 rather than the L3 cache as in my test. This will obviously be faster between two cores but does not scale to more cores as well as the Nehalem processors do. Most processors now operate a 3 layer cache with only the third level shared if you discount hyper threading.

Martin Thompson, 20 July 2011 at 07:27
taskset is the cheap and cheerful means of setting affinity. Other means exist such as cgroups which can be used to contain OS threads for avoiding contention with the cores assigned to specific tasks. I used taskset for quick illustration of what is possible.

billywhizz, 20 July 2011 at 19:15
i've used taskset in the past to pin init and everything under it to one core and then have my "soft-realtime" processes pinned to the other cores on the box. this way the OS shouldn't interfere with any of your application processes. Idea is to always have at least one core dedicated to the OS. Linux containers and cgroups are also well worth investigating...

Xin Wang, 29 July 2011 at 03:26
How about processor affinity for interrupts? Do you think it is good practice to dedicate one CPU to interrupt handling?

Martin Thompson, 29 July 2011 at 06:56
Dedicating a CPU for interrupt handling can be a very valid technique for certain types of workload. It is one of the points I plan to cover in the next instalment of this series.

Smartdreamer, 28 November 2011 at 19:41
Martin, this is a great post.

You finish by mentioning "In the next article of this series [...]". And as the title suggests, there should be a Part 2. Where is it? Eagerly waiting for it.

Continue the great work!

Anonymous, 29 March 2012 at 17:45
You have observational evidence that pinning helps, which is good, but you assign the cause as being accumulated processor state. How did you reach that conclusion?

I base that question on the following: when the next thread is scheduled to run, all the processor registers, cache lines etc. will be loaded for that thread, effectively flushing all of your current thread's state (indeed the OS should save all that state for you). This will continue for subsequent threads until your thread is re-scheduled to run on that processor.

Regards,
Matt

Alex Lam, 17 November 2014 at 07:14
Hi,

For the dummy interface part, can I just use lo interface and 127.0.0.1 instead?

Alex

Ivan Mushketyk, 23 September 2016 at 13:17
Hi Martin,

How will this work if a process has more than one thread? Will it pin all threads or will it pin only the main thread?

stachu, 18 November 2016 at 12:50
Hi, great article, thx!
Is there part II released? It sounds like you were to describe some interesting stuff - interrupt handling.
Cheers,
Michał

sha, 20 December 2016 at 05:51
Hi Martin,

The links to the source code (Sender and Receiver) are broken.
Could you please update them?