Concurrency – ramp up your skills

Yesterday, I gave my Extreme Concurrency talk at rg-dev user group. After the talk I had some really interesting discussions and was asked to provide some resources in the low level concurrency I was talking about. So here’s the list of books, talks and blog posts that can help you to ramp up your skills

Videos:

  1. [C++] Herb Sutter “atomic Weapons” – it’s about C++ but covers memory models in a way, that’s easy to follow and learn how it works
    1. Part 1
    2. Part 2

MSDN:

  1. .NET Volatile class – it has a good description of what half-barriers are and properly shows two counterparts Read & Write methods
  2. .NET Interlocked class – the other class with a good description providing methods that are executed atomically. Basically, these methods are JITted as single assembler operations.

Code:

  1. RampUp – a project of mine 🙂
  2. [JAVA] Aeron – the messaging library

Books:

  1. Concurrent programming on Windows by Joe Duffy – this is a hard book to go through. It’s demanding and requires a lot of effort but is the best book if you want to really understand this topic

Blogs:

  1. Volatile reads and writes by Joe Duffy
  2. Sayonara volatile by Joe Duffy
  3. Atomicity, volatility and immutability are different by Eric Lippert – that’s the last part of this series
  4. [JAVA] Psychosomatic, lobotomy, saw – the name is strange but you won’t find here disturbing videos. What you’ll find though, is a deep-dive into memory models.

RampUp journey ends

My RampUp project has been an important journey to me. I verified my knowledge about modern hardware and dealt with interesting cases writing a really low level code in .NET/C#. Finally, writing code that does not allocate is demanding but doable. This journey ends today.

RampUp has been greatly influenced by Aeron, a tool for publishing your messages really fast. You can easily saturate the bandwidth and get very low service time. Fortunately, for .NET ecosystem, Adaptive Consulting has just opened an Aeron client port here. This is unfortunate for RampUp as this port covers ~70-80% of RampUp implementation. Going on with RampUp would mean, that I need to chase Aeron and compete somehow with Aeron.NET to get an .NET environment attention, and in result, participation (PRs. issues, actual use cases). This is what I don’t want to do.

RampUp source code won’t be removed as still, I find it interesting and it shows some really interesting low level cases of CLR. I hope you enjoyed the ride. Time to move on 🙂

Single producer single consumer optimizations

The producer-consumer relationship is one of the most fundamental cooperation patterns. Some components produce values, issues requests and some consume/handle them. Depending on the number of components at the end of this dependency it’s called ‘single/multi producer single/multi consumer’ relationship. It’s important to make this choice explicit, because as with every explicit choice, it enables some optimizations. I’d like to share some thoughts o the optimizations taken in the single consumer single producer scenario in the RampUp library provided by OneToOneRingBuffer.

The behavior of ring buffers in RampUp is ported from Java’s Agrona. They provide a queue that enables reading sequentially on the consumer side. The reasoning behind it is that sequential reads are CPU friendly, so that consumer can process messages much quicker. For ManyToOneRingBuffer the production part is quite complex. It proceeds as follows:

  1. check against the consumer position, is there enough of space
  2. allocate a slot in the ring (this is done with Interlocked operations, in a loop, may take a while)
  3. write a header in an ordered way (using volatile)
  4. put data
  5. write the header again marking the message as published

This brings a lot of unneeded work for a single producer. When considering a single producer, there’s nothing to compete with. The only check that needs to be made is that the producer does not overlap with the consumer. So the algorithm looks as follows:

 

  1. check against the consumer position, is there enough of space
  2. put data
  3. write the header again marking the message as published
  4. write the tail value for future writes

Removal of Interlocked and lowering the number of Volatile operations can improve the producer performance greatly (less synchronization).

 

If you wanted to compare these two on your own, here you are: ManyToOne and OneToOne.

Happy producing (and consuming).

A pointer to a generic method argument

Let’s consider a following method signature of an interface taken from a RampUp interface.


bool Write<TMessage>(ref Envelope envelope, 
    ref TMessage message, IRingBuffer bufferToWrite) 
    where TMessage : struct;

It’s a fairly simple signature, enabling to pass a struct of any size using just a reference to it, without copying it. Now let’s consider the need of obtaining a pointer to this message. Taking a pointer could be needed for various reasons. One could be getting fields by offset, another could be using memcpy for copying the value to any given address. Is it possible to get this pointer in C# code?

No pointers for generic parameters

Unfortunately, you can’t do it in C#. If you try to obtain a pointer to a generic parameter, you’ll be informed about the compiler error. If you can’t do it in C#, is there any other .NET language one could use to get it? Yes, there is. It’s the foundation of .NET programs, the MSIL itself and if it’s MSIL, it means emitting code dynamically.

Ref looks like a pointer

What is a reference to a struct? It looks like a pointer to me. What if we could load it and just assume that it is a pointer? Would CLR accept this program? It occurs that it would. I won’t cover the whole implementation which can be found in here, but want to accent some points.

  • CLR uses the argument with index 0 to passing this. If you want to load a field you need to use the following sequence of operations:
    • Ldloc_0; // load this on the stack
    • Ldfld, “Field1” // pops this loading the value named “Field1” on the stack
  • For Write method, getting a pointer to a message is nothing more than calling an op code: Ldarg_2. As the struct is passed by reference, it can be treated as a pointer by CLR and it will.

I encourage you to download the RampUp codebase and play a little bit with an emitted implementation of the IMessageWriter. Maybe you’ll never need to take the pointer to a generic method parameter (I did), but it’s a good starter to learn a little about emitting code.

False sharing is dead, long live the Padded

False sharing is a common problem of multithreaded applications in .NET. If you allocate objects in/for different threads, they may land on the same cache line impacting the performance, limiting gains from scaling your app on a single machine. Unfortunately, because of the multithreaded nature of the RampUp library it’s been suffering from the same condition. I’ve decided to address by providing a tooling rather than going through the whole codebase and apply LayoutKind.Explicit with plenty of FieldOffsets

Padded is born

The easiest and the best way of addressing cross cutting concerns in your .NET apps I’ve found so far is Fody. It’s a post compiler/weaver based on the mighty Mono.Cecil library. The tool has a decent documentation, allowing one to create a even quite complex plugin in a few hours. Because of this advantages I’ve used it already in RampUp but wanted to have something, which can live on its own. That how Padded was born.

Pad me please

Padded uses a very simple technique of adding a dozen of additional fields. According to the test cases provided, they are sufficient enough to provide enough of space to prohibit overlapping with another object in the same cache line. All you need is to:

  1. install Padded in your project (here you can find nuget) in a project that requires padding
  2. declare one attribute in your project:
    namespace Padded.Fody
    {
    public sealed class PaddedAttribute : Attribute { }
    }
    
  3. mark the classes that need padding with this attribute.

Summary

Marking a class/struct with one attribute is much easier than dealing with its layout using .NET attributes, especially, as they were created not for this purpose. Using a custom, small tool to get the needed result is the way to go. That’s how & why Padded was provided.

Producer – consumer relationship

In the last post about the RampUp library I covered on of the foundations: IRingBuffer. Now I’d like to describe the contract it fulfills.

Producer consumer

If you take a look at the IRingBuffer you’ll see Write/Read methods. These two are responsible for producing/consuming or writing/reading messages to the buffer in FIFO way. What are the guarantees behind such an interface? What about concurrent accessing this structure?

Multimulti

The easiest approach for distinguishing accessing patterns is considering whether the structure could be accessed by one or more threads. If you consider producer/consumer you’ll see that there are four options:

  1. SPSC – Single Producer Single Consumer – only one thread produces items, and another consumes them
  2. MPSC – Multi Producer Single Consumer – multiple threads may produce items in a safe manner, again there’s a single consuming thread
  3. SPMC – Single Producer Multi Consumer – this could be treated as a distributor of work
  4. MPMC – Multi Producer Multi Consumer – multi/multi, ConcurrentQueue is a good example of it

Unfortunately nothing comes for free. If you want to get multi, you’ll pay the price of handling friction on that end. On the other hand, if you want to design a system where queues provide transport between different parts of the system, you’ll need to enable multiple producers for sure, as there’s going to be more than one system element. If you want to process items in an order they appeared and leave the locking issues, just write a fast single threaded code, a single consumer with a single worker thread would be the way to go. Of course this worker thread may access other queues and produce items for them (hence, multi producer is needed).

The ring buffer implementation in RampUp provides exactly MPSC behavior, as it’s prepared to handle items in order, by a single thread.

Unsafe buffer in RampUp

As we’re moving towards the core of RampUp, before visiting the most important parts we need to discuss one abstraction provided in the library named IUnsafeBuffer. The abstraction provides an interface a’la Stream wrapping a set of operations over a stream of an unmanaged memory. Currently, there’s only one implementation using VirtualAlloc, but it’s highly probable that in the future memory mapped files will be used as it’s the easiest way of providing a cross-process visible memory. Now let’s take a look at the following members of IUnsafeBuffer


public interface IUnsafeBuffer : IDisposable
{
    int Size { get; }
    unsafe byte* RawBytes { get; }
    AtomicLong GetAtomicLong(long index);
    AtomicInt GetAtomicInt(long index);
    void Write(int offset, ByteChunk chunk);
    void ZeroMemory(int start, int length);
}

As you can see the provided methods are a bit similar to the operations provided by the Stream. Currently pointer is leaked with RawBytes, but this is a subject to change. What’s important, the unsafe buffer provides way to get atomic wrappers when needed. The atomics are designed to be used with any unmanaged memory provider, especially with IUnsafeBuffer. This API will be needed in creating the next data structure in RampUp