Roslyn coding conventions applied

Roslyn is a ‘compiler as a service’ provided for both VisualBasic.NET & C#. It has a thriving community of people providing new features for .NET languages. One of the most important parts of this community is a guideline on how to contribute, which defines basic rules for coding and issuing pull requests. The most important part, not only from the Roslyn perspective but as general .NET guidance, are the Coding Conventions.

Avoid allocations in hot paths

This is a rule that should be close to every .NET developer’s heart, not only those who work on a compiler. It’s not about ‘premature optimization’. It’s about writing performant code that can actually sustain its performance when executing its hot paths in the majority of requests. Give it a try, and next time you write some code (today?) keep this rule in mind. This kind of awareness may save you from profiling your production system or taking a dump just to learn that allocating a list for every cell of a two-dimensional array wasn’t the best approach.
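
To make it concrete, here’s a minimal, hypothetical sketch of the two-dimensional-array case above: the first method allocates a fresh list for every cell, the second reuses a caller-provided buffer so the hot path stays allocation-free.

using System.Collections.Generic;

public static class HotPathSketch
{
    // Allocates a new List<int> for every cell – constant GC pressure in the hot path.
    public static List<int> NeighboursAllocating(int[,] grid, int x, int y)
    {
        var result = new List<int>();
        if (x > 0) result.Add(grid[x - 1, y]);
        if (y > 0) result.Add(grid[x, y - 1]);
        return result;
    }

    // Reuses one buffer passed in by the caller – no allocations per cell.
    public static void NeighboursReusing(int[,] grid, int x, int y, List<int> buffer)
    {
        buffer.Clear();
        if (x > 0) buffer.Add(grid[x - 1, y]);
        if (y > 0) buffer.Add(grid[x, y - 1]);
    }
}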

What’s your hot path

That’s a good question that everyone should answer on a per-system basis. I asked this question a few months ago for my RampUp library:

what’s the hot path for a system using message passing?

The answer was surprisingly obvious: the message passing itself. EventStore, which uses a similar approach, uses classes for message passing. This, plus every other object, creates some GC pressure. Back then, I asked myself a question: is it possible to use structs for internal process communication and come up with a good way of passing them? My reasoning was as follows: if I remove the GC pressure coming from messages, then I remove the hottest path of allocations, and this can greatly improve the stability of my system. Was it easy? No, it wasn’t, as I needed to emit a lot of code and discover some interesting properties of the CLR. Did it work? Yes, it did.
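
For illustration only (these are not RampUp’s actual message types): a class-based message means a heap allocation for every send, while a struct-based one is copied by value and creates no GC pressure.

// Illustrative only – not RampUp's actual API.
public sealed class OrderPlacedClass   // a new heap object for every message sent
{
    public long OrderId;
    public decimal Amount;
}

public struct OrderPlacedStruct        // copied by value, no per-message allocation
{
    public long OrderId;
    public decimal Amount;
}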

Next time you write a piece of code or design a system, keep the hot path question in mind and answer it. It’s worth it.

False sharing is dead, long live the Padded

False sharing is a common problem of multithreaded applications in .NET. If you allocate objects in/for different threads, they may land on the same cache line, impacting performance and limiting the gains from scaling your app on a single machine. Unfortunately, because of the multithreaded nature of the RampUp library, it has been suffering from the same condition. I’ve decided to address it by providing tooling rather than going through the whole codebase and applying LayoutKind.Explicit with plenty of FieldOffsets.

Padded is born

The easiest and best way of addressing cross-cutting concerns in .NET apps I’ve found so far is Fody. It’s a post-compiler/weaver based on the mighty Mono.Cecil library. The tool has decent documentation, allowing one to create even a quite complex plugin in a few hours. Because of these advantages I’ve already used it in RampUp, but I wanted to have something that can live on its own. That’s how Padded was born.

Pad me please

Padded uses a very simple technique of adding a dozen additional fields. According to the test cases provided, they give enough space to prevent an object from overlapping with another object on the same cache line. All you need to do is:

  1. install Padded (the package is available on NuGet) in the project that requires padding
  2. declare one attribute in your project:
    using System;

    namespace Padded.Fody
    {
        public sealed class PaddedAttribute : Attribute { }
    }
    
  3. mark the classes that need padding with this attribute (a usage sketch follows below).
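
A usage sketch – the Counter type is just an illustrative name:

using Padded.Fody;

[Padded]
public sealed class Counter   // any class/struct shared between threads is a candidate
{
    public long Value;
}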

Summary

Marking a class/struct with a single attribute is much easier than dealing with its layout using the .NET layout attributes, especially as they were not created for this purpose. Using a small, custom tool to get the needed result is the way to go. That’s how & why Padded was created.

LayoutKind.Sequential? Not quite

If you want to write a performant multi-threaded application, which is actually the aim of RampUp, you have to deal with padding. The gains can be pretty big, considering that working with threads means you need to give each of them its own space to work in.

False sharing

False sharing is nothing more than two or more threads trying to use memory that’s mapped to a single cache line. The best case for any thread is to have its own memory space separated – and by separation I mean having enough padding on the right and on the left to keep the spaces of two threads from overlapping. The easiest way is to add an additional 64 bytes (the size of a cache line) at the beginning and at the end of the struct/class to ensure that no other thread will be able to allocate memory close enough. This mechanism is called padding.

Padding

The easiest way to apply padding is applying the StructLayoutAttribute. If LayoutKind.Sequential is used, then adding 4 Guid fields at the beginning and 4 Guid fields at the end should work just fine. The size of a Guid is 16 bytes, which gives us the needed 64 bytes. A harder way of doing it is using LayoutKind.Explicit, as it requires adding a FieldOffsetAttribute to every field of the structure/class, explicitly stating its offset in memory. With this approach, it’s easy to start at offset 64 and leave some space at the end of the class.
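
A rough sketch of the sequential-layout approach (the field names are illustrative):

using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct PaddedCounter
{
    private Guid _pad1, _pad2, _pad3, _pad4;   // 4 × 16 bytes = 64 bytes of leading padding

    public long Value;                         // the field actually used by a thread

    private Guid _pad5, _pad6, _pad7, _pad8;   // 64 bytes of trailing padding
}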

Problem

LayoutKind.Sequential works perfectly. Almost. Unfortunately, if any field has a type whose layout is not Sequential or Explicit, the CLR will simply ignore the sequential requirement and silently apply automatic layout, ruining the padding. This is the regular case – all classes use Auto layout by default. Unfortunately, it leaves the developer with the need to apply the field offsets manually.
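
A rough sketch of the manual, explicit-layout alternative (again, illustrative names; the offsets start at 64 and the declared Size leaves trailing padding):

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Explicit, Size = 192)]
public sealed class ExplicitlyPaddedCounter
{
    [FieldOffset(64)]
    public long Value;          // the first cache line (offsets 0–63) is pure padding

    [FieldOffset(72)]
    public long AnotherValue;   // Size = 192 leaves a padded cache line after the fields
}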

Solution

As I need this padding behavior for RampUp, I’m creating a small Fody weaver plugin called Padded, which will automatically calculate offsets (possibly with some memory overhead) for any class/struct marked with the proper attribute. Hopefully it will be useful not only for RampUp but for other performance-oriented projects as well.

Replacing a generic dictionary

There is a moment when you profile your high-throughput system and hit the wall. And it’s not your code, but some BCL element. That’s what happened in RampUp when I was profiling the Write part of the buffer.

The writer is emitted, but at its very foundation it uses a dictionary of metadata stored per message type. The metadata are simple:

  • the message size
  • the offset of the message envelope

Before optimization it used a generic Dictionary keyed by the message type and storing the message metadata. It was horribly slow. As the set of messages does not change, you can easily provide the set of message types up-front. For each message type one can obtain a RuntimeTypeHandle, which can easily be converted to a long. With a set of longs, you can select the minimal one and just subtract it from all the values, reducing each value to an int. So here you are: you have a way of turning a type into an int, and ints are much easier to compare and handle. One could even use a simple hash map just to map between ints and metadata. That’s how the overhead of obtaining metadata was reduced, with IntLookup<TValue>. After applying the change, write performance increased by 10%.
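
A rough sketch of the type-handle trick – this is not RampUp’s actual IntLookup<TValue> implementation, just the idea:

using System;
using System.Linq;

public sealed class TypeToIntSketch
{
    private readonly long _min;

    public TypeToIntSketch(params Type[] messageTypes)
    {
        // RuntimeTypeHandle.Value is an IntPtr, easily converted to a long.
        _min = messageTypes.Min(t => t.TypeHandle.Value.ToInt64());
    }

    // Rebase the handle against the minimum so the known, fixed set of types maps to small ints.
    public int GetId(Type messageType) =>
        (int)(messageType.TypeHandle.Value.ToInt64() - _min);
}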

Atomic* in RampUp

In the two recent posts we’ve built strong/relaxed foundations for taking a very first look at some parts of the RampUp library.

Structs vs classes

The differences seem to be obvious. Structs are allocated on the stack (they land on the heap when they are part of an array or another heap object), are passed by value unless passed by ref or out, and should be small as they are copied by value as well. Instances of classes are allocated on the heap and are passed by reference. If we considered a proper wrapper for all the operations related to the value of a long, including the operations provided by the Volatile class, we’d need to use a class, as a struct would be copied and hence referring to the same address would be impossible, right? Not exactly.

Hello pointer

Let’s take a look at the AtomicLong implementation now. It’s a struct. Why can it be a struct? Oh, wait a second. It doesn’t have a long field representing the value. It has a long* field, representing the address of the value! Now, when an AtomicLong is copied, the address is copied, not the value; hence it preserves the value and all the operations issued against it. How can one obtain a pointer to a long which will be valid forever? It can be done in many ways, but one of them is to allocate off-heap, unmanaged memory, for instance with Marshal.AllocHGlobal. RampUp does it in a different way (still using unmanaged memory), but the result is the same: you have a nice wrapper around all the operations for a single long gathered in one place.
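
RampUp allocates its unmanaged memory differently, but a minimal sketch of the idea – a struct holding only a pointer, compiled with unsafe – could look like this:

using System;
using System.Runtime.InteropServices;
using System.Threading;

public unsafe struct AtomicLongSketch : IDisposable
{
    private readonly long* _ptr;   // the address, not the value, is what gets copied

    public AtomicLongSketch(long initialValue)
    {
        // Unmanaged memory never moves, so the pointer stays valid until freed.
        _ptr = (long*)Marshal.AllocHGlobal(sizeof(long));
        *_ptr = initialValue;
    }

    public long VolatileRead() => Volatile.Read(ref *_ptr);

    public void VolatileWrite(long value) => Volatile.Write(ref *_ptr, value);

    public void Dispose() => Marshal.FreeHGlobal((IntPtr)_ptr);
}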

Ref vs pointer

The most interesting parts of this struct are the methods using ref. As you can see below, to use Volatile.Read the pointer is first dereferenced and then a reference to the value is taken. How does it work?

[Pure]
public long VolatileRead()
{
    return Volatile.Read(ref *_ptr);
}

To answer this question, you can compile RampUp and use an IL viewer (dotPeek, ILSpy, ildasm, whatever you want). You’ll see an interesting thing: the pointer is passed to the method without any modifications. In other words, dereferencing and then taking a reference to the obtained value negate each other, resulting in the pointer itself being passed to Volatile.Read.

Summing up

This simple wrapper around all the memory and memory-barrier related operations for a long is one of the pillars of RampUp. Now we can move forward with more advanced operations.

Memory models relaxed foundations

In the previous post we built up some basic foundations for the total ordering of CPU instructions. It was mentioned that this total ordering is much too strict, and if we want to use the hardware in the best possible way there’s a need for relaxed barriers that allow some reorderings, especially when considering high-throughput libraries like RampUp.

2×2 = 4

From the CPU perspective there are only two operations that can be performed on a memory location. These are:

  • Store – a value is written to the memory
  • Load – a value is read from the memory

If you consider the order of two such operations you can get the following chains (pairs):

  • Load-Store
  • Load-Load
  • Store-Load
  • Store-Store

These four pairs are enough to consider the different reorderings. One can easily imagine reordering the elements in each of these pairs. Knowing these pairs and the definition of the full memory barrier, one can easily reason that the full barrier prohibits all of the reorderings mentioned above.

Volatile this, volatile that

There’s a class in .NET, used very infrequently, called Volatile. It has only two methods, with many overloads: Read and Write. Here are their semantics:

  • Volatile.Read – is equal to the following sequence of operations:
    • read the value from the memory
    • issue a barrier prohibiting Load-Store & Load-Load reorderings
  • Volatile.Write – is equal to the following sequence of operations:
    • issue a barrier prohibiting Store-Store & Load-Store reorderings
    • write the value to the memory

These two methods should be thought of as siblings and should be used together. If one thread writes a value with Volatile.Write and the other reads it with Volatile.Read, the value written by the first will become visible to the second after a while; in other words, reading the value in a loop will eventually yield the value written by the first thread. This behavior, combined with the barriers and the reorderings they disable, can enable extremely performant approaches.
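
A small sketch of the pairing (the names are illustrative): the writer publishes data and then sets a flag with Volatile.Write; the reader spins on Volatile.Read of the flag and, once it sees it, also sees the data written before it.

using System.Threading;

public sealed class Publisher
{
    private int _data;
    private bool _ready;

    public void Produce(int value)
    {
        _data = value;                      // ordinary write
        Volatile.Write(ref _ready, true);   // release: the _data store cannot move past this write
    }

    public int Consume()
    {
        while (!Volatile.Read(ref _ready)) { }   // acquire: later reads cannot move before this read
        return _data;                            // guaranteed to see the value written in Produce
    }
}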

Summing up & RampUp implications

Now memory barriers and reorderings are known a bit better. You know that efficient, performant code is code that lets the CPU reorder instructions to optimize its pipeline. You also know that beyond full memory barriers there are much less obstructive methods that can be used to apply a partial ordering of instructions. As our knowledge about memory expands, we’re getting closer to analyzing the very first elements of RampUp, AtomicLong & AtomicInt.

Memory models foundations

RampUp is aimed at providing a very performant abstraction over modern CPUs, aligned with today’s hardware. Although one doesn’t have to know all the corner cases of memory models to use it, it’s worth knowing the foundations supporting the RampUp library. This post touches the tip of the memory-model iceberg. It’s crucial to embrace that knowledge if one wants to write truly, predictably performant code. And that’s my aim with RampUp.

Code is executed in a sequential way

It’s a lie: code isn’t executed in a sequential way. Considering the well-known latency pyramid, compare the times spent on accessing:

  1. CPU registry/operation – 1ns
  2. CPU cache – 10ns
  3. DRAM access – 60ns
  4. Disk, network – much more

If the CPU waited to get a value from DRAM before every operation, that would be highly inefficient. That’s why caches were introduced: they are much closer to the CPU and a lookup takes much less time. But you need to fill the cache. How is it done? In the optimal situation, the CPU, before executing a given chunk of code, can evaluate all the needed addresses and order their prefetch into the CPU caches. Sometimes, if given ‘lines of code’ do not depend on each other, they may be reordered for later execution, simply to avoid waiting on the values being fetched from memory. This behavior depends on the CPU architecture, but if you want to sharpen your memory-ordering skills, I encourage you to spend some time considering these reorderings.

Memory barriers

We can’t live without order, and neither can our code. The question is how much order we need. If all CPU instructions were executed in one strict order, that would bring starvation from waiting for values being fetched from RAM. So what are the ways of ensuring some ordering without destroying all the optimization possibilities connected with reordering? The mechanism for this is called a memory barrier.

The most common one is the full barrier. This barrier creates a point that cannot be crossed by any operation: all the operations before this point will be executed before it, and all the operations located after it – after. This type of barrier is issued, for instance, when:

  • using the lock statement – that’s why the behavior is so predictable when you use locks
  • using Interlocked atomic operations like Add or CompareExchange – the operations are atomic, but additionally they introduce a full barrier (this applies to Intel x86/x64 processors)
  • using Thread.MemoryBarrier
  • using anything built on top of the above (as those operations are ordered, so will be the code that uses them) – a small illustration follows this list.
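
Here is the classic store-buffering sketch (hypothetical fields): without the barriers, the two methods run on two threads could both return 0, each store still sitting in a store buffer; with Thread.MemoryBarrier that outcome is ruled out.

using System.Threading;

public sealed class StoreBufferingSketch
{
    private int _x, _y;

    public int ThreadOne()
    {
        _x = 1;
        Thread.MemoryBarrier();   // full barrier: the store to _x completes before the load of _y
        return _y;
    }

    public int ThreadTwo()
    {
        _y = 1;
        Thread.MemoryBarrier();   // full barrier: the store to _y completes before the load of _x
        return _x;
    }
}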

If there are full barriers, are there any other types? Yes, there are. We’ll cover them in the next post.