A pointer to a generic method argument

Let’s consider the following method signature, taken from an interface in RampUp.


bool Write<TMessage>(ref Envelope envelope, 
    ref TMessage message, IRingBuffer bufferToWrite) 
    where TMessage : struct;

It’s a fairly simple signature that lets you pass a struct of any size by reference, without copying it. Now let’s consider the need to obtain a pointer to this message. A pointer could be needed for various reasons. One could be reading fields by offset, another could be using memcpy to copy the value to any given address. Is it possible to get this pointer in C# code?

No pointers for generic parameters

Unfortunately, you can’t do it in C#. If you try to obtain a pointer to a generic parameter, the compiler will report an error. If you can’t do it in C#, is there any other .NET language one could use to get it? Yes, there is. It’s the foundation of .NET programs, MSIL itself, and using MSIL here means emitting code dynamically.

Ref looks like a pointer

What is a reference to a struct? It looks like a pointer to me. What if we could load it and just assume that it is a pointer? Would the CLR accept this program? It turns out that it would. I won’t cover the whole implementation, which can be found here, but I want to highlight some points.

  • The CLR passes this as the argument with index 0. If you want to load a field, you need the following sequence of operations:
    • Ldarg_0; // loads this onto the stack
    • Ldfld, “Field1” // pops this and pushes the value of the field named “Field1” onto the stack
  • For the Write method, getting a pointer to the message takes a single opcode: Ldarg_2. As the struct is passed by reference, the CLR can treat it as a pointer, and it does.
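
The idea of loading a by-ref argument and treating it as a pointer can be sketched with a DynamicMethod. This is a minimal illustration, not the RampUp implementation: the class and delegate names are hypothetical, and a concrete long is used instead of a generic TMessage. Because the emitted method is static, the by-ref argument sits at index 0 rather than index 2.

```csharp
using System;
using System.Reflection.Emit;
using System.Runtime.InteropServices;

public static class RefAsPointerSketch
{
    // Hypothetical delegate shape used only for this sketch.
    public delegate IntPtr AddressOfLong(ref long value);

    public static AddressOfLong Build()
    {
        var method = new DynamicMethod(
            "AddressOfLong",
            typeof(IntPtr),
            new[] { typeof(long).MakeByRefType() },
            typeof(RefAsPointerSketch).Module,
            true);

        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0); // the managed reference to the argument
        il.Emit(OpCodes.Conv_U);  // reinterpret it as a native unsigned integer
        il.Emit(OpCodes.Ret);     // return it as an IntPtr

        return (AddressOfLong)method.CreateDelegate(typeof(AddressOfLong));
    }

    // Takes the address of a stack local and reads the value back through it.
    public static long ReadBack(long input)
    {
        long value = input;
        IntPtr address = Build()(ref value);
        return Marshal.ReadInt64(address);
    }
}
```

The address is only safe to use while the referenced value cannot move. A local on the stack satisfies that; a field of a heap object would need pinning first.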

I encourage you to download the RampUp codebase and play a little with the emitted implementation of IMessageWriter. Maybe you’ll never need to take a pointer to a generic method parameter (I did), but it’s a good starting point for learning a little about emitting code.

Roslyn coding conventions applied

Roslyn is a ‘compiler as a service’ provided for both Visual Basic .NET and C#. It has a thriving community of people providing new features for .NET languages. One of the most important assets of this community is its contribution guideline, which defines basic rules for coding and issuing pull requests. The most important part, not only from the Roslyn perspective but as general .NET guidance, is the Coding Conventions.

Avoid allocations in hot paths

This is a rule that should be close to every .NET developer’s heart, not only to those who work on a compiler. It’s not about ‘premature optimization’. It’s about writing performant code that can actually sustain its performance when its hot paths are executed in the majority of requests. Give it a try, and the next time you write some code (today?) keep this rule in mind. This kind of awareness may spare you from profiling your production system or taking a dump just to learn that allocating a list for every cell of a two-dimensional array wasn’t the best approach.
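
To make the two-dimensional array case concrete, here is a small sketch. The class and method names are hypothetical and the workload (summing each cell’s neighbours) is invented for illustration. The first method allocates a list per cell, the second reuses one buffer across the whole loop.

```csharp
using System;
using System.Collections.Generic;

public static class HotPathSketch
{
    // Allocates a fresh list for every cell: rows * cols short-lived objects.
    public static long SumAllocating(int[,] grid)
    {
        long total = 0;
        for (int r = 0; r < grid.GetLength(0); r++)
        for (int c = 0; c < grid.GetLength(1); c++)
        {
            var neighbours = new List<int>();
            Collect(grid, r, c, neighbours);
            foreach (int n in neighbours) total += n;
        }
        return total;
    }

    // Reuses a single list for all cells: one allocation for the whole pass.
    public static long SumReusing(int[,] grid)
    {
        long total = 0;
        var neighbours = new List<int>(4);
        for (int r = 0; r < grid.GetLength(0); r++)
        for (int c = 0; c < grid.GetLength(1); c++)
        {
            neighbours.Clear();
            Collect(grid, r, c, neighbours);
            foreach (int n in neighbours) total += n;
        }
        return total;
    }

    // Adds the up/down/left/right neighbours of a cell to the given list.
    private static void Collect(int[,] grid, int r, int c, List<int> into)
    {
        if (r > 0) into.Add(grid[r - 1, c]);
        if (r < grid.GetLength(0) - 1) into.Add(grid[r + 1, c]);
        if (c > 0) into.Add(grid[r, c - 1]);
        if (c < grid.GetLength(1) - 1) into.Add(grid[r, c + 1]);
    }
}
```

Both methods return the same result; they differ only in how much garbage they leave behind for the GC.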

What’s your hot path

That’s a good question that everyone should answer for their own system. I asked this question a few months ago for my RampUp library:

what’s the hot path for a system using message passing?

The answer was surprisingly obvious: the message passing itself. EventStore, which takes a similar approach, uses classes for message passing. This, plus every other object, creates some GC pressure. Back then I asked myself: is it possible to use structs for in-process communication and come up with a good way of passing them? My reasoning was as follows: if I remove the GC pressure caused by messages, then I remove the hottest path of allocations, and this can greatly improve the stability of my system. Was it easy? No, it wasn’t, as I needed to emit a lot of code and discover some interesting properties of the CLR. Did it work? Yes, it did.

The next time you write a piece of code or design a system, keep the hot path question in mind and answer it. It’s worth it.

The art of benchmarking

I’ve been told that Akka can process 50 million messages per second on a laptop. This isn’t a number you hear every day, even if you write performance-focused applications.

I’ve recently been optimizing my RampUp library and I know that it can perform well, but reaching 50 million messages per second on my 4 hardware cores? That would be a hard thing to do. Possible, maybe, if the test were designed so that it groups cores in some way… The current official number is 10 million msg/s on my laptop, and the test uses two producers trying to flood a single consumer. It’s a multi-producer, single-consumer scenario. But let’s go back to the Akka benchmark.

The best result labelled with the phrase ‘single machine’ is this one. It actually was able to process 48 million messages per second on a single machine! That’s great. Let’s take a look at what kind of machine that is:

 

  • Processor: 48 core AMD Opteron (4 dual-socket with 6 core AMD® Opteron™ 6172 2.1 GHz Processors)
  • Memory: 128 GB ECC DDR3 1333 MHz memory, 16 DIMMs
  • OS: Ubuntu 11.10
  • JVM: OpenJDK 7, version “1.7.0_147-icedtea”, (IcedTea7 2.0) (7~b147-2.0-0ubuntu0.11.10.1)
  • JVM settings: -server -XX:+UseNUMA -XX:+UseCondCardMark -XX:-UseBiasedLocking -Xms1024M -Xmx2048M -Xss1M -XX:MaxPermSize=128m -XX:+UseParallelGC
  • Akka version: 2.0
  • Dispatcher configuration other than default: 
    parallelism 48 of fork-join-executor
    throughput as described

It’s not a laptop. It’s not a typical single machine. It’s quite a powerful server, with a special dispatcher configuration used to get this performance.

I’m not saying that it’s bad to use good hardware for your tests. I’m not trying to defend RampUp’s performance either, as it does not compete with Akka; it serves different purposes. I’m just saying that publishing benchmarks shouldn’t be about the number alone. Much more information is needed to give the whole background of a test. Again, the way you communicate numbers depends on whether you want to sell something with them or to provide real results. If you want the latter, choose your words wisely.
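
One way to keep a number tied to its background is to emit the environment together with the throughput. The sketch below is only an illustration of that habit: the class name and the workload (a single-threaded enqueue/dequeue loop on Queue<long>) are invented here and say nothing about RampUp or Akka performance.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Runtime.InteropServices;

public static class BenchmarkReportSketch
{
    // Runs a toy workload and returns the result together with its context.
    public static string Run(int messages)
    {
        var queue = new Queue<long>();
        long checksum = 0;

        var watch = Stopwatch.StartNew();
        for (long i = 0; i < messages; i++)
        {
            queue.Enqueue(i);
            checksum += queue.Dequeue();
        }
        watch.Stop();

        // Guard against a zero elapsed time for very small runs.
        double perSecond = messages / Math.Max(watch.Elapsed.TotalSeconds, 1e-9);

        return string.Join(Environment.NewLine, new[]
        {
            "Scenario: single thread, enqueue/dequeue on Queue<long> (toy workload)",
            "Messages: " + messages + ", checksum: " + checksum,
            "Throughput: " + perSecond.ToString("N0") + " msg/s",
            "Logical cores: " + Environment.ProcessorCount,
            "OS: " + RuntimeInformation.OSDescription,
            "Runtime: " + RuntimeInformation.FrameworkDescription,
            "Process architecture: " + RuntimeInformation.ProcessArchitecture
        });
    }
}
```

A report like this still leaves out things such as the CPU model and memory configuration, which would have to be added by hand.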

 

We accept pull requests

There are many protocols of collaboration. There’s the famous C4 from 0MQ, and there are others as well. Sometimes a project lives for years without explicitly stated collaboration rules and can still be successful. Sometimes it suffers from the bad assumptions collaborators made when they decided to invest their time in it. To lower the chance of any unhealthy friction, I’ve decided to write down at least some bullet points for RampUp and created a simple Collaboration.md (I’ve just been sent the very first PR for RampUp :D).

The main issue I wanted to address was the ‘I accept pull requests’ statement. This topic is discussed over and over again and can be summarized as follows: .NET devs are not willing to collaborate, and rather than providing PRs they just blame, complain, etc. The landscape is changing now, and everyone involved in .NET Open Source can surely feel it. More PRs are being issued, and the engagement that is so strongly needed is shifting towards actual participation rather than poking. It doesn’t matter whether you provide a new feature, sketching it and issuing a PR rather than spending days drawing UML, or spend one additional hour of your debugging time distilling a PR that reproduces a bug you’ve encountered. That’s what real involvement is. It’s not about talking, it’s about making things. Even if a PR is rejected for whatever reason, you learn a lot, you really participate, and you get real feedback about your work, not about your cheap talk.

I really hope that the new wave of OSS is coming. In that case, let’s swim with the tide.

False sharing is dead, long live the Padded

False sharing is a common problem of multithreaded applications in .NET. If you allocate objects in or for different threads, they may land on the same cache line, hurting performance and limiting the gains from scaling your app on a single machine. Unfortunately, because of its multithreaded nature, the RampUp library has been suffering from the same condition. I’ve decided to address it by providing tooling rather than going through the whole codebase and applying LayoutKind.Explicit with plenty of FieldOffsets.

Padded is born

The easiest and best way of addressing cross-cutting concerns in .NET apps that I’ve found so far is Fody. It’s a post-compiler/weaver based on the mighty Mono.Cecil library. The tool has decent documentation, allowing one to create even a quite complex plugin in a few hours. Because of these advantages I had already used it in RampUp, but I wanted something that can live on its own. That’s how Padded was born.

Pad me please

Padded uses a very simple technique of adding a dozen additional fields. According to the test cases provided, they give enough space to prevent another object from overlapping in the same cache line. All you need to do is:

  1. install Padded (here you can find the NuGet package) in the project that requires padding
  2. declare one attribute in your project:
    using System;

    namespace Padded.Fody
    {
        public sealed class PaddedAttribute : Attribute { }
    }
    
  3. mark the classes that need padding with this attribute.
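
Steps 2 and 3 together can be sketched as below. The Counter class and the Sample namespace are hypothetical names used only for illustration; the padding itself is added by the weaver at build time, so this snippet shows only the marking, not the resulting layout.

```csharp
using System;

namespace Padded.Fody
{
    // Step 2: the marker attribute declared in your own project.
    public sealed class PaddedAttribute : Attribute { }
}

namespace Sample
{
    // Step 3: a hypothetical class that should not share a cache line
    // with its neighbours. Without the weaver running, the attribute
    // is only metadata and changes nothing.
    [Padded.Fody.Padded]
    public sealed class Counter
    {
        public long Value;
    }
}
```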

Summary

Marking a class/struct with one attribute is much easier than controlling its layout with the built-in .NET attributes, especially as they were not created for this purpose. Using a small, custom tool to get the needed result is the way to go. That’s how and why Padded came to be.

Ping pong Bruce Lee test

There is a famous Bruce Lee clip showing him as a very good ping pong player using unusual tooling to get the job done. I thought that this ping pong match would be a great story for writing a test for my RampUp library, especially now that I’ve provided the first, most likely not final, version of the actor system.

To have more fun I split Bruce Lee into Bruce & Lee. Each part of Bruce Lee either pings or pongs.


public class Bruce : IHandle<Ping>
{
    public IBus Bus;

    public void Handle(ref Envelope envelope, ref Ping msg)
    {
        var p = new Pong();
        Bus.Publish(ref p);
    }
}

public class Lee : IHandle<Pong>
{
    public IBus Bus;

    public void Handle(ref Envelope envelope, ref Pong msg)
    {
        var p = new Ping();
        Bus.Publish(ref p);
    }
}

The ping/pong messages are only markers:


public struct Pong : IMessage {}

public struct Ping : IMessage {}

And the final execution of this setup can be summarized in:


public class Program
{
    public static void Main()
    {
        var system = new ActorSystem();
        IBus bus = null;

        system.Add(new Bruce(), ctx => { bus = ctx.Actor.Bus = ctx.Bus; });
        system.Add(new Lee(), ctx => { ctx.Actor.Bus = ctx.Bus; });

        system.Start();

        var p = new Pong();
        bus.Publish(ref p); // pong as Bruce
        // ... later
        system.Stop();
    }
}

I hope you like the example. I’m aware that the ActorSystem API isn’t the best possible API ever, but even in this shape it enables me to push RampUp forward.

LayoutKind.Sequential not

If you want to write a performant multithreaded application, which is exactly the aim of RampUp, you have to deal with padding. The gains can be pretty big, considering that working with threads means giving each of them its own space to work in.

False sharing

False sharing is nothing more than two or more threads trying to use memory that’s mapped to a single cache line. The best case for any thread is to have its own separate memory space, and by separate I mean having enough padding on the left and on the right to keep the spaces of two threads from overlapping. The easiest way is to add an additional 64 bytes (the usual size of a cache line) at the beginning and at the end of the struct/class, to ensure that no other thread can allocate memory close enough. This mechanism is called padding.
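
The effect can be observed with a small experiment. The sketch below uses hypothetical names and assumes a 64-byte cache line: two threads each increment their own slot of a shared long[] array. With slots 0 and 1 the counters are 8 bytes apart and almost certainly share a line; with slots 8 and 24 they are 128 bytes apart and cannot.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class FalseSharingSketch
{
    // Runs two threads, each incrementing its own slot, and returns the elapsed time.
    public static TimeSpan Run(long[] counters, int first, int second, int iterations)
    {
        var a = new Thread(() => Spin(counters, first, iterations));
        var b = new Thread(() => Spin(counters, second, iterations));

        var watch = Stopwatch.StartNew();
        a.Start();
        b.Start();
        a.Join();
        b.Join();
        return watch.Elapsed;
    }

    // Interlocked forces every increment to reach memory instead of a register.
    private static void Spin(long[] counters, int index, int iterations)
    {
        for (int i = 0; i < iterations; i++)
            Interlocked.Increment(ref counters[index]);
    }

    // Compares neighbouring slots with slots that are 128 bytes apart.
    public static string Compare(int iterations)
    {
        TimeSpan adjacent = Run(new long[32], 0, 1, iterations);
        TimeSpan separated = Run(new long[32], 8, 24, iterations);
        return "adjacent: " + adjacent.TotalMilliseconds + " ms, separated: "
            + separated.TotalMilliseconds + " ms";
    }
}
```

The measured difference depends on the hardware and on how the threads are scheduled, so treat any single run as an indication rather than a result.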

Padding

The easiest way to apply padding is to use StructLayoutAttribute. If LayoutKind.Sequential is used, then adding 4 Guid fields at the beginning and 4 Guid fields at the end should work just fine. The size of a Guid is 16 bytes, which gives us the needed 64 bytes on each side. A harder way of doing it is to use LayoutKind.Explicit, as it requires adding FieldOffsetAttribute to every field of the struct/class, explicitly stating its offset in memory. With this approach, it’s easy to start at offset 64 and leave some space at the end of the class.
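
Both variants can be sketched on a struct holding a single long. The type names are hypothetical, and the sizes are checked with Marshal, which reports the marshaled layout; for blittable structs like these it matches the managed one.

```csharp
using System;
using System.Runtime.InteropServices;

#pragma warning disable 0169 // padding fields are intentionally never used

// Sequential: 4 Guids (64 bytes) before and after the payload.
[StructLayout(LayoutKind.Sequential)]
public struct SequentialPaddedCounter
{
    private Guid _before0, _before1, _before2, _before3;
    public long Value;
    private Guid _after0, _after1, _after2, _after3;
}

// Explicit: the payload starts at offset 64 and the declared size
// leaves another 64 bytes after it (64 + 8 + 64 = 136).
[StructLayout(LayoutKind.Explicit, Size = 136)]
public struct ExplicitPaddedCounter
{
    [FieldOffset(64)]
    public long Value;
}

#pragma warning restore 0169

public static class PaddingLayoutSketch
{
    // Returns "type: offset of Value / total size" for both variants.
    public static string Describe()
    {
        return "sequential: "
            + Marshal.OffsetOf<SequentialPaddedCounter>("Value") + "/"
            + Marshal.SizeOf<SequentialPaddedCounter>()
            + ", explicit: "
            + Marshal.OffsetOf<ExplicitPaddedCounter>("Value") + "/"
            + Marshal.SizeOf<ExplicitPaddedCounter>();
    }
}
```

The explicit variant needs no dummy fields, but every real field added later has to be given its offset by hand.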

Problem

LayoutKind.Sequential works perfectly. Almost. Unfortunately, if any field has a type that is not itself Sequential or Explicit, the CLR will simply ignore the sequential request and silently apply automatic layout, ruining the padding. This is a common case, as all classes use Auto by default. It leaves the developer with the need to apply the field offsets manually.

Solution

As I need this padding behavior for RampUp, I’m creating a small Fody weaver plugin called Padded, which will automatically calculate offsets (possibly with some memory overhead) for any class/struct marked with the proper attribute. Hopefully, it will be useful not only for RampUp but for other performance-oriented projects as well.