Wire improvements


This post sums up my recent involvement in Wire, the serializer being built by the Akka.NET team. It was an interesting OSS experience, and I’m glad that I could improve the performance of this serializer.

Box me not

The initial design of constructing a serializer of a complex object was based on emitting a pair of delegates (serialize, deserialize). Let’s consider deserialization:

  1. Create object
  2. For each field:
    1. Use the following method of a ValueSerializer:

      public abstract object ReadValue(
         Stream stream, DeserializerSession session)
    2. Cast to the field type or unbox if the field is a value type
    3. Load the deserialized object and set the field.

The problem with this approach was that for classes with many primitive properties/fields, the cost of boxing/unboxing might be quite high. On the other hand, one couldn’t call a method returning a common type without boxing value types. How would you approach this? My answer was to move the emitting of points 1 & 2 into the value serializer and provide a generic implementation of this emitting in the base class. That allowed me to customize the emitting for primitive value types like int, long, or Guid, while still preserving the possibly slightly less performant generic approach.
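To make the boxing cost concrete, here is a minimal sketch (hypothetical types, not Wire’s actual code) contrasting an object-returning read, which boxes every primitive field, with a specialized typed read that does not:

```csharp
using System;
using System.IO;

// Hypothetical illustration only: a serializer returning object
// forces a box allocation for every primitive value it reads.
abstract class ValueSerializer
{
    public abstract object ReadValue(Stream stream);
}

class IntSerializer : ValueSerializer
{
    // Boxed path: the int is allocated on the heap just to be unboxed
    // by the caller. (Full-read handling omitted for brevity.)
    public override object ReadValue(Stream stream)
    {
        var buffer = new byte[4];
        stream.Read(buffer, 0, 4);
        return BitConverter.ToInt32(buffer, 0); // boxing happens here
    }

    // Specialized path: same bytes, no heap allocation.
    public int ReadInt32(Stream stream)
    {
        var buffer = new byte[4];
        stream.Read(buffer, 0, 4);
        return BitConverter.ToInt32(buffer, 0);
    }
}
```

A caller of the first path pays `(int)serializer.ReadValue(stream)` per field (box + unbox); emitting a call to the typed path instead keeps primitive fields allocation-free.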

After the change, with the emitting delegated to the ValueSerializer class, it gained two new virtual methods. Their default implementations still use boxing, but they leave much more room for special cases:

public virtual int EmitReadValue(Compiler<ObjectReader> c, 
   int stream, int session, FieldInfo field) 
public virtual void EmitWriteValue(Compiler<ObjectWriter> c, 
   int stream, int fieldValue, int session) 

Call me once

Wire uses the BCL’s Stream abstraction to work with a stream of bytes. When using a stream, conversion to primitive value types sometimes requires a helper byte array to get all the data in one Read call. For instance, a long value is stored as 8 bytes, hence it requires an 8-byte array. To remove allocations during serialization and deserialization, a byte chunk can be obtained by calling the GetBuffer method on the session object. What if your object contains four longs? Should every serializer call this method, or should it be called once with the result stored in a variable?

I took the second approach, and by removing these additional calls to GetBuffer I was able to squeeze a bit more out of Wire.
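A sketch of the idea, with a hypothetical Session type standing in for Wire’s session object: the scratch buffer is fetched once and reused for every primitive read, instead of calling GetBuffer per field:

```csharp
using System;
using System.IO;

// Hypothetical session type; it hands out a reusable scratch array
// so reads don't allocate a fresh byte[] per value.
class Session
{
    private readonly byte[] _buffer = new byte[16];
    public byte[] GetBuffer(int size) => _buffer; // assumes size <= 16
}

static class LongReader
{
    // One GetBuffer call hoisted out of the loop and reused for all
    // four longs, instead of one call per field.
    public static long[] ReadFourLongs(Stream stream, Session session)
    {
        var buffer = session.GetBuffer(8);
        var result = new long[4];
        for (var i = 0; i < 4; i++)
        {
            stream.Read(buffer, 0, 8); // full-read handling omitted for brevity
            result[i] = BitConverter.ToInt64(buffer, 0);
        }
        return result;
    }
}
```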

Summing up

Working on these features was a very pleasant experience. It took me just a few evenings and I was able to make a positive impact. And this is really nice.

.NET volatile write performance degradation in x86


This is a summary of my investigation into writing a fast and well-designed concurrent queue for akka.net, whose performance was drastically low in a 32-bit application. Take a look at the PR here. If you’re interested in writing well-performing, no-alloc applications with mechanical sympathy in mind, or you’re simply interested in good .NET concurrency, this post is for you.


Akka.NET is an actor system implementation for the .NET platform. It has been ported from Java. Recently I spent some time playing with it and reading through the codebase. One of the classes that I looked into was UnboundedMailboxQueue, which uses the general-purpose concurrent queue from the .NET BCL. It looked strange to me: knowing the structure Envelope that is passed through this queue, one could implement a better queue. I did it in this PR, lowering the number of allocations by 10% and speeding up the queue by ~8%. Taking into consideration that queues are the foundation of Akka actors, this result was quite promising. I used the benchmark tests provided with the platform and it looked good. Fortunately, Jeff Cyr ran some tests on x86, providing results that were disturbing. On x86 the new queue was underperforming. Finally I closed the PR without providing this change.

The queue design

The custom queue I provided uses a similar design to the original concurrent queue. The difference was using the Envelope fields (there are two: message & sender) to mark a message as published, instead of using the concurrent queue’s state array. Again, knowing the structure you want to pass to the other side via a concurrent queue was vital for this design. You can’t make a universal general collection. Note ‘general’, not ‘generic’.


To make the change finally visible to a queue’s consumer, Volatile.Write was used. The only difference was the type being written. In the BCL’s concurrent queue it was a bool in an array; in my case it was an object. The two use different overloads of Volatile.Write(ref ….). For the sake of reference, Volatile.Write ensures a release barrier, so if a queue’s consumer reads the status with Volatile.Read (an acquire barrier), it will eventually see the written value.
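A minimal sketch of the publish pattern described above (illustrative types, not the queue’s actual code): Volatile.Write gives the release barrier on the producer side, and Volatile.Read gives the matching acquire barrier on the consumer side:

```csharp
using System.Threading;

// A single-slot publish/consume sketch: everything the producer wrote
// before the Volatile.Write is guaranteed visible to a consumer that
// observes the slot via Volatile.Read.
class Slot
{
    private object _item;

    public void Publish(object item)
    {
        // release barrier: no prior write can be reordered past this store
        Volatile.Write(ref _item, item);
    }

    public object TryConsume()
    {
        // acquire barrier: no later read can be reordered before this load
        return Volatile.Read(ref _item);
    }
}
```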

Some kind of reproduction

To see how .NET performs these operations, I used two types and ran a sample application in both x64 and x86. Let’s take a look at the code first.

struct VolatileInt
{
    int _value;

    public void Write(int value)
    {
        _value = value;
    }

    public void WriteVolatile(int value)
    {
        Volatile.Write(ref _value, value);
    }
}

struct VolatileObject
{
    object _value;

    public void Write(object value)
    {
        _value = value;
    }

    public void WriteVolatile(object value)
    {
        Volatile.Write(ref _value, value);
    }
}

It’s really nothing fancy. These two either write the value ensuring a release fence or just write the value.

Windbg for x86

The methods had been prepared using RuntimeHelpers.PrepareMethod(). A WinDbg instance was attached to the process. I loaded SOS and took a look at the method tables of these two types. Because the methods had been prepared, they were JITted, so I could easily take a look at the JITted assembler. Because x64 was performing well, let’s take a look at x86. Let’s start with the non-object method, VolatileInt.WriteVolatile:

cmp     byte ptr [ecx],al
mov     dword ptr [ecx],edx

Nothing heavy here. Effectively, just a memory move and a return. Let’s take a look at writing the object with VolatileObject.WriteVolatile:

cmp     byte ptr [ecx],al
lea     edx,[ecx]
call    clr!JIT_CheckedWriteBarrierEAX

Wow! Besides moving some data, an additional method is called. The method name is JIT_CheckedWriteBarrierEAX (you can probably guess by now that there is a whole group of JIT_CheckedWriteBarrier methods). What is it and why does it appear only on x86?

CoreCLR to the rescue

Take a look at the following snippet and compare the blocks for x86 and non-x86. What can you see? For x86 there are additional fragments, including the aforementioned JIT_CheckedWriteBarrierEAX. What does it do? Let’s take a look at another piece of CoreCLR here. Let’s not dive into this implementation right now and what is checked during this call; just looking at the first instructions of this method, one can tell that it will cost more than the simple int operation:

cmp edx,dword ptr [clr!g_lowest_address]
jb      clr!JIT_CheckedWriteBarrierEAX+0x35
cmp     edx,dword ptr [clr!g_highest_address]
jae     clr!JIT_CheckedWriteBarrierEAX+0x35
mov     dword ptr [edx],eax
cmp     eax,dword ptr [clr!g_ephemeral_low]
jb      clr!JIT_CheckedWriteBarrierEAX+0x37
cmp     eax,dword ptr [clr!g_ephemeral_high]
jae     clr!JIT_CheckedWriteBarrierEAX+0x37
shr     edx,0Ah

Summing up
If you want to write well-performing code and truly want to support AnyCPU, proper benchmark tests run on different architectures should be provided.
Sometimes, a gain in one will cost you a lot in another. Even though this PR didn’t make it, this was an interesting journey and an extreme learning experience. There’s nothing better than answering a childish ‘why?’ on your own.

Task.WhenAll tests

In the last post I showed some optimizations one can apply to reduce the overhead of creating asynchronous state machines. Let’s dive into the async world again and consider the helper methods provided by the Task class, especially Task.WhenAll.

public static Task WhenAll(params Task[] tasks)

The method works in the following way. It accepts an array of tasks and returns a task that will finish as soon as all of the underlying tasks are finished. Applying this method in some scenarios may provide a performance gain, as one can run a few tasks in parallel. It has a drawback though.

Let’s consider the following code:

public async Task<int> A()
{
    await B1();
    await B2();
    return C();
}

If B1 and B2 could be executed in parallel (for instance, they access Azure Table Storage), this method could be rewritten in the following way.

public async Task<int> A()
{
    await Task.WhenAll(B1(), B2());
    return C();
}

What, besides the mentioned performance improvement, has changed? Method A is no longer one method; there are now two possible executions, chosen effectively at random: one running operations in the order B1, B2, C, and the other B2, B1, C. This means that your previous test coverage no longer holds. If you want to truly test it, you need to provide suites that order these B* calls properly and ensure that all permutations are exercised. Sometimes it matters, sometimes it doesn’t. Let’s consider the following scenario:

  • Two callers are calling A at the same time
  • Every B method removes a specific file, failing if it does not exist
  • At least one of the callers should succeed

In the first version, it was a pure race to be first. The first caller that went through B1, B2, C would execute properly. Now consider the second version of A with two callers executing the following operations in the specified order:

  • Caller 1: B1, B2, C
  • Caller 2: B2, B1, C

As you can see, it’s a typical deadlock scenario and both callers would fail.

As always, there’s no silver bullet, and if you want to use Task.WhenAll to speed up your application by running operations in parallel, you must embrace the fact that execution may be non-linear.

Happy awaiting.

Rise of the IAsyncStateMachines

Whenever you use the async/await pair, the compiler performs a lot of work, creating a class that handles the coordination of code execution. The created (and instantiated) class implements an interface called IAsyncStateMachine and captures all the context needed to move on with the work. Effectively, any async method using await will generate such an object. You may say that creating objects is cheap; then again, I’d say that not creating them at all is even cheaper. Could we skip the creation of such an object while still providing asynchronous execution?

The costs of the async state machine

The first, already mentioned, cost is the allocation of the async state machine. If you take into consideration that for every async call an object is allocated, it can get heavy.

The second part is the influence on the stack frame. If you use async/await you will find that stack traces are much bigger now. The calls to methods of the async state machine are in there as well.

The demise of the IAsyncStateMachines

Let’s consider the following example:

public async Task A()
{
    await B();
}

Or an even more complex example below:

public async Task A()
{
    if (_hasFlag)
        await B1();
    else
        await B2();
}

What can you tell about these A methods? They do not use the results of the Bs. If they don’t, maybe the awaiting could be done at a higher level? Yes, it can. Please take a look at the following example:

public Task A()
{
    if (_hasFlag)
        return B1();
    else
        return B2();
}

This method is still asynchronous, as asynchrony isn’t about using async/await but about returning a Task. Additionally, it does not generate a state machine, which removes all the costs mentioned above.

Happy asynchronous execution.

A pointer to a generic method argument

Let’s consider the following method signature, taken from a RampUp interface.

bool Write<TMessage>(ref Envelope envelope, 
    ref TMessage message, IRingBuffer bufferToWrite) 
    where TMessage : struct;

It’s a fairly simple signature, enabling passing a struct of any size using just a reference to it, without copying it. Now let’s consider the need to obtain a pointer to this message. Taking a pointer could be needed for various reasons: one could be getting fields by offset, another could be using memcpy to copy the value to a given address. Is it possible to get this pointer in C# code?

No pointers for generic parameters

Unfortunately, you can’t do it in C#. If you try to obtain a pointer to a generic parameter, you’ll get a compiler error. If you can’t do it in C#, is there any other .NET language one could use to get it? Yes, there is: the foundation of .NET programs, MSIL itself. And if it’s MSIL, it means emitting code dynamically.

Ref looks like a pointer

What is a reference to a struct? It looks like a pointer to me. What if we could load it and just assume that it is a pointer? Would the CLR accept this program? It turns out that it would. I won’t cover the whole implementation, which can be found here, but I want to accent a few points.

  • The CLR uses the argument with index 0 to pass this. If you want to load a field, you need the following sequence of operations:
    • Ldarg_0 // load this onto the stack
    • Ldfld “Field1” // pops this, loading the value of the field named “Field1” onto the stack
  • For the Write method, getting a pointer to the message is nothing more than a single opcode: Ldarg_2. As the struct is passed by reference, it can be treated as a pointer by the CLR, and it will be.
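The trick can be sketched with a DynamicMethod (a simplified stand-alone example, not RampUp’s actual emitter): the managed ref arriving as an argument is already an address, so conv.u reinterprets it as a native pointer. Note that the resulting pointer is only safe to use while the referent cannot be moved by the GC (here, a stack-allocated local):

```csharp
using System;
using System.Reflection.Emit;

static class RefToPointer
{
    private delegate IntPtr GetPtr(ref int value);

    public static IntPtr GetPointer(ref int value)
    {
        var method = new DynamicMethod(
            "AsPointer", typeof(IntPtr), new[] { typeof(int).MakeByRefType() });
        var il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0); // the managed ref is already an address
        il.Emit(OpCodes.Conv_U);  // reinterpret it as a native-sized integer
        il.Emit(OpCodes.Ret);     // C# would reject this cast; the CLR runs it
        var del = (GetPtr)method.CreateDelegate(typeof(GetPtr));
        return del(ref value);
    }
}
```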

I encourage you to download the RampUp codebase and play a little bit with the emitted implementation of the IMessageWriter. Maybe you’ll never need to take a pointer to a generic method parameter (I did), but it’s a good starting point for learning a little about emitting code.

Using Fody to provide common parts for structs

The RampUp library is meant to provide a low-latency, low/no-alloc environment for building fast systems in .NET. As it’s based on messaging in an actor-like/SEDA fashion, messages are first-class citizens in its environment. Because of these requirements, unlike in other frameworks/systems, they’ve been built on structs. Yes, good old-fashioned value types that have no virtual method tables and no object overhead. They’re just pure data. But even in the world of pure data, sometimes you need a common denominator that provides some basic information. Let me share my RampUp approach to this problem.


In the case of RampUp and its messages, the part that should be attachable to every message is an envelope. You probably want to know the sender of the message and maybe a few more facts. We can’t use inheritance, as structure types cannot derive from one another. How can this be done; how do we introduce at least one common field into all the messages? Having one field of type Envelope would be sufficient, as we could use this field to store all the needed information.


There’s a tool created by Simon Cropp called Fody. It’s an AOP tool, a weaver, a post-compiler. With it you can create ModuleWeavers that are reusable (there are a lot of them) and/or applied only in the solution they were created in. Using this tool, I was able to deliver a weaver that scans for messages in a project and adds a specific envelope field. For each message, metadata is created describing the offset to the Envelope field. Additionally, on the basis of this metadata, a message reader and a writer are emitted so that the final user of RampUp does not need to access this field manually.


Using a post-compiler is often seen as overkill. On the other hand, introducing a common denominator for a set of value types is impossible without either manual copy-paste techniques or weaving it in during a post-compilation step. I prefer the latter.


You’ve probably used the typeof operator a few times. It’s quite funny that an operator like this actually has no single MSIL counterpart. Using it emits TWO opcodes.

The first emitted opcode is OpCodes.Ldtoken with the type. It consumes the token of the type, pushing the RuntimeTypeHandle structure onto the stack as the result of its operation. The second is a call to Type.GetTypeFromHandle(RuntimeTypeHandle), which consumes the structure pushed by the previous opcode and returns the runtime type. The interesting thing is that you can’t use just OpCodes.Ldtoken from C#: you need to load the runtime type first and then access the handle via a property. When emitting IL yourself, though, you can use just OpCodes.Ldtoken to remove the overhead of calling a method and use the structure as a key for a lookup. It will be a bit faster for sure.
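As a sketch, the two opcodes can be emitted by hand with a DynamicMethod, reproducing what the compiler generates for typeof(string):

```csharp
using System;
using System.Reflection.Emit;

static class TypeofEmitter
{
    // Builds a delegate equivalent to: () => typeof(string)
    public static Func<Type> BuildTypeofString()
    {
        var method = new DynamicMethod("TypeofString", typeof(Type), Type.EmptyTypes);
        var il = method.GetILGenerator();
        il.Emit(OpCodes.Ldtoken, typeof(string)); // pushes a RuntimeTypeHandle
        il.Emit(OpCodes.Call,
            typeof(Type).GetMethod(nameof(Type.GetTypeFromHandle))); // handle -> Type
        il.Emit(OpCodes.Ret);
        return (Func<Type>)method.CreateDelegate(typeof(Func<Type>));
    }
}
```

Dropping the second opcode and keying a lookup on the raw RuntimeTypeHandle is the faster variant mentioned above.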

You can see an example of emitting this in the RampUp message writer code.