.NET volatile write performance degradation in x86

TL;DR

This is a summary of my investigation about writing a fast and well designed concurrent queue for akka.net which performance was drastically low for 32bit application. Take a look at PR here. If you’re interested in writing a well performing no-alloc applications with mechanical symapthy in mind or you’re simply interested in good .NET concurrency this post is for you.

Akka.NET

Akka.NET is an actor system’s implementation for .NET platform. It has been ported from Java. Recently I spent some time playing with it and reading through the codebase. One of the classes that I took a look into was UnboundedMailboxQueue using a general purpose .NET BCL’s concurrent queue. It looked strange to me, as knowing a structure Envelope that is passed through this queue one could implement a better queue. I did it in this PR lowering number of allocations by 10% and speeding up the queue by ~8%. Taking into consideration that queues are foundations of akka actors, this result was quite promising. I used the benchmark tests provided with the platform and it looked good. Fortunately Jeff Cyr run some tests on x86 providing results that were disturbing. On x86 the new queue was underperforming. Finally I closed the PR without providing this change.

The queue design

The custom queue provided by use a similar design to the original concurrent queue. The difference was using Envelope fields (there are two: message & sender) to mark message as published without using the concurrent queue state array. Again, knowing the structure you want to passed to the other side via a concurrent queue was vital for this design. You can’t make a universal general collection. Note ‘general’, not ‘generic’.

Volatile

To make the change finally visible to a queue’s consumer, Volatile.Write was used. The only difference was the type being written. In the BCL’s concurrent queue that was bool in an array. In my case it was an object. Both used different overloads of Volatile.Write(ref ….). For sake of reference, Volatile.Write ensures release barrier so if a queue’s consumer reads status with Volatile.Read (the aquire barrier), it will finally see the written value.

Some kind of reproduction

To know how .net is performing this operations I’ve used two types and run a sample application with x64 and x86. Let’s take a look at the code first.

struct VolatileInt
{
int _value;

public void Write(int value)
{
_value = value;
}

public void WriteVolatile(int value)
{
Volatile.Write(ref _value, value);
}
}

struct VolatileObject
{
object _value;

public void Write(object value)
{
_value = value;
}

public void WriteVolatile(object value)
{
Volatile.Write(ref _value, value);
}
}

It’s really nothing fancy. These two either write the value ensuring release fence or just write the value.

Windbg for x86

The methods had been prepared using RuntimeHelpers.PrepareMethod(). A Windbg instance was attached to the process. I loaded sos clr and took a look at method tables of these two types. Because methods had been prepared, they were jitted so I could easily take a look at the jitted assembler. Because x64 was performing well, let’s take a look at x86. At the beginning let’s check the non-object method, VolatileInt.VolatileWrite

cmp     byte ptr [ecx],al
mov     dword ptr [ecx],edx
ret

Nothing heavy here. Effectively, just move a memory and return. Let’s take a look at writing the object with VolatileObject.VolatileWrite

cmp     byte ptr [ecx],al
lea     edx,[ecx]
call    clr!JIT_CheckedWriteBarrierEAX

Wow! Beside moving some data an additional method is called. The method name is JIT_CheckedWriteBarrierEAX (you probably this now that there may be a group of JIT_CheckedWriteBarrier methods). What is it and why does it appear only in x86?

CoreCLR to the rescue

Take a look at the following snippet and compare blocks for x86 and non-x86? What can you see? For x86 there are additional fragments, including the one mentioned before JIT_CheckedWriteBarrierEAX. What does it do? Let’s take a look at another piece of CoreCLR here. Let’s not dive into this implementation right now and what is checked during this call, but just taking a look at first instructions of this method one can tell that it’ll cost more than the simple int operation

cmp edx,dword ptr [clr!g_lowest_address]
jb      clr!JIT_CheckedWriteBarrierEAX+0x35
cmp     edx,dword ptr [clr!g_highest_address]
jae     clr!JIT_CheckedWriteBarrierEAX+0x35
mov     dword ptr [edx],eax
cmp     eax,dword ptr [clr!g_ephemeral_low]
jb      clr!JIT_CheckedWriteBarrierEAX+0x37
cmp     eax,dword ptr [clr!g_ephemeral_high]
jae     clr!JIT_CheckedWriteBarrierEAX+0x37
shr     edx,0Ah

Summing up
If you want to write well performing code and truly want to support AnyCPU, a proper benchmark tests run with different architectures should be provided.
Sometimes, a gain in one will cost you a lot in another. Even if for now this PR didn’t make it, this was an interesting journey and an extreme learning experience. There’s nothing better that to answer a childish ‘why?’ on your own.

Relaxed Optimistic Concurrency

TL;DR

When using the optimistic concurrency approach for entities that are updated frequently, some of the actions may fail because of the conflicting version numbers. A proper modelling technique distilling if business requirements can be loosened may greatly increase the chances of succeeding with commands issued against these entities improving overall performance of an application and a lowering a probability of errors.

Optimistic Concurrency

The optimistic concurrency is an approach for ensuring non overlapping updates over a given entity. It’s supported by the majority of heavy ORMs and applied simply by adding a conditional where at the end of the update. For example

UPDATE Orders
-- more updated columns
SET version = @version + 1
WHERE id = @id AND version = @version

This approach ensures, that if any other operation updated the entity in the meantime, this update will fail. Additionally, if an ORM is capable of counting rows that should have been updated, like NHibernate does, it can abort a transaction and throw an exception informing that some of the operations that were planned to be executed failed.

The optimistic concurrency approach is not a unique SQL technique. It’s popular in many NoSql databases like Azure Table Storage for example. When updating an entity, its ETag is added as the If-Match header, ensuring, that if the entity was modified after retrieval and updated, the operation, again, will fail. See Update operation documentation here.

Finally, when applying Domain Driven Design and operating on an Aggregate Root, this technique is the easiest one to ensure, that the aggregate root is truly a transaction boundary. If the root has its version updated with every change of the aggregate, then two concurrent operations cannot be executed and one will fail, still, preserving the root as a transaction boundary. This applies to aggregate roots, no matter if you immerse them into Event Sourcing or a regular ORM mapped graph of entities. Just update the root with every operation and your aggregate will be just fine.

As it’s been shown above, optimistic concurrency is a simple and powerful tool that in a world of NoSql and transactional-boundaries-got-right may be the only one to ensure atomicity of operations.

Limitations

When using optimistic concurrency, the flow of applying a change is a bit different. Instead of just updating a property, or a value, the following approach is taken

  1. An aggregate is retrieved with its version
  2. If the state allows it, a command is executed
  3. The aggregates’ state is updated conditionally (if the version is unchanged)

Again, this ensures that the updated is applied on the version that a business logic operated onto, but limits the concurrent access.

For services using Event Sourcing, instead of retrieving entity all of the events are retrieved and a state of an aggregate is rebuilt. If snapshots are used, only events with versions bigger than a snapshot must be retrieved. If the snapshot is preserved in a in memory cache, then possibly, no events will be retrieved if the snapshot’s version is equal to the number of aggregate’s events so far. Events that are a result of a command are appended to the store conditionally. Depending on the storage it can be the stream version when using EventStore AppendToStreamAsync or update of a root markup entity when using a custom relational store.

An example

Let’s consider an example of a GitHub-like issue. Every issue has an option of locking it. It can be used for instance to lock an issue created by a troll (you don’t feed the troll) and disallow adding more comments. For sake of argument:

  1. let’s model all comments as a part of the issue aggregate (as always, there are many models that can be applied)
  2. optimistic concurrency is used for all commands.

A business requirement for locking an issue could look like:

when an issue is locked no user should be able to add more comments

It’s quite common, that when seeing a requirement like this, developers don’t ask questions. It’s even more unfortunate, that some companies require to just follow the analysis. Let’s try to relax this requirement a little bit by asking some questions:

  1. Is it required to lock the issue immediately?
  2. Could an issue be considered locked after some short period of time (less than 1s) after locking it?
  3. Could we allow adding some comments during this period?

If the answers point towards no need of an immediate lock, there’s a space to handle locking in a relaxed manner

Relaxed Optimistic Concurrency

If an operation can have its preconditions relaxed and can be performed after achieving some state it can be executed with much less friction. In the previous example, the state when a user can add a comment is a created issue. The precondition is a non-locked issue, but it’s ok to add a comment to a locked issue within some time boundaries. Consider the following flow

  1. An aggregate is retrieved with its version
  2. If the state allows it, a command is executed
  3. The aggregates’ state is updated conditionally (if the version is unchanged) appending the change unconditionally

Depending on the storage and the applied design in can be done in many ways.

When using Event Sourcing with EventStore a special version can be passed to the appending method which represents any version. This appends events unconditionally. This means that a locking operation and adding a comment can be done in parallel without conflicts!

When using a relational database, an issue entity can be retrieved to check it’s state. Next, a comment entity can be added separately, without updating the version of the issue itself. Again, because adding a comment does not change the version, the friction on the aggregate is lowered.

Summing up

Don’t take requirements for granted, but rather ask for the reasoning behind them. Try to relax requirements for areas which may suffer from the high contention. The model is just a model. There are no true or false models but these which help you or make your work harder. Choose wisely:)

RampUp journey ends

My RampUp project has been an important journey to me. I verified my knowledge about modern hardware and dealt with interesting cases writing a really low level code in .NET/C#. Finally, writing code that does not allocate is demanding but doable. This journey ends today.

RampUp has been greatly influenced by Aeron, a tool for publishing your messages really fast. You can easily saturate the bandwidth and get very low service time. Fortunately, for .NET ecosystem, Adaptive Consulting has just opened an Aeron client port here. This is unfortunate for RampUp as this port covers ~70-80% of RampUp implementation. Going on with RampUp would mean, that I need to chase Aeron and compete somehow with Aeron.NET to get an .NET environment attention, and in result, participation (PRs. issues, actual use cases). This is what I don’t want to do.

RampUp source code won’t be removed as still, I find it interesting and it shows some really interesting low level cases of CLR. I hope you enjoyed the ride. Time to move on:)

Task.WhenAll tests

In the last post I’ve shown some optimizations one can apply to reduce the overhead on creating asynchronous state machines. Let’s dive into the async world again and consider helper methods provided by the Task class, especially Task.WhenAll.

public static Task WhenAll(params Task[] tasks)

The method works in a following way. It accepts an array of tasks and returns a task that will finish as soon as all of the underlying tasks are finished. Applying this method in some scenarios may provide an improvement gain, as one can run a few tasks in parallel. It has a drawback though.

Let’s consider following code

public Task<int> A()
{
  await B1();
  await B2();
  return C ();
}

If b1 and b2 could be executed in parallel (for instance, they access Azure Table Storage), this method could be rewritten in a following way.


public Task<int> A()
{
  await Task.WhenAll(B1(), B2());
  return C ();
}

What, beside the mentioned performance improvement, changed? Now, method A is no longer a method A. There are two methods which can be randomly executed. One running operations in the following order: B1, B2, C, and the other: B2, B1, C. This effectively means, that your previous test coverage is no longer true. If you want to truly test it, you need to provide suites that will order these B* calls properly and ensure that all permutations will be emitted and tested. Sometimes it has no meaning, sometimes it has. Let’s consider a following scenario:

  • Two callers are calling A in the same time
  • Every B method removes a specific file, failing if it does not exist
  • At least one of the callers should succeed

In the first version, it was a pure race for being the first. The first that goes through B1, B2, C would execute properly. Now consider the second version of A with two callers executing following operations in the specified order:

  • Caller 1: B1, B2, C
  • Caller 2: B2, B1, C

As you can see, it’s a typical deadlock scenario and both callers would fail.

As always, there’s no silver bullet and if you want to use Task.WhenAll to speed up your application by running operations in parallel, you must embrace the fact of a possibly non linear execution.

Happy awaiting.

Rise of the IAsyncStateMachines

Whenever you use async/await pair, the compiler performs a lot of work creating a class that handles the coordination of code execution. The created (and instantiated) class implements an interface called IAsyncStateMachine and captures all the needed context to move on with the work. Effectively, any async method using await will generate such an object. You may say that creating objects is cheap, then again I’d say that not creating them at all is even cheaper. Could we skip creation of such an object still providing an asynchronous execution?

The costs of the async state machine

The first, already mentioned, cost is the allocation of the async state machine. If you take into consideration, that for every async call an object is allocated, then it can get heavy.

The second part is the influence on the stack frame. If you use async/await you will find that stack traces are much bigger now. The calls to methods of the async state machine are in there as well.

The demise of the IAsyncStateMachines

Let’s consider a following example:

public async Task A()
{
    await B ();
}

Or even more complex example below:

public async Task A()
{
    if (_hasFlag)
    {
        await B1 ();
    }
    else
    {
        await B2 ();
    }
}

What can you tell about these A methods? They do not use the result of  the Bs. If they do not use it, maybe awaiting could be done on a higher level? Yes it can. Please take a look at the following example:

public Task A()
{
    if (_hasFlag)
    {
        return B1 (); 
    }
    else
    {
        return B2 ();
    }
}

This method is still asynchronous, as asynchronous isn’t about using async await but about returning a Task. Additionally, it does not generate a state machine, which lowers all the costs mentioned above.

Happy asynchronous execution.

Build Tour 2016 in Warsaw

Probably every .NET developer watched at least a few presentations from the Build conference. Happily for Polish .NET developers, Warsaw has been selected as one of the cities where Build speakers came to share their exciting news about Microsoft, Windows and .NET.

The conference took place in Expo XXI. The very first thing were containers with Build logo on them. I must admit that this was a bold way of showing where the conference takes place and much better than just a few stands. The venue and the scene were amazing. It was a copy of the Build scene so you could feel the atmosphere. Every organizational aspect of it was well thought through.

The key topics were Universal Windows Platform apps, Web Apps and IoT. Even if I don’t work with these on daily basis the level of talks, as well as the way they were being presented, was balanced enough to actually show and share some meaningful information. The top three for me were:

  1. The packing of Universal Windows Platform apps. Running an installer in a recording container to be able to apply all the operations on the system later on.
  2. Manifoldjs for building hosted we apps.
  3. Enhanced Reality provided with the Unity engine.

It was quite sad that people started leaving quite early. Fortunately, a solid group (including me:P) stayed till the very end when a networking took place. I’m looking forward to seeing Build 2017 again. Of course, in Warsaw.

Views’ warm up for Event Sourcing

When using Event Sourcing as a foundation for your solution, the command part is a solved problem. Just take an aggregate version, a command, apply onto a state and try append created events to a store, checking the version again. There is a read part of this as well, called views, which is nothing more than an aggregation of a subset of events from the system. This works like a live query, which consumes events from a log and applies them on the projection on and on. Considering that the number of events is constantly growing, how would you deploy a new version of application containing a new view which needs to be build from the beginning, from the very first event? Even with a well performing database applying a few millions of events can take a while.

Warm up routine

Let’s consider a following routine. Instead of calling views by their names, a system version is appended. Take a view ‘users’ as an example

  1. for version 1.0.0 it’s “users-1_0_0”
  2. for version 1.2.0 it’s “users-1_2_0”

Before publishing the new version and moving all users to use it, predeploy the application and run its view builder. The views will be rebuild in the background, taking needed time. Once the builder starts to have problems with getting last events as there’s no more data, the views are prebuilt and a new version can be deployed.

Cost and optimization

Of course these views rebuild can be tedious and long. These could increase costs of your app as well as put an additional pressure on the event store. If a cost/performance optimization is needed, you can consider detection of a view change, something very similar to what Rinat did a few years back. You may come up with something more explicit as well. For any mechanism, the rule would be that if a view is the same, your app uses the last existing version of the view.

Providing this warm up routine, especially when using blue green deployments can improve not only the performance of your application start (the new version scenario) but also can provide an environment for testing the deployment before switching to the new version.