.NET volatile write performance degradation in x86

TL;DR

This is a summary of my investigation about writing a fast and well designed concurrent queue for akka.net which performance was drastically low for 32bit application. Take a look at PR here. If you’re interested in writing a well performing no-alloc applications with mechanical symapthy in mind or you’re simply interested in good .NET concurrency this post is for you.

Akka.NET

Akka.NET is an actor system’s implementation for .NET platform. It has been ported from Java. Recently I spent some time playing with it and reading through the codebase. One of the classes that I took a look into was UnboundedMailboxQueue using a general purpose .NET BCL’s concurrent queue. It looked strange to me, as knowing a structure Envelope that is passed through this queue one could implement a better queue. I did it in this PR lowering number of allocations by 10% and speeding up the queue by ~8%. Taking into consideration that queues are foundations of akka actors, this result was quite promising. I used the benchmark tests provided with the platform and it looked good. Fortunately Jeff Cyr run some tests on x86 providing results that were disturbing. On x86 the new queue was underperforming. Finally I closed the PR without providing this change.

The queue design

The custom queue provided by use a similar design to the original concurrent queue. The difference was using Envelope fields (there are two: message & sender) to mark message as published without using the concurrent queue state array. Again, knowing the structure you want to passed to the other side via a concurrent queue was vital for this design. You can’t make a universal general collection. Note ‘general’, not ‘generic’.

Volatile

To make the change finally visible to a queue’s consumer, Volatile.Write was used. The only difference was the type being written. In the BCL’s concurrent queue that was bool in an array. In my case it was an object. Both used different overloads of Volatile.Write(ref ….). For sake of reference, Volatile.Write ensures release barrier so if a queue’s consumer reads status with Volatile.Read (the aquire barrier), it will finally see the written value.

Some kind of reproduction

To know how .net is performing this operations I’ve used two types and run a sample application with x64 and x86. Let’s take a look at the code first.

struct VolatileInt
{
int _value;

public void Write(int value)
{
_value = value;
}

public void WriteVolatile(int value)
{
Volatile.Write(ref _value, value);
}
}

struct VolatileObject
{
object _value;

public void Write(object value)
{
_value = value;
}

public void WriteVolatile(object value)
{
Volatile.Write(ref _value, value);
}
}

It’s really nothing fancy. These two either write the value ensuring release fence or just write the value.

Windbg for x86

The methods had been prepared using RuntimeHelpers.PrepareMethod(). A Windbg instance was attached to the process. I loaded sos clr and took a look at method tables of these two types. Because methods had been prepared, they were jitted so I could easily take a look at the jitted assembler. Because x64 was performing well, let’s take a look at x86. At the beginning let’s check the non-object method, VolatileInt.VolatileWrite

cmp     byte ptr [ecx],al
mov     dword ptr [ecx],edx
ret

Nothing heavy here. Effectively, just move a memory and return. Let’s take a look at writing the object with VolatileObject.VolatileWrite

cmp     byte ptr [ecx],al
lea     edx,[ecx]
call    clr!JIT_CheckedWriteBarrierEAX

Wow! Beside moving some data an additional method is called. The method name is JIT_CheckedWriteBarrierEAX (you probably this now that there may be a group of JIT_CheckedWriteBarrier methods). What is it and why does it appear only in x86?

CoreCLR to the rescue

Take a look at the following snippet and compare blocks for x86 and non-x86? What can you see? For x86 there are additional fragments, including the one mentioned before JIT_CheckedWriteBarrierEAX. What does it do? Let’s take a look at another piece of CoreCLR here. Let’s not dive into this implementation right now and what is checked during this call, but just taking a look at first instructions of this method one can tell that it’ll cost more than the simple int operation

cmp edx,dword ptr [clr!g_lowest_address]
jb      clr!JIT_CheckedWriteBarrierEAX+0x35
cmp     edx,dword ptr [clr!g_highest_address]
jae     clr!JIT_CheckedWriteBarrierEAX+0x35
mov     dword ptr [edx],eax
cmp     eax,dword ptr [clr!g_ephemeral_low]
jb      clr!JIT_CheckedWriteBarrierEAX+0x37
cmp     eax,dword ptr [clr!g_ephemeral_high]
jae     clr!JIT_CheckedWriteBarrierEAX+0x37
shr     edx,0Ah

Summing up
If you want to write well performing code and truly want to support AnyCPU, a proper benchmark tests run with different architectures should be provided.
Sometimes, a gain in one will cost you a lot in another. Even if for now this PR didn’t make it, this was an interesting journey and an extreme learning experience. There’s nothing better that to answer a childish ‘why?’ on your own.

Task.WhenAll tests

In the last post I’ve shown some optimizations one can apply to reduce the overhead on creating asynchronous state machines. Let’s dive into the async world again and consider helper methods provided by the Task class, especially Task.WhenAll.

public static Task WhenAll(params Task[] tasks)

The method works in a following way. It accepts an array of tasks and returns a task that will finish as soon as all of the underlying tasks are finished. Applying this method in some scenarios may provide an improvement gain, as one can run a few tasks in parallel. It has a drawback though.

Let’s consider following code

public Task<int> A()
{
  await B1();
  await B2();
  return C ();
}

If b1 and b2 could be executed in parallel (for instance, they access Azure Table Storage), this method could be rewritten in a following way.


public Task<int> A()
{
  await Task.WhenAll(B1(), B2());
  return C ();
}

What, beside the mentioned performance improvement, changed? Now, method A is no longer a method A. There are two methods which can be randomly executed. One running operations in the following order: B1, B2, C, and the other: B2, B1, C. This effectively means, that your previous test coverage is no longer true. If you want to truly test it, you need to provide suites that will order these B* calls properly and ensure that all permutations will be emitted and tested. Sometimes it has no meaning, sometimes it has. Let’s consider a following scenario:

  • Two callers are calling A in the same time
  • Every B method removes a specific file, failing if it does not exist
  • At least one of the callers should succeed

In the first version, it was a pure race for being the first. The first that goes through B1, B2, C would execute properly. Now consider the second version of A with two callers executing following operations in the specified order:

  • Caller 1: B1, B2, C
  • Caller 2: B2, B1, C

As you can see, it’s a typical deadlock scenario and both callers would fail.

As always, there’s no silver bullet and if you want to use Task.WhenAll to speed up your application by running operations in parallel, you must embrace the fact of a possibly non linear execution.

Happy awaiting.

Size of Nullable

What’s the size of nullable structure in C#? Does it depends on anything? How fields are held in the memory. Let’s consider following cases: bool?, int?, long?, Guid? One way to get the structure size is use of Marshal.SizeOf(Type). Unfortunately this method performs a check whether the passed type is generic. Every nullable is generic, hence we cannot use it. If you take a look at the implementation of this method, there is a call to the private method of the Marshal class, named SizeOfHelper. This method does not perform a check and can be easily used to calculate the size of the struct.

Nullable consists of two fields. The first is hasValue which answers to the question if the nullable has a value assigned. The other is the value. If a nullable has no value assigned, the value field will held default(T) value. How this members are aligned in the memory, does it depend on anything? What is the offset from the start of the structure of these specific fields?

To answer the two questions above (size and alignment) please take a look at the following table:

Type Size hasValue offset value offset
bool? 8 0 4
int? 8 0 4
long? 16 0 8
Guid? 20 0 4

The first two: bool? and int? are easy to come up with. Bool is equal to int, it takes 4 bytes to store one, so the offset of the value is 4.

What about long? Why does it take 16 bytes, not 12? Why the value starts at 8, not 4? That’s because of the struct alignment, which CLR performs to enable nice packing up the given struct. In other way, the struct is aligned to the length of the 64bit CPU registries.

The final example with Guid breaks the rule for long. Or maybe not? The struct size is multiplication of 8 bytes, so it’s totally ok for CLR to use 24 bytes as it is already aligned.

If you want to do some checks on your own, you may use the gist I created.

A CRUD action provider

As it was described in the previous entry, the idea of the action providers is to allow to gather cross-cutting concerns into more generic (not in the C# terms, but functionalities/features) bag than a derivation from a generic controller. To provide a working example, a standard CRUD operations can be taken into the consideration. Let’s take a look into another gist.
I do hate static code, but for sake of a simplicity all the CRUD operations were moved to a static class, CrudActions. All of them are pretty simple, as they use an entity model binder introduced in the former blog entries. For instance, when an entity edit result is posted to the server, a model binder will take care of updating the entity, so the entity being passed to the [HttpPost]Edit will be already altered with the changes made by a user.
The very second step was discovering all of the CrudActions. It is made by a DetachedActionProvider which caches information in a ReadWriteCache taken from the MVC internals. To close the open generic of CrudActions, an entity type with an id’s type is needed. The entity type information is retrieved from a IEntityController, which is a simple markup convention (you can come up with a name convention, or delegate it somewhere else), for attaching to a controller a handled entity type. The id metadata are gathered from the NHibernate’s ISessionFactory. Once the data are retrieved, the CrudActions generic class definition can be turned into a closed generic and the specific method can be found.
Wait a second… how do we transform a MethodInfo into an ActionDescriptor? First, the method infos are wrapped with an ActionInfo instance (it caches attributes as well as the method info itself). Next, the ActionInfos, those who applies (HttpGet, HttpPost, etc.), are passed to the ReflectedMvc to create a ReflectedActionDescriptor.
At the end I left some bitterness about MVC internalization policy. I must admit, that I don’t understand internalizing types or constructors or any members which could be helpful for future app development. That case applies to the ReflectedActionDescriptor ctor, which (the public one) throws an exception if the method passed to it is a static method. The other, non-checking one, is of course internalized. I’m asking: why? The internalization should be used as a last resort, no as a ‘hide it, because I cannot predict if someone doesn’t want to crush MVC in his project’ rule.

Hope you like the action provider extension point and will use it for cross-cutting, ‘I can compose, not only derive from a generic class’ actions in your controllers.

An action provider, other than controller

Yep, that’s another post in a series which will finally bring so-called detached actions to live. Today I’ll introduce the concept of action providers, something, which in my opinion is missing in the ASP MVC.

The standard pipeline of handling a request in ASP MVC starts with creating a controller with implementation of IControllerFactory. The default one, uses simply a controller constructor, passing none to it. Then controller is returned and its Controller.ActionInvoker is queried for a method/action matching the request. If none is found, an exception is thrown. After reading this paragraph it must be clear, that adding an action to a controller can be done also on the action invoker level.

The IActionInvoker interface provides not much space to move, since it has only one, bool method. If we take a look into the default implementation, ControllerActionInvoker, we can find an interesting method called: FindAction. The method returns something called ActionDescriptor and in the default implementation scans only the controller type in the search of the actions. But we can describe some static methods and return them during the search. To address it, I introduced an IActionProvider interface, which can be (almost) easily implemented.

To get the whole idea, take a look into GIST: https://gist.github.com/989733

As always, all needed types should be registered in the Windsor Container and resolved with it (in the current example, the only need which is to be resolved is IControllerFactory, or nothing if you register service location with the new feature from MVC 3).

Hope you like it. In the very next post, the final implementation of one of IActionProviders.

NHibernate EntityAwareModelBinder for ASP MVC

It’s time to provide some more features before getting deeper into detached actions for ASP MVC. The very first requirement is to have a nice model binder, which not only uses a container to resolve all the parameter types (my previous model binder), but also is intelligent enough to deal with an ORM of your choice. And mine is NHibernate.

Before getting the final snippet, a few assumptions has to be made:

  • No hierarchies are bound automatically. For sake of simplicity, we’re considering only simple entities with no derivations.
  • All the collections, which reflect one-to-many relationships, are bound in the inverse way. The one ending has its collection marked as inversed (there is a drop-down for the many ending, where the parent can be selected)

To provide you with a better way of viewing the code, I pushed the file to the gisthub and it can be located under: https://gist.github.com/984310. It’s good time to take a look into it.

As you can see, the binder does some pretty nice things:

  • It still uses container as the fallback for all the models. Using Windsor you’d better be sure to have all of them registered
  • If entity has its id passed, it uses session to get it; otherwise a new instance is created. It’s up to you to save it in your session

Maybe it addresses a few concerns, like handling entities and non entities in one binder, but it’d be easier to get the whole idea having one straight class without the whole projects.

This is my default model binder in my current project so far and it provides strong base for the detached actions mechanism, which will be covered in the very next entry.

Thoughts?

How to guide users of your API

After winter holidays it was time to get back to the reality and to make up for the pleanty of unread posts in my reader. One of the read posts was Christopher Bennage’s entry about RavenDB API. The main problem with API was that it allows calling every Linq extension method on the Raven session. Even, the unwanted one – ToList. Christoper’s proposal is to provide an obsolete ToList method, which accepts the Raven query interface, and by being obsolete informs about the best way of getting a list of objects.

Simple, nice, powerful.