Async programming model

TL;DR

This is a follow-up post to Async pump for better throughput in Azure. Please read that one first before moving forward.

Feedback

I’ve been given a lot of feedback about my Async pump post. In a few cases this blog post from Ayende was quoted, as it describes exactly the same approach. You can read that post, but more meaningful are the comments provided by Kelly Sommers and Clemens Vasters.

The model

The await statement has simple semantics. It splits your code and schedules the part that follows as a task continuation. This heavy lifting is done at the C# compiler level, so you don’t have to worry about it. The model of this language extension is simple: define a task and a continuation. Nothing more, nothing less.
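
Conceptually (ignoring SynchronizationContext capture and other scheduling details), an await boils down to exactly that pair: a task plus a continuation. A hedged sketch with illustrative names:

using System.IO;
using System.Threading.Tasks;

class AwaitVsContinuation
{
    // The awaiting version: the compiler splits the method at the await.
    static async Task<int> ReadLengthAsync(Stream stream, byte[] buffer)
    {
        var read = await stream.ReadAsync(buffer, 0, buffer.Length);
        return read * 2; // the rest of the method becomes the continuation
    }

    // Roughly what that boils down to: a task and an explicitly scheduled continuation.
    static Task<int> ReadLengthWithContinuation(Stream stream, byte[] buffer)
    {
        return stream.ReadAsync(buffer, 0, buffer.Length)
                     .ContinueWith(t => t.Result * 2);
    }
}

With my approach, that model was bent into a “trick”: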

That’s the premise of the “trick” that is allegedly achieving parallel execution of I/O and compute work here. That is, however, not the purpose of the asynchronous programming model and of the Windows IO completion port (IOCP) model. The point of IOCP is to efficiently offload IO work from user code to kernel and driver and hardware and not to bother the user code until the IO work is done

by Clemens Vasters

What it basically says is that once you await an IO operation, the code that runs afterwards is scheduled on an IO thread

As IO typically takes very long and compute work is comparatively cheap, the goal of the IO system is to keep the thread count low (ideally one per core) and schedule all callbacks (and thus execution of interleaved user code) on that one thread. That means that, ideally, all work gets serialized and there is minimal context switching as the OS scheduler owns the thread. The point of this model is ALREADY to keep the CPU nicely busy with parallel work while there is pending IO.

by Clemens Vasters

So with your code awaiting some IO: you issue the IO call, then the code after the await is executed on the IO thread, as it’s assumed to be lightweight; another IO occurs, and again the same thread dispatches the callback. With the async pump I proposed comes a danger: when we follow this partially awaitless approach, the continuation code is

not on a .NET pool thread, but an IO pool thread. The strategy above sits on that IO pool thread past the point where you are supposed to return it (which is the next IO call) and thus you might force the IO pool to grow

by Clemens Vasters

Summary

Measure first. Think. Then think again. The “trick” worked in my case, improving performance for the initialization of one app (I needed it at the beginning). Does it work in every case and should it be used in general? I’d say no. It’s good to follow the programming model. If you don’t want to, you must have strong reasons for it.

Async pump for better throughput in Azure

This post is followed up by 
https://blog.scooletz.com/2017/02/20/async-programming-model

TL;DR

Introducing async-await has changed a lot. Now, with some help from the compiler, we’re able to squeeze more throughput out of our machines, which may lower costs. In this blog post we’ll push the boundaries even further by questioning the need to immediately await a task.

Background

The story behind this pattern is simple. I’m using a part of the Azure Storage Services, the page blob. It provides a storage API targeted at random IO and page-aligned reads and writes. It’s a perfect solution for emulating disk IO (it’s used for VMs’ disks), but it can also be used when you want the ability to write to a file in Azure at a specific index. You can just read a page, modify it and write it back. If you’re interested in this topic, take a look at my talks; maybe we’ll meet during my presentation Keep Its Storage Simple Stupid.

Synchronous

It’s obvious that for reading from and writing to Storage Services you want to use async-ified code. Using blocking calls like the one below freezes a thread, which, considering the money you already pay, is not the best option. Remember, it’s the cloud, and regular Storage Services are backed by HDD disks. It might take a while. Still, let’s take a look at the sync version first. We’ll operate on a stream that has been opened from the blob.

[Image in the original post: the synchronous read method]

The method above reads the buffer from a stream. It dispatches read after read until the buffer is filled.
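
The code in the original post is embedded as an image; a minimal sketch of such a synchronous read loop (with illustrative names) might look like this:

using System.IO;

static class SyncReader
{
    // Reads from the stream until the whole buffer is filled (or the stream ends).
    public static int ReadBuffer(Stream stream, byte[] buffer)
    {
        var read = 0;
        while (read < buffer.Length)
        {
            var chunk = stream.Read(buffer, read, buffer.Length - read);
            if (chunk == 0)
                break; // end of stream
            read += chunk;
        }
        return read;
    }
}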

Asynchronous

The async-ified version of this reader is not that different. It just uses async Task in its signature and awaits the Read. We have no blocking calls, leaving some spare CPU cycles for other operations.

[Image in the original post: the asynchronous read method]
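
Again, the original shows this as an image; an async-ified sketch along the same lines:

using System.IO;
using System.Threading.Tasks;

static class AsyncReader
{
    // Same loop, but each read is awaited, so no thread is blocked on IO.
    public static async Task<int> ReadBufferAsync(Stream stream, byte[] buffer)
    {
        var read = 0;
        while (read < buffer.Length)
        {
            var chunk = await stream.ReadAsync(buffer, read, buffer.Length - read);
            if (chunk == 0)
                break; // end of stream
            read += chunk;
        }
        return read;
    }
}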

Asynchronous pump

For the last attempt, we need to ask: what do we read this buffer for? In my case, it’s for scanning over its content. I need to read a page blob from a given position and scan/deserialize it in C# code. I do not want to preserve the buffer. As soon as I read it, I can move on. It’s just about reading a log, nothing more. The second property is: entries in this log are aligned to pages, so they are well aligned for reading. Can we modify the reading part then?

We could think of using two buffers/streams. Schedule a read in the first, then in a loop:

  1. schedule a read in the second
  2. await on the first
  3. swap first and second
  4. go to 1

If we used this algorithm, we’d have a higher probability of one operation having already ended and being ready to be dispatched. This means that our algorithm could possibly work on prefetched data without any interruptions, having the data ready when it needs it. For the sake of simplicity, the buffer array and ReadBuffer were enclosed in a simple helper class called Buffer.

[Image in the original post: the asynchronous pump method]
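
The original pump code is an image as well; here’s a hedged sketch of the two-buffer idea. The Buffer helper and the consume callback are illustrative, and it assumes the underlying stream(s) tolerate a read being scheduled before the previous one is awaited (the original uses two buffers/streams).

using System;
using System.IO;
using System.Threading.Tasks;

// Illustrative sketch of the two-buffer pump; not the original code.
sealed class Buffer
{
    public readonly byte[] Bytes;
    public Task<int> ReadTask;

    public Buffer(int size)
    {
        Bytes = new byte[size];
    }
}

static class AsyncPump
{
    public static async Task PumpAsync(Stream stream, int bufferSize, Action<byte[], int> consume)
    {
        var first = new Buffer(bufferSize);
        var second = new Buffer(bufferSize);

        // Schedule the very first read up front.
        first.ReadTask = stream.ReadAsync(first.Bytes, 0, bufferSize);

        while (true)
        {
            // 1. schedule a read into the second buffer
            second.ReadTask = stream.ReadAsync(second.Bytes, 0, bufferSize);

            // 2. await the first one; with some luck it has already completed
            var read = await first.ReadTask;
            if (read == 0)
                break; // end of stream

            consume(first.Bytes, read);

            // 3. swap first and second, 4. go to 1
            var tmp = first;
            first = second;
            second = tmp;
        }
    }
}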

Summary

Having something ready to be awaited does not mean that you should await it immediately. Using this two-buffer approach can increase the throughput of your algorithm by ensuring that data are fetched before they are needed. Now it’s time for you to search for some pumping potential in your code!

.NET volatile write performance degradation in x86

TL;DR

This is a summary of my investigation into writing a fast and well designed concurrent queue for akka.net, whose performance turned out to be drastically low for 32-bit applications. Take a look at the PR here. If you’re interested in writing well performing, no-alloc applications with mechanical sympathy in mind, or you’re simply interested in good .NET concurrency, this post is for you.

Akka.NET

Akka.NET is an actor system implementation for the .NET platform. It has been ported from Java. Recently I spent some time playing with it and reading through the codebase. One of the classes that I took a look into was UnboundedMailboxQueue, which uses the general-purpose concurrent queue from the .NET BCL. It looked strange to me: knowing the Envelope structure that is passed through this queue, one could implement a better queue. I did it in this PR, lowering the number of allocations by 10% and speeding up the queue by ~8%. Taking into consideration that queues are the foundation of akka actors, this result was quite promising. I used the benchmark tests provided with the platform and it looked good. Fortunately, Jeff Cyr ran some tests on x86, providing results that were disturbing. On x86 the new queue was underperforming. Finally I closed the PR without providing this change.

The queue design

The custom queue I provided used a similar design to the original concurrent queue. The difference was using the Envelope fields (there are two: message & sender) to mark a message as published, without using the concurrent queue’s state array. Again, knowing the structure you want to pass to the other side via a concurrent queue was vital for this design. You can’t make a universal general collection. Note ‘general’, not ‘generic’.
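
To give a feel for the design, here is my own greatly simplified sketch (bounded, single consumer, illustrative names; the PR’s queue is unbounded and more involved): a slot counts as published once its sender reference becomes visible, so no separate state array is needed.

using System.Threading;

// Simplified Envelope: the real akka.net one carries a typed sender reference.
public struct Envelope
{
    public object Message;
    public object Sender;
}

public sealed class EnvelopeRing
{
    private readonly Envelope[] _slots;
    private int _next = -1;   // last claimed slot
    private int _head;        // next slot to consume (single consumer)

    public EnvelopeRing(int capacity)
    {
        _slots = new Envelope[capacity];
    }

    public bool TryEnqueue(object message, object sender)
    {
        var index = Interlocked.Increment(ref _next);
        if (index >= _slots.Length)
            return false; // full (no wrap-around in this sketch)

        _slots[index].Message = message;
        // Publishing write: the consumer only trusts a slot once Sender is visible.
        Volatile.Write(ref _slots[index].Sender, sender);
        return true;
    }

    public bool TryDequeue(out Envelope envelope)
    {
        var index = _head;
        if (index >= _slots.Length || Volatile.Read(ref _slots[index].Sender) == null)
        {
            envelope = default(Envelope);
            return false; // nothing published yet
        }

        envelope = _slots[index];
        _head = index + 1; // single consumer, plain write is fine
        return true;
    }
}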

Volatile

To make the change finally visible to a queue’s consumer, Volatile.Write was used. The only difference was the type being written. In the BCL’s concurrent queue it was a bool in an array. In my case it was an object. Both used different overloads of Volatile.Write(ref ….). For the sake of reference, Volatile.Write ensures a release barrier, so if a queue’s consumer reads the status with Volatile.Read (an acquire barrier), it will eventually see the written value.
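
A minimal illustration of that release/acquire pairing, separate from the queue itself (my own example):

using System.Threading;

// A consumer that sees _ready == true is guaranteed to also see _payload.
class Handoff
{
    private int _payload;
    private bool _ready;

    public void Produce(int value)
    {
        _payload = value;                  // ordinary write
        Volatile.Write(ref _ready, true);  // release: cannot be reordered before the line above
    }

    public bool TryConsume(out int value)
    {
        if (Volatile.Read(ref _ready))     // acquire: later reads cannot move before this one
        {
            value = _payload;
            return true;
        }
        value = 0;
        return false;
    }
}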

Some kind of reproduction

To see how .NET performs these operations, I used two types and ran a sample application as both x64 and x86. Let’s take a look at the code first.

using System.Threading;

struct VolatileInt
{
    int _value;

    // Plain write: no memory barrier.
    public void Write(int value)
    {
        _value = value;
    }

    // Volatile write: ensures a release fence.
    public void WriteVolatile(int value)
    {
        Volatile.Write(ref _value, value);
    }
}

struct VolatileObject
{
    object _value;

    // Plain write: no memory barrier.
    public void Write(object value)
    {
        _value = value;
    }

    // Volatile write: ensures a release fence.
    public void WriteVolatile(object value)
    {
        Volatile.Write(ref _value, value);
    }
}

It’s really nothing fancy. These two either write the value ensuring a release fence or just write the value.
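
The post doesn’t show the harness itself, so here is a hypothetical one: force the JIT to compile the methods up front, then keep the process alive so a debugger can be attached and the jitted code inspected.

using System;
using System.Reflection;
using System.Runtime.CompilerServices;

static class Program
{
    static void Main()
    {
        foreach (var type in new[] { typeof(VolatileInt), typeof(VolatileObject) })
        {
            var methods = type.GetMethods(
                BindingFlags.Public | BindingFlags.Instance | BindingFlags.DeclaredOnly);

            // Force JIT compilation so the assembly can be inspected in WinDbg.
            foreach (var method in methods)
                RuntimeHelpers.PrepareMethod(method.MethodHandle);
        }

        Console.WriteLine("Methods jitted. Attach WinDbg now.");
        Console.ReadLine();
    }
}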

Windbg for x86

The methods had been prepared using RuntimeHelpers.PrepareMethod(). A WinDbg instance was attached to the process. I loaded SOS and took a look at the method tables of these two types. Because the methods had been prepared, they were already jitted, so I could easily take a look at the jitted assembler. Because x64 was performing well, let’s take a look at x86. To begin with, let’s check the non-object method, VolatileInt.WriteVolatile:

cmp     byte ptr [ecx],al
mov     dword ptr [ecx],edx
ret

Nothing heavy here. Effectively, just a memory move and a return. Let’s take a look at writing the object with VolatileObject.WriteVolatile:

cmp     byte ptr [ecx],al
lea     edx,[ecx]
call    clr!JIT_CheckedWriteBarrierEAX

Wow! Besides moving some data, an additional method is called. The method name is JIT_CheckedWriteBarrierEAX (you can probably tell by now that there may be a group of JIT_CheckedWriteBarrier methods). What is it and why does it appear only in x86?

CoreCLR to the rescue

Take a look at the following snippet and compare the blocks for x86 and non-x86. What can you see? For x86 there are additional fragments, including the one mentioned before, JIT_CheckedWriteBarrierEAX. What does it do? Let’s take a look at another piece of CoreCLR here. Let’s not dive into the implementation right now and what is checked during this call; just by looking at the first instructions of this method, one can tell that it’ll cost more than the simple int operation:

cmp     edx,dword ptr [clr!g_lowest_address]
jb      clr!JIT_CheckedWriteBarrierEAX+0x35
cmp     edx,dword ptr [clr!g_highest_address]
jae     clr!JIT_CheckedWriteBarrierEAX+0x35
mov     dword ptr [edx],eax
cmp     eax,dword ptr [clr!g_ephemeral_low]
jb      clr!JIT_CheckedWriteBarrierEAX+0x37
cmp     eax,dword ptr [clr!g_ephemeral_high]
jae     clr!JIT_CheckedWriteBarrierEAX+0x37
shr     edx,0Ah

Summing up

If you want to write well performing code and truly want to support AnyCPU, proper benchmark tests run on different architectures should be provided. Sometimes, a gain in one will cost you a lot in another. Even though this PR didn’t make it, it was an interesting journey and an extreme learning experience. There’s nothing better than answering a childish ‘why?’ on your own.

Task.WhenAll tests

In the last post I showed some optimizations one can apply to reduce the overhead of creating asynchronous state machines. Let’s dive into the async world again and consider the helper methods provided by the Task class, especially Task.WhenAll.

public static Task WhenAll(params Task[] tasks)

The method works in the following way: it accepts an array of tasks and returns a task that will finish as soon as all of the underlying tasks are finished. Applying this method in some scenarios may provide a performance gain, as one can run a few tasks in parallel. It has a drawback though.

Let’s consider the following code:

public async Task<int> A()
{
  await B1();
  await B2();
  return C();
}

If B1 and B2 could be executed in parallel (for instance, they access Azure Table Storage), this method could be rewritten in the following way.


public async Task<int> A()
{
  await Task.WhenAll(B1(), B2());
  return C();
}

What, besides the mentioned performance improvement, has changed? Method A is no longer a single method A. There are now two possible executions which can occur randomly: one running the operations in the order B1, B2, C, and the other in the order B2, B1, C. This effectively means that your previous test coverage is no longer valid. If you want to truly test it, you need to provide suites that order these B* calls properly and ensure that all permutations are emitted and tested. Sometimes it doesn’t matter, sometimes it does. Let’s consider the following scenario:

  • Two callers are calling A at the same time
  • Every B method removes a specific file, failing if it does not exist
  • At least one of the callers should succeed

In the first version, it was a pure race to be first. The first caller that went through B1, B2, C would execute properly. Now consider the second version of A, with two callers executing the following operations in the specified order:

  • Caller 1: B1, B2, C
  • Caller 2: B2, B1, C

As you can see, it’s a typical deadlock scenario and both callers would fail.

As always, there’s no silver bullet, and if you want to use Task.WhenAll to speed up your application by running operations in parallel, you must embrace the fact of a possibly non-linear execution.

Happy awaiting.

Size of Nullable

What’s the size of a nullable structure in C#? Does it depend on anything? How are its fields held in memory? Let’s consider the following cases: bool?, int?, long?, Guid?. One way to get the structure size is to use Marshal.SizeOf(Type). Unfortunately this method performs a check whether the passed type is generic. Every nullable is generic, hence we cannot use it. If you take a look at the implementation of this method, there is a call to a private method of the Marshal class, named SizeOfHelper. This method does not perform the check and can easily be used to calculate the size of the struct.
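
Since SizeOfHelper is private, it has to be called through reflection. A sketch, assuming the (Type, bool) signature from the .NET Framework sources of that era (the bool being a throw-if-not-marshalable flag); verify against your runtime:

using System;
using System.Reflection;
using System.Runtime.InteropServices;

static class NullableSize
{
    // Calls the private Marshal.SizeOfHelper; signature assumed, not guaranteed.
    public static int SizeOf(Type type)
    {
        var helper = typeof(Marshal).GetMethod(
            "SizeOfHelper",
            BindingFlags.NonPublic | BindingFlags.Static);

        return (int)helper.Invoke(null, new object[] { type, false });
    }

    static void Main()
    {
        Console.WriteLine(SizeOf(typeof(bool?))); // 8 according to the table below
        Console.WriteLine(SizeOf(typeof(Guid?))); // 20
    }
}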

Nullable consists of two fields. The first is hasValue, which answers the question whether the nullable has a value assigned. The other is the value itself. If a nullable has no value assigned, the value field will hold the default(T) value. How are these members aligned in memory, and does it depend on anything? What is the offset of these specific fields from the start of the structure?

To answer the two questions above (size and alignment) please take a look at the following table:

Type     Size    hasValue offset    value offset
bool?       8                 0                4
int?        8                 0                4
long?      16                 0                8
Guid?      20                 0                4

The first two, bool? and int?, are easy to explain. The bool hasValue field is aligned like an int and takes 4 bytes to store, so the offset of the value is 4.

What about long? Why does it take 16 bytes, not 12? Why does the value start at 8, not 4? That’s because of struct alignment, which the CLR performs to pack the given struct nicely. In other words, the struct is aligned to the length of 64-bit CPU registers.

The final example, with Guid, breaks the rule seen for long. Or maybe not? The struct size is a multiple of 8 bytes, so it’s totally OK for the CLR to use 24 bytes, as it is already aligned.

If you want to do some checks on your own, you may use the gist I created.

A CRUD action provider

As described in the previous entry, the idea of action providers is to allow gathering cross-cutting concerns into a more generic bag (generic not in the C# sense, but in terms of functionality/features) than deriving from a generic controller. To provide a working example, standard CRUD operations can be taken into consideration. Let’s take a look into another gist.

I do hate static code, but for the sake of simplicity all the CRUD operations were moved to a static class, CrudActions. All of them are pretty simple, as they use the entity model binder introduced in the former blog entries. For instance, when an entity edit result is posted to the server, a model binder will take care of updating the entity, so the entity passed to the [HttpPost]Edit action will already be altered with the changes made by the user.

The second step was discovering all of the CrudActions. It is done by a DetachedActionProvider, which caches information in a ReadWriteCache taken from the MVC internals. To close the open generic of CrudActions, an entity type and its id type are needed. The entity type information is retrieved from IEntityController, which is a simple marker convention (you could use a naming convention instead, or delegate it somewhere else) for attaching a handled entity type to a controller. The id metadata are gathered from NHibernate’s ISessionFactory. Once the data are retrieved, the CrudActions generic class definition can be turned into a closed generic and the specific method can be found, as sketched below.
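
How that closing might look (illustrative only; the real CrudActions lives in the gist, and the <TEntity, TId> arity here is an assumption):

using System;
using System.Reflection;

// Minimal stand-in for the real CrudActions from the gist.
static class CrudActions<TEntity, TId>
{
    public static string Details(TId id) => "Details of " + typeof(TEntity).Name + " " + id;
}

static class CrudActionResolver
{
    // Close the open generic for a concrete entity/id pair and find the action method.
    public static MethodInfo Resolve(Type entityType, Type idType, string actionName)
    {
        var closed = typeof(CrudActions<,>).MakeGenericType(entityType, idType);
        return closed.GetMethod(actionName, BindingFlags.Public | BindingFlags.Static);
    }
}
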
Wait a second… how do we transform a MethodInfo into an ActionDescriptor? First, the method infos are wrapped in an ActionInfo instance (it caches attributes as well as the method info itself). Next, the ActionInfos that apply (HttpGet, HttpPost, etc.) are passed to the ReflectedMvc to create a ReflectedActionDescriptor.

At the end, I’m left with some bitterness about the MVC internalization policy. I must admit that I don’t understand internalizing types, constructors, or any members which could be helpful for future app development. That case applies to the ReflectedActionDescriptor ctor: the public one throws an exception if the method passed to it is a static method. The other, non-checking one is, of course, internal. I’m asking: why? Internalization should be used as a last resort, not as a ‘hide it, because I cannot predict whether someone might break MVC in their project’ rule.

Hope you like the action provider extension point and will use it for cross-cutting, ‘I can compose, not only derive from a generic class’ actions in your controllers.

An action provider, other than controller

Yep, that’s another post in the series which will finally bring the so-called detached actions to life. Today I’ll introduce the concept of action providers, something which, in my opinion, is missing in ASP MVC.

The standard pipeline of handling a request in ASP MVC starts with creating a controller via an implementation of IControllerFactory. The default one simply uses the controller’s constructor, passing nothing to it. Then the controller is returned and its Controller.ActionInvoker is queried for a method/action matching the request. If none is found, an exception is thrown. After reading this paragraph it should be clear that adding an action to a controller can also be done at the action invoker level.

The IActionInvoker interface does not leave much room to maneuver, since it has only one method, returning bool. If we take a look at the default implementation, ControllerActionInvoker, we can find an interesting method called FindAction. It returns something called an ActionDescriptor and, in the default implementation, scans only the controller type in search of actions. But we could also describe some static methods and return them during the search. To address this, I introduced an IActionProvider interface, which can be (almost) easily implemented.
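
To make the idea concrete, here is a rough sketch of an invoker that consults providers before falling back to the regular controller scan. The IActionProvider shape below is an assumption for illustration; the real one is in the gist linked below.

using System.Web.Mvc;

// Assumed provider contract: return a descriptor for a detached action, or null.
public interface IActionProvider
{
    ActionDescriptor FindAction(ControllerContext context,
                                ControllerDescriptor controllerDescriptor,
                                string actionName);
}

public class ProvidingActionInvoker : ControllerActionInvoker
{
    private readonly IActionProvider[] _providers;

    public ProvidingActionInvoker(params IActionProvider[] providers)
    {
        _providers = providers;
    }

    protected override ActionDescriptor FindAction(ControllerContext controllerContext,
                                                   ControllerDescriptor controllerDescriptor,
                                                   string actionName)
    {
        // Ask the providers first: they may serve actions not declared on the controller.
        foreach (var provider in _providers)
        {
            var action = provider.FindAction(controllerContext, controllerDescriptor, actionName);
            if (action != null)
                return action;
        }

        // Fall back to the regular scan of the controller type.
        return base.FindAction(controllerContext, controllerDescriptor, actionName);
    }
}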

To get the whole idea, take a look into GIST: https://gist.github.com/989733

As always, all the needed types should be registered in the Windsor container and resolved with it (in the current example, the only thing that needs to be resolved is IControllerFactory, or nothing at all if you register service location with the new feature from MVC 3).

Hope you like it. In the very next post, the final implementation of one of the IActionProviders.