Async programming model

TL;DR

This is a follow-up post to Async pump for better throughput in Azure. Please read that one first before moving forward.

Feedback

I’ve received a lot of feedback about my Async pump post. In a few cases this blog post from Ayende was quoted, as it describes exactly the same approach. You can read that post, but more meaningful are the comments provided by Kelly Sommers and Clemens Vasters.

The model

The await keyword has simple semantics. It splits your method and schedules the part that follows as a task continuation. This heavy lifting is done at the C# compiler level, so you don’t have to worry about it. The model of this extension is simple: define a task and its continuation. Nothing more, nothing less.
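To make this concrete, here’s a minimal sketch of that equivalence (conceptual only: the real compiler output is a full state machine, and the Stream-based copy below is just an assumed illustration):

using System.IO;
using System.Threading.Tasks;

// Conceptually, this:
static async Task CopyHeaderAsync(Stream source, Stream destination)
{
    var buffer = new byte[4096];
    var read = await source.ReadAsync(buffer, 0, buffer.Length);
    await destination.WriteAsync(buffer, 0, read);
}

// behaves like "start the task, then schedule the rest of the method as its continuation":
static Task CopyHeaderWithContinuation(Stream source, Stream destination)
{
    var buffer = new byte[4096];
    return source.ReadAsync(buffer, 0, buffer.Length)
        .ContinueWith(read => destination.WriteAsync(buffer, 0, read.Result))
        .Unwrap();
}

My approach, the “trick” from the previous post, bends this model, and here is how Clemens Vasters commented on it: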

That’s the premise of the “trick” that is allegedly achieving parallel execution of I/O and compute work here. That is, however, not the purpose of the asynchronous programming model and of the Windows IO completion port (IOCP) model. The point of IOCP is to efficiently offload IO work from user code to kernel and driver and hardware and not to bother the user code until the IO work is done

by Clemens Vasters

What it basically says is that once you await an IO operation, the code that runs afterwards is scheduled on an IO thread:

As IO typically takes very long and compute work is comparatively cheap, the goal of the IO system is to keep the thread count low (ideally one per core) and schedule all callbacks (and thus execution of interleaved user code) on that one thread. That means that, ideally, all work gets serialized and there is minimal context switching as the OS scheduler owns the thread. The point of this model is ALREADY to keep the CPU nicely busy with parallel work while there is pending IO.

by Clemens Vasters

So when your code awaits some IO, the IO call is issued, the code after the await is executed on the IO thread (as it’s assumed to be lightweight), another IO is issued, and again the same thread dispatches the callback. The async pump I proposed brings a danger: when we follow this partially awaitless approach, the continuation code is

not on a .NET pool thread, but an IO pool thread. The strategy above sits on that IO pool thread past the point where you are supposed to return it (which is the next IO call) and thus you might force the IO pool to grow

by Clemens Vasters

Summary

Measure first. Think. Then think again. The “trick” worked in my case, improving the performance of one app’s initialization (where I needed it). Does it work in every case and should it be used in general? I’d say no. It’s good to follow the programming model; if you don’t want to, you’d better have strong reasons for it.

Async pump for better throughput in Azure

This post is followed up by https://blog.scooletz.com/2017/02/20/async-programming-model

TL;DR

The introduction of async-await has changed a lot. Now, with some help from the compiler, we’re able to squeeze more throughput out of our machines, which may lower costs. In this blog post we’ll push the boundaries even further by questioning the need to await a task immediately.

Background

The story behind this pattern is simple. I’m using a part of the Azure Storage Services, the page blob. It provides a storage API targeted at random IO and page-aligned reads and writes. This makes it a perfect solution for emulating disk IO (it’s used for VMs’ disks), but it can also be used whenever you want the ability to write to a file in Azure at a specific offset. You can just read a page, modify it and write it back. If you’re interested in this topic, take a look at my talks; maybe we’ll meet during my presentation Keep Its Storage Simple Stupid.

Synchronous

It’s obvious that for reading from and writing to Storage Services you want to use async code. Using blocking calls like the one below freezes a thread which, considering the money you already pay, is not the best option. Remember, it’s the cloud and regular Storage Services are backed by HDD disks; a call might take a while. Still, let’s take a look at the sync version first. We’ll operate on a stream that has been opened from the blob.

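The original snippet was embedded as an image; below is a rough reconstruction of the idea rather than the exact code:

using System.IO;

// Fills the whole buffer by issuing blocking reads until every byte has been read.
static void ReadBuffer(Stream stream, byte[] buffer)
{
    var offset = 0;
    while (offset < buffer.Length)
    {
        var read = stream.Read(buffer, offset, buffer.Length - offset);
        if (read == 0)
            throw new EndOfStreamException();
        offset += read;
    }
}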

The method above fills the buffer from a stream. It dispatches one read after another until the whole buffer is filled.

Asynchronous

The async version of this reader is not that different. It just uses async Task in its signature and awaits the Read. There are no blocking calls, leaving some spare CPU cycles for other operations.

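Again, the original code was an image; a rough reconstruction of the async variant could look like this:

using System.IO;
using System.Threading.Tasks;

// The same loop, but awaiting each read, so no thread is blocked while the page blob responds.
static async Task ReadBufferAsync(Stream stream, byte[] buffer)
{
    var offset = 0;
    while (offset < buffer.Length)
    {
        var read = await stream.ReadAsync(buffer, offset, buffer.Length - offset);
        if (read == 0)
            throw new EndOfStreamException();
        offset += read;
    }
}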

Asynchronous pump

For the last attempt, we need to ask: what do we read this buffer for? In my case, it’s for scanning over its content. I need to read a page blob from a given position and scan/deserialize it in C# code. I do not need to preserve the buffer; as soon as I’ve read it, I can move on. It’s just about reading a log, nothing more. The second property is that entries in this log are aligned to pages, so they are well aligned for reading. Can we modify the reading part then?

We could think of using two buffers/streams. Schedule a read in the first, then in a loop:

  1. schedule a read in the second
  2. await on the first
  3. swap first and second
  4. go to 1

If we used this algorithm, we’d have a higher probability that one read has already finished and its data is ready to be processed. This means that our algorithm could, possibly, work on prefetched data without any interruptions, having the data ready when it needs it. For the sake of simplicity, the buffer array and the ReadBuffer method were enclosed in a simple helper class called Buffer.

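Once more, the original was an image; the sketch below shows the shape of the pump under a few assumptions (two buffers, each wrapping its own stream over the same blob, the ReadBufferAsync helper from the previous section, and a process callback standing in for the scanning/deserialization):

using System;
using System.IO;
using System.Threading.Tasks;

class Buffer
{
    public readonly byte[] Bytes;
    public readonly Stream Stream;   // each buffer owns its own stream over the same page blob
    public Task Pending;             // the read currently filling Bytes

    public Buffer(Stream stream, int size)
    {
        Stream = stream;
        Bytes = new byte[size];
    }

    public void ScheduleRead(long position)
    {
        Stream.Seek(position, SeekOrigin.Begin);
        Pending = ReadBufferAsync(Stream, Bytes);
    }
}

static async Task ScanAsync(Buffer first, Buffer second, long length, Func<byte[], Task> process)
{
    var size = first.Bytes.Length;
    long position = 0;

    first.ScheduleRead(position);
    while (position < length)
    {
        var next = position + size;
        if (next < length)
            second.ScheduleRead(next);   // 1. schedule a read in the second buffer

        await first.Pending;             // 2. await on the first
        await process(first.Bytes);      //    ...and scan its content

        var tmp = first;                 // 3. swap first and second
        first = second;
        second = tmp;

        position = next;                 // 4. go to 1
    }
}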

Summary

Having something ready to be awaited does not mean that you should await it immediately. Using this two-buffer approach can increase the throughput of your algorithm by ensuring that data is fetched before it is needed. Now it’s time for you to search for some pumping potential in your code!

Concurrent conditional deletes

TL;DR

Sometimes the System.Collections.Concurrent collections don’t provide enough methods to squeeze top performance out of them. Below you can find a small extension method that will make your life easier in some cases.

Concurrent dictionary

The concurrent dictionary provides a lot of useful methods. You can TryAdd and AddOrUpdate. Additionally, you can use TryUpdate, which is a very nice method for ensuring optimistic concurrency. Let’s take a look at its signature:

public bool TryUpdate(TKey key, TValue newValue, TValue comparisonValue)

It enables you to replace the value under a specific key with newValue only if the previous value is equal to comparisonValue. It’s an extremely powerful API. If you create a new object for a specific key to update it, it lets you replace that object without locks, with a single method call. Of course, if the method fails, it returns false and it’s up to you to retry.

What about deletes? What if I wanted to remove an entry only if nobody changed its value in the meantime? What if I wanted optimistic concurrency for deletes as well? Is there a method for it? Unfortunately, no. The only method for removal is

public bool TryRemove(TKey key, out TValue value)

which removes the value unconditionally, returning it through the out parameter. This breaks optimistic concurrency, as we can’t ensure that the removed entry wasn’t modified in the meantime. What can be done to make the removal conditional?

Explicit interfaces

The ConcurrentDictionary class implements a lot of interfaces; one of them is

ICollection<KeyValuePair<TKey, TValue>>

This interface has one particular method that enables removing a key/value pair:

ICollection<KeyValuePair<TKey, TValue>>.Remove(KeyValuePair<TKey, TValue> kvp)

If you take a look at the implementation, it uses a private method of the dictionary to remove the key only if the current value is equal to the value of the pair. Now we can write a simple extension method to provide conditional, optimistically concurrent removal of a key:

using System.Collections.Concurrent;
using System.Collections.Generic;

static class ConcurrentDictionaryExtensions
{
    // Removes the entry for the given key only if its current value equals previousValueToCompare,
    // by going through the explicitly implemented ICollection<KeyValuePair<TKey, TValue>>.Remove.
    public static bool TryRemoveConditionally<TKey, TValue>(
        this ConcurrentDictionary<TKey, TValue> dictionary, TKey key, TValue previousValueToCompare)
    {
        var collection = (ICollection<KeyValuePair<TKey, TValue>>)dictionary;
        var toRemove = new KeyValuePair<TKey, TValue>(key, previousValueToCompare);
        return collection.Remove(toRemove);
    }
}

which closes the gap and makes the concurrent dictionary’s API support all of its operations under optimistic concurrency.
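A hypothetical usage sketch, where a cached entry is dropped only if it hasn’t been replaced since it was read:

var cache = new ConcurrentDictionary<string, byte[]>();

// a reader grabs the current value...
if (cache.TryGetValue("key", out var current))
{
    // ...and later drops the entry only if it is still exactly the same instance;
    // false means another thread replaced or removed it in the meantime, so retry or give up
    var removed = cache.TryRemoveConditionally("key", current);
}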

Summary

With this simple tweak you can use a concurrent dictionary as a collection that fully supports optimistic concurrency.

Concurrency – ramp up your skills

Yesterday, I gave my Extreme Concurrency talk at the rg-dev user group. After the talk I had some really interesting discussions and was asked to provide some resources on the low-level concurrency I was talking about. So here’s a list of books, talks and blog posts that can help you ramp up your skills.

Videos:

  1. [C++] Herb Sutter “atomic Weapons” – it’s about C++ but covers memory models in a way that’s easy to follow, showing how they actually work
    1. Part 1
    2. Part 2

MSDN:

  1. .NET Volatile class – it has a good description of what half-barriers are and properly shows the two counterpart Read & Write methods
  2. .NET Interlocked class – the other class with a good description, providing methods that are executed atomically. Basically, these methods are JITted to single assembler instructions (a tiny example follows this list).
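A tiny illustration of both (my own example, not taken from the docs): a flag published with release/acquire semantics and a counter incremented atomically.

using System.Threading;

class Stats
{
    int _published;   // written by a producer, read by consumers
    long _hits;

    public void Publish() => Volatile.Write(ref _published, 1);      // release: earlier writes become visible first
    public bool IsPublished => Volatile.Read(ref _published) == 1;   // acquire: later reads can't be moved before it
    public void Hit() => Interlocked.Increment(ref _hits);           // a single atomic read-modify-write
}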

Code:

  1. RampUp – a project of mine 🙂
  2. [JAVA] Aeron – the messaging library

Books:

  1. Concurrent programming on Windows by Joe Duffy – this is a hard book to go through. It’s demanding and requires a lot of effort but is the best book if you want to really understand this topic

Blogs:

  1. Volatile reads and writes by Joe Duffy
  2. Sayonara volatile by Joe Duffy
  3. Atomicity, volatility and immutability are different by Eric Lippert – that’s the last part of this series
  4. [JAVA] Psychosomatic, Lobotomy, Saw – the name is strange but you won’t find any disturbing videos there. What you will find, though, is a deep dive into memory models.

Wire improvements

TL;DR

This post sums up my recent involvement in Wire, the serializer being built by the Akka.NET team. It was an interesting OSS experience and I’m glad that I could improve the performance of this serializer.

Box me not

The initial design for constructing a serializer for a complex object was based on emitting a pair of delegates (serialize, deserialize). Let’s consider deserialization:

  1. Create object
  2. For each field:
    1. Use the following method of a ValueSerializer:

      public abstract object ReadValue(
         Stream stream, DeserializerSession session)
      
    2. Cast to the field type, or unbox if the field is a value type
    3. Load the deserialized object and set the field.

The problem with this approach manifested itself for classes with many primitive properties/fields: the cost of boxing/unboxing might be quite high. On the other hand, one can’t call a method returning a common type (object) without boxing value types. How would you approach this? My answer was to move the emitting of points 1 & 2 to the value serializer and provide a generic implementation of this emitting in the base class. That allowed me to customize the emitting for primitive value types like int, long and Guid, while still preserving the generic, possibly somewhat slower, approach for the rest.
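A self-contained illustration of the boxing issue (my own minimal example, not Wire’s actual code): when the only available contract returns object, every value-type field pays for a box on read and an unbox on assignment, while a specialized, typed path avoids both.

using System;
using System.IO;

abstract class ValueReader
{
    public abstract object ReadValue(Stream stream);    // the generic contract: boxes value types
}

sealed class Int32Reader : ValueReader
{
    public override object ReadValue(Stream stream)
    {
        var buffer = new byte[4];
        stream.Read(buffer, 0, 4);
        return BitConverter.ToInt32(buffer, 0);          // boxed into an object here
    }

    public int ReadInt32(Stream stream)                  // the specialized path: no boxing at all
    {
        var buffer = new byte[4];
        stream.Read(buffer, 0, 4);
        return BitConverter.ToInt32(buffer, 0);
    }
}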

After the change, with the emitting delegated to the ValueSerializer class, it was given two new virtual methods whose default implementations still use boxing etc., but which leave much more room for special cases:

  
public virtual int EmitReadValue(Compiler<ObjectReader> c, 
   int stream, int session, FieldInfo field) 
public virtual void EmitWriteValue(Compiler<ObjectWriter> c, 
   int stream, int fieldValue, int session) 

Call me once

Wire uses the BCL’s Stream abstraction to work with a stream of bytes. When using a stream, converting to primitive value types sometimes requires a helper byte array to get all the data in one Read call. For instance, a long value is stored as 8 bytes, hence it requires an 8-byte array. To avoid allocations during serialization and deserialization, a byte chunk can be obtained by calling the GetBuffer method on the session object. But what if your object contains 4 longs? Should every field serializer call this method, or should it be called once and the result stored in a variable?
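A hedged sketch of the difference (the Session type and GetBuffer shape below are illustrative, not Wire’s exact API): the buffer is obtained once per object and reused for every primitive field.

using System.IO;

sealed class Session
{
    readonly byte[] _scratch = new byte[16];
    public byte[] GetBuffer(int length) => _scratch;     // assumed shape of the helper described above
}

static void WriteInt64(Stream stream, long value, byte[] buffer)
{
    for (var i = 0; i < 8; i++)
        buffer[i] = (byte)(value >> (8 * i));            // little-endian, no per-call allocation
    stream.Write(buffer, 0, 8);
}

static void WriteFourLongs(Stream stream, Session session, long a, long b, long c, long d)
{
    var buffer = session.GetBuffer(8);                   // called once, stored in a local...
    WriteInt64(stream, a, buffer);                       // ...instead of once per field
    WriteInt64(stream, b, buffer);
    WriteInt64(stream, c, buffer);
    WriteInt64(stream, d, buffer);
}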

I took the second approach and, by removing these additional calls to GetBuffer, was able to squeeze a bit more out of Wire again.

Summing up

Working on these features was a very pleasant experience. It took me just a few evenings and I was able to make a positive impact. And this is really nice.

.NET volatile write performance degradation in x86

TL;DR

This is a summary of my investigation into writing a fast and well-designed concurrent queue for akka.net, whose performance turned out to be drastically lower in a 32-bit application. Take a look at the PR here. If you’re interested in writing well-performing, no-alloc applications with mechanical sympathy in mind, or you’re simply interested in good .NET concurrency, this post is for you.

Akka.NET

Akka.NET is an actor system implementation for the .NET platform, ported from Java. Recently I spent some time playing with it and reading through the codebase. One of the classes I took a look at was UnboundedMailboxQueue, which uses the general-purpose concurrent queue from the .NET BCL. That looked strange to me: knowing the Envelope structure that is passed through this queue, one could implement a better queue. I did it in this PR, lowering the number of allocations by 10% and speeding up the queue by ~8%. Taking into consideration that queues are the foundation of akka actors, this result was quite promising. I used the benchmark tests provided with the platform and it looked good. Fortunately Jeff Cyr ran some tests on x86, providing results that were disturbing: on x86 the new queue was underperforming. Finally I closed the PR without pushing this change through.

The queue design

The custom queue I provided used a design similar to the original concurrent queue. The difference was using the Envelope fields (there are two: message & sender) to mark a message as published, without using the concurrent queue’s state array. Again, knowing the structure you want to pass to the other side via a concurrent queue was vital for this design. You can’t make a universal, general collection. Note ‘general’, not ‘generic’.

Volatile

To make the change eventually visible to a queue’s consumer, Volatile.Write was used. The only difference was the type being written. In the BCL’s concurrent queue it was a bool in an array; in my case it was an object. The two used different overloads of Volatile.Write(ref …). For reference, Volatile.Write ensures a release barrier, so if a queue’s consumer reads the status with Volatile.Read (an acquire barrier), it will eventually see the written value.
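To picture it, here is a much simplified sketch of the idea (not the PR’s actual code): the slot is considered published once its Sender field becomes non-null, written with Volatile.Write, and the consumer checks it with Volatile.Read.

using System.Threading;

struct Envelope
{
    public object Message;
    public object Sender;
}

sealed class Slots
{
    readonly Envelope[] _items = new Envelope[1024];

    public void Publish(int index, object message, object sender)
    {
        _items[index].Message = message;                      // plain write first
        Volatile.Write(ref _items[index].Sender, sender);     // release: marks the slot as published
    }

    public bool TryRead(int index, out Envelope envelope)
    {
        var sender = Volatile.Read(ref _items[index].Sender); // acquire: see the writes above
        if (sender == null)
        {
            envelope = default(Envelope);
            return false;
        }

        envelope = new Envelope { Message = _items[index].Message, Sender = sender };
        return true;
    }
}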

Some kind of reproduction

To see how .NET performs these operations, I used two types and ran a sample application as both x64 and x86. Let’s take a look at the code first.

using System.Threading;

struct VolatileInt
{
    int _value;

    public void Write(int value)
    {
        _value = value;                     // plain store
    }

    public void WriteVolatile(int value)
    {
        Volatile.Write(ref _value, value);  // store with release semantics
    }
}

struct VolatileObject
{
    object _value;

    public void Write(object value)
    {
        _value = value;                     // plain store of a reference
    }

    public void WriteVolatile(object value)
    {
        Volatile.Write(ref _value, value);  // store with release semantics
    }
}

It’s really nothing fancy. These two types either write the value ensuring a release fence, or just write the value.

Windbg for x86

The methods had been prepared using RuntimeHelpers.PrepareMethod(). A WinDbg instance was attached to the process. I loaded SOS and took a look at the method tables of these two types. Because the methods had been prepared, they were already JITted, so I could easily take a look at the JITted assembler. Since x64 was performing well, let’s look at x86. To begin with, let’s check the non-object method, VolatileInt.WriteVolatile:

cmp     byte ptr [ecx],al
mov     dword ptr [ecx],edx
ret

Nothing heavy here. Effectively, just a memory move and a return. Let’s take a look at writing the object with VolatileObject.WriteVolatile:

cmp     byte ptr [ecx],al
lea     edx,[ecx]
call    clr!JIT_CheckedWriteBarrierEAX

Wow! Besides moving some data, an additional method is called. The method name is JIT_CheckedWriteBarrierEAX (as you can probably tell by now, there is a whole group of JIT_CheckedWriteBarrier methods). What is it, and why does it appear only on x86?

CoreCLR to the rescue

Take a look at the following snippet and compare the blocks for x86 and non-x86. What can you see? For x86 there are additional fragments, including the JIT_CheckedWriteBarrierEAX mentioned before. What does it do? Let’s take a look at another piece of CoreCLR here. Let’s not dive into this implementation and what exactly is checked during this call; just looking at the first instructions of this method, one can tell that it’ll cost more than the simple int operation:

cmp     edx,dword ptr [clr!g_lowest_address]
jb      clr!JIT_CheckedWriteBarrierEAX+0x35
cmp     edx,dword ptr [clr!g_highest_address]
jae     clr!JIT_CheckedWriteBarrierEAX+0x35
mov     dword ptr [edx],eax
cmp     eax,dword ptr [clr!g_ephemeral_low]
jb      clr!JIT_CheckedWriteBarrierEAX+0x37
cmp     eax,dword ptr [clr!g_ephemeral_high]
jae     clr!JIT_CheckedWriteBarrierEAX+0x37
shr     edx,0Ah

Summing up

If you want to write well-performing code and truly want to support AnyCPU, proper benchmark tests run against different architectures should be provided. Sometimes, a gain on one will cost you a lot on another. Even though for now this PR didn’t make it, it was an interesting journey and an extreme learning experience. There’s nothing better than answering a childish ‘why?’ on your own.

Task.WhenAll tests

In the last post I showed some optimizations one can apply to reduce the overhead of creating asynchronous state machines. Let’s dive into the async world again and consider the helper methods provided by the Task class, especially Task.WhenAll.

public static Task WhenAll(params Task[] tasks)

The method works in the following way: it accepts an array of tasks and returns a task that finishes as soon as all of the underlying tasks have finished. Applying this method in some scenarios may provide a performance gain, as one can run a few tasks in parallel. It has a drawback though.

Let’s consider the following code:

public async Task<int> A()
{
  await B1();
  await B2();
  return C();
}

If B1 and B2 can be executed in parallel (for instance, they access Azure Table Storage), this method can be rewritten in the following way:


public async Task<int> A()
{
  await Task.WhenAll(B1(), B2());
  return C();
}

What, besides the mentioned performance improvement, has changed? Method A is no longer a single method A. There are effectively two methods that can be executed at random: one running operations in the order B1, B2, C, and the other in the order B2, B1, C. This means that your previous test coverage is no longer valid. If you want to truly test it, you need to provide suites that order these B* calls properly and ensure that all permutations are exercised and tested. Sometimes this doesn’t matter, sometimes it does. Let’s consider the following scenario:

  • Two callers are calling A at the same time
  • Every B method removes a specific file, failing if it does not exist
  • At least one of the callers should succeed

In the first version, it was a pure race to be first: the caller that goes through B1, B2, C first would execute properly. Now consider the second version of A, with the two callers executing the following operations in the specified order:

  • Caller 1: B1, B2, C
  • Caller 2: B2, B1, C

As you can see, this resembles a classic lock-ordering deadlock scenario: with this interleaving, both callers would fail.

As always, there’s no silver bullet: if you want to use Task.WhenAll to speed up your application by running operations in parallel, you must embrace the fact that execution may be non-linear.

Happy awaiting.