.NET volatile write performance degradation in x86

TL;DR

This is a summary of my investigation into writing a fast and well-designed concurrent queue for akka.net, whose performance was drastically low for 32-bit applications. Take a look at the PR here. If you’re interested in writing well-performing no-alloc applications with mechanical sympathy in mind, or you’re simply interested in good .NET concurrency, this post is for you.

Akka.NET

Akka.NET is an actor system implementation for the .NET platform, ported from Java. Recently I spent some time playing with it and reading through the codebase. One of the classes I looked into was UnboundedMailboxQueue, which uses the general-purpose concurrent queue from the .NET BCL. That looked strange to me: knowing the structure Envelope that is passed through this queue, one could implement a better queue. I did it in this PR, lowering the number of allocations by 10% and speeding up the queue by ~8%. Taking into consideration that queues are the foundation of akka actors, this result was quite promising. I used the benchmark tests provided with the platform and it looked good. Fortunately Jeff Cyr ran some tests on x86, providing results that were disturbing. On x86 the new queue was underperforming. Finally I closed the PR without this change being merged.

The queue design

The custom queue I provided used a similar design to the original concurrent queue. The difference was using the Envelope fields (there are two: message & sender) to mark a message as published, without using the concurrent queue’s state array. Again, knowing the structure you want to pass to the other side via a concurrent queue was vital for this design. You can’t make a universal general collection this way. Note ‘general’, not ‘generic’.

Volatile

To make the change finally visible to a queue’s consumer, Volatile.Write was used. The only difference was the type being written. In the BCL’s concurrent queue it was a bool in an array. In my case it was an object. Both used different overloads of Volatile.Write(ref ….). For the sake of reference, Volatile.Write ensures a release barrier, so if a queue’s consumer reads the status with Volatile.Read (the acquire barrier), it will eventually see the written value.
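
To illustrate the pairing, here’s a minimal sketch of the publish/consume pattern described above. The names (Slot, Publish, TryRead) are illustrative, not taken from the actual PR.

using System.Threading;

class Slot
{
    object _message;

    // Producer: build the message first, then publish the reference with a
    // release barrier, so a consumer can never observe a partially initialized object.
    public void Publish(object message)
    {
        Volatile.Write(ref _message, message);
    }

    // Consumer: the acquire barrier guarantees that once the reference is visible,
    // all writes made before the Volatile.Write are visible as well.
    public bool TryRead(out object message)
    {
        message = Volatile.Read(ref _message);
        return message != null;
    }
}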

Some kind of reproduction

To see how .NET performs these operations, I used two types and ran a sample application as x64 and x86. Let’s take a look at the code first.

using System.Threading;

struct VolatileInt
{
    int _value;

    public void Write(int value)
    {
        _value = value;
    }

    public void WriteVolatile(int value)
    {
        Volatile.Write(ref _value, value);
    }
}

struct VolatileObject
{
    object _value;

    public void Write(object value)
    {
        _value = value;
    }

    public void WriteVolatile(object value)
    {
        Volatile.Write(ref _value, value);
    }
}

It’s really nothing fancy. Each type either writes the value ensuring a release fence, or just writes the value.
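
As the next section describes, the methods were forced through the JIT before a debugger was attached. A minimal harness for that could look roughly like this (a sketch; the Program class is mine, not from the original investigation):

using System;
using System.Runtime.CompilerServices;

class Program
{
    static void Main()
    {
        // Force the JIT to compile both methods up front, so their native code
        // is already available for inspection once WinDbg attaches.
        RuntimeHelpers.PrepareMethod(
            typeof(VolatileInt).GetMethod("WriteVolatile").MethodHandle);
        RuntimeHelpers.PrepareMethod(
            typeof(VolatileObject).GetMethod("WriteVolatile").MethodHandle);

        Console.WriteLine("Methods JITted; attach WinDbg and load SOS.");
        Console.ReadLine(); // keep the process alive for the debugger
    }
}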

Windbg for x86

The methods had been prepared using RuntimeHelpers.PrepareMethod(). A WinDbg instance was attached to the process. I loaded SOS and took a look at the method tables of these two types. Because the methods had been prepared, they were JITted, so I could easily inspect the JITted assembler. Because x64 was performing well, let’s take a look at x86. Let’s begin with the non-object method, VolatileInt.WriteVolatile:

cmp     byte ptr [ecx],al
mov     dword ptr [ecx],edx
ret

Nothing heavy here. Effectively, just a memory move and a return. Let’s take a look at writing the object with VolatileObject.WriteVolatile:

cmp     byte ptr [ecx],al
lea     edx,[ecx]
call    clr!JIT_CheckedWriteBarrierEAX

Wow! Besides moving some data, an additional method is called. The method name is JIT_CheckedWriteBarrierEAX (you can probably tell by now that there may be a whole group of JIT_CheckedWriteBarrier methods). What is it and why does it appear only in x86?

CoreCLR to the rescue

Take a look at the following snippet and compare the blocks for x86 and non-x86. What can you see? For x86 there are additional fragments, including the one mentioned before, JIT_CheckedWriteBarrierEAX. What does it do? Let’s take a look at another piece of CoreCLR here. Let’s not dive into this implementation right now and what is checked during this call; just by looking at the first instructions of the method, one can tell that it’ll cost more than the simple int operation:

cmp     edx,dword ptr [clr!g_lowest_address]
jb      clr!JIT_CheckedWriteBarrierEAX+0x35
cmp     edx,dword ptr [clr!g_highest_address]
jae     clr!JIT_CheckedWriteBarrierEAX+0x35
mov     dword ptr [edx],eax
cmp     eax,dword ptr [clr!g_ephemeral_low]
jb      clr!JIT_CheckedWriteBarrierEAX+0x37
cmp     eax,dword ptr [clr!g_ephemeral_high]
jae     clr!JIT_CheckedWriteBarrierEAX+0x37
shr     edx,0Ah

Summing up

If you want to write well-performing code and truly want to support AnyCPU, proper benchmark tests run on different architectures should be provided. Sometimes, a gain in one will cost you a lot in another. Even though this PR didn’t make it, this was an interesting journey and an extreme learning experience. There’s nothing better than answering a childish ‘why?’ on your own.

Roslyn coding conventions applied

Roslyn is a ‘compiler as a service’ provided for both VisualBasic.NET & C#. It has a thriving community of people providing new features for .NET languages. One of the most important parts of this community is the guideline on how to contribute, which defines basic rules for coding and issuing pull requests. The most important part, not only from the Roslyn perspective but as general .NET guidance, is the Coding Conventions.

Avoid allocations in hot paths

This is a rule that should be close to every .NET developer’s heart, not only those who work on a compiler. It’s not about ‘premature optimization’. It’s about writing performant code that can actually sustain its performance when executing its hot paths for the majority of requests. Give it a try, and when writing some code next time (today?) keep this rule in mind. This kind of awareness may save you from profiling your production system, or from taking a dump just to learn that allocating a list for every cell of a two-dimensional array wasn’t the best approach. An illustration of the rule is sketched below.
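
As a hypothetical illustration (the buffer size and the Process method are made up), compare allocating inside the hot loop with hoisting the allocation out of it:

using System.Collections.Generic;

static class HotPathDemo
{
    // A stand-in for the real per-request work.
    static void Process(int request, byte[] buffer) { }

    static void AllocatingVersion(IEnumerable<int> requests)
    {
        foreach (var request in requests)
        {
            var buffer = new byte[4096]; // one allocation per request -> constant GC pressure
            Process(request, buffer);
        }
    }

    static void NonAllocatingVersion(IEnumerable<int> requests)
    {
        var buffer = new byte[4096]; // allocated once, reused across the whole hot path
        foreach (var request in requests)
        {
            Process(request, buffer);
        }
    }
}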

What’s your hot path

That’s a good question that everyone should answer on a per-system basis. I asked this question a few months ago for my RampUp library:

what’s the hot path for a system using message passing?

The answer was surprisingly obvious: the message passing itself. EventStore, which uses a similar approach, uses classes for message passing. This, like every other object, creates some GC pressure. Back then, I asked myself: is it possible to use structs for internal process communication and come up with a good way of passing them? My reasoning was as follows: if I remove the GC pressure from messages, then I remove the hottest path of allocations, and this can greatly improve the stability of my system. Was it easy? No, it wasn’t, as I needed to emit a lot of code and discover some interesting properties of the CLR. Did it work? Yes, it did. A rough sketch of the idea follows.
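
RampUp’s actual API isn’t reproduced here, but the core idea can be sketched as struct messages handled through a generic interface, so no message ever needs boxing:

using System;

struct PingMessage
{
    public long Sequence; // plain data, lives on the stack or inside a buffer
}

interface IHandle<TMessage> where TMessage : struct
{
    // Passing by ref avoids even the copy; crucially, no heap allocation occurs.
    void Handle(ref TMessage message);
}

class PingHandler : IHandle<PingMessage>
{
    public void Handle(ref PingMessage message)
    {
        Console.WriteLine(message.Sequence);
    }
}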

Next time when you write a piece of code or design a system keep the hot path question in your mind and answer it. It’s worth it.

Replacing a generic dictionary

There is a moment when you profile your high-throughput system and you hit the wall. And it’s not your code, but some BCL element. That’s what happened in RampUp when I was profiling the Write part of the buffer.

The writer is emitted, but at its very foundation it uses a dictionary of metadata stored per message type. The metadata is simple:

  • the message size
  • the offset of the message envelope

Before optimization it was using the generic Dictionary, keyed by the message type with the message metadata as values. It was horribly slow. As the set of messages does not change, you can easily provide the set of message types up front. For each message type one can obtain a RuntimeTypeHandle, which can be easily converted to a long. With a set of longs, you can select the minimal one and just subtract it from all the values. This reduces each value to an int. So here you are: you have a way of turning a type into an int, and ints are much easier to compare & handle. One could even use a simple hash map just to map between ints and metadata. This was the way to reduce the overhead of obtaining metadata with IntLookup<TValue>; a sketch of the trick is shown below. After applying the change, write performance increased by 10%.
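
IntLookup’s internals aren’t shown here, but the handle-to-int reduction could be sketched like this (the class name and storage are illustrative):

using System;
using System.Linq;

class TypeKeyReducer
{
    readonly long _min;

    public TypeKeyReducer(Type[] messageTypes)
    {
        // RuntimeTypeHandle.Value is an IntPtr, convertible to a long.
        _min = messageTypes
            .Select(t => t.TypeHandle.Value.ToInt64())
            .Min();
    }

    // Reduces a type handle to a small int, which is cheap to compare and hash.
    public int GetKey(Type messageType)
    {
        return (int)(messageType.TypeHandle.Value.ToInt64() - _min);
    }
}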

Local translation table

It’s quite common to use GUIDs as unique identifiers across systems, as they are… unique🙂 The GUID generation algorithm has a very important property: it’s local. Your code doesn’t have to contact any service or database. It just works. There’s one problem with GUIDs: they are pretty lengthy, taking 16 bytes to store.
Consider the following example: you want to provide a descriptor for a finite set of distinguishable items, like events in event sourcing. You can use:

  • an event type, a string description of it, to make it right. That’s the idea behind event storage within EventStore.
  • GUIDs, generated on developers’ machines. They will be unique, but still lengthy when it comes to storing them.
  • assigned integers, but then you will need to divide integer ranges between modules and be very careful not to step into another module’s range.

There’s one additional thing you can do: you can easily combine the second and third options. Use GUIDs in contracts, providing uniqueness, but when it comes to storing, provide a translation table (persistent, its existence ensured during start-up) mapping GUIDs to ints (4 bytes) or even shorts (2 bytes). You can easily create this kind of table locally, one for each module/service, just to cover all the contracts used by that module. This lowers the storage cost and still lets you use the nice properties of GUIDs; see the sketch below.

Simple, easy and powerful.
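
A minimal in-memory sketch of such a table (the persistence part, i.e. loading and saving the map at start-up, is left out):

using System;
using System.Collections.Generic;

class EventTypeTranslationTable
{
    readonly Dictionary<Guid, short> _guidToId = new Dictionary<Guid, short>();
    readonly Dictionary<short, Guid> _idToGuid = new Dictionary<short, Guid>();

    // Contracts keep the 16-byte GUID; storage persists only the 2-byte id.
    public short GetOrAdd(Guid eventType)
    {
        short id;
        if (_guidToId.TryGetValue(eventType, out id))
        {
            return id;
        }

        id = checked((short)_guidToId.Count);
        _guidToId.Add(eventType, id);
        _idToGuid.Add(id, eventType);
        return id;
    }

    public Guid GetGuid(short id)
    {
        return _idToGuid[id];
    }
}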

Cache me this, cache me that

Recently I’ve been optimizing an application. After a few runs of profile → analyze data → fix bottlenecks, I stopped when an app written with NHibernate, intensively using CacheManager from EntLib, was spending 5% of its time getting data from the cache. As the performance was to be increased by an order of magnitude, the time spent within the GetData method was much too large. I reviewed the code and wrote a few statements describing what the application requires from a cache:

  • no expiration, and
  • it is infrequently erased, so
  • it has a big read:write ratio;
  • the number of cached items can be easily estimated, hence
  • the number of cached items will not frequently exceed the maximum number of items;
  • once scavenging occurs (current number > max number of items), it can take a while, since there is a low probability of hitting the upper limit of the cache item count.

Following these points, the performance results of the GetData method, and the need to tune the app (it was the most obvious point for optimization), I came up with the Cache Manager implementation below. The code is simple, oriented towards frequent reads (a slim reader-writer lock), with all operations having a cost of O(1). Adding, when the maximum number of cache items is exceeded, takes O(n), because scavenging takes O(n) to clean up the cache (FIFO is used for removing entries). The removal of a key is handled by marking it as removed in the key lookup list.

The whole implementation with tests took me 40 minutes and, with other small optimizations, improved performance by 10%. The main idea behind this post is to give you some thoughts about using external all-in-one solutions versus writing a small, custom one for only your needs. Hope the code example helps as well🙂

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using Trader.Core.Exceptions;
using Trader.Infrastructure.IoC.Wcf;

namespace Trader.Infrastructure.Caching
{
    /// <summary>
    /// The implementation of <see cref="ICacheManager"/>,
    /// using a FIFO strategy for scavenging when there are too many objects in a cache.
    /// </summary>
    /// <remarks>
    /// This implementation is thread safe and its instances can be shared
    /// by different threads.
    /// </remarks>
    public class CacheManager : ICacheManager, IDisposable
    {
        private readonly Dictionary<string, CacheItem> _cache;
        private readonly int _capacity;
        private readonly int _maxNumberOfItems;
        private readonly int _numberToRemove;
        private readonly ReaderWriterLockSlim _rwl = new ReaderWriterLockSlim();
        private List<CacheItem> _keys;

        public CacheManager(string name, int maximumElementsInCacheBeforeScavenging, int numberToRemove)
        {
            if (maximumElementsInCacheBeforeScavenging < 1)
            {
                throw new ArgumentException("Cannot be less than 1", "maximumElementsInCacheBeforeScavenging");
            }
            if (numberToRemove < 1)
            {
                throw new ArgumentException("Cannot be less than 1", "numberToRemove");
            }
            if (numberToRemove > maximumElementsInCacheBeforeScavenging)
            {
                throw new ArgumentException("Scavenging more than max number of elements in cache", "numberToRemove");
            }

            Name = name;
            _maxNumberOfItems = maximumElementsInCacheBeforeScavenging;
            _numberToRemove = numberToRemove;

            _capacity = maximumElementsInCacheBeforeScavenging + 1;
            _cache = new Dictionary<string, CacheItem>(_capacity);
            _keys = new List<CacheItem>(_capacity);
        }

        public string Name { get; private set; }

        public int Count
        {
            get
            {
                using (_rwl.ReadLock())
                {
                    return _cache.Count;
                }
            }
        }

        /// <summary>
        /// Adds the specified key and value to the cache.
        /// </summary>
        /// <param name="key">The key of the cache entry.</param>
        /// <param name="value">The value of the cache entry.</param>
        /// <remarks>
        /// This is an O(1) operation, but when the number of cache items is exceeded,
        /// it calls the <see cref="Scavange"/> method. See more in its remarks.
        /// </remarks>
        public void Add(string key, object value)
        {
            key.ThrowIfNull("key");
            using (_rwl.WriteLock())
            {
                CacheItem item;
                if (_cache.TryGetValue(key, out item))
                {
                    // if exists only update
                    item.Value = value;
                }
                else
                {
                    item = new CacheItem(value, _keys.Count, key);
                    _cache[key] = item;
                    _keys.Add(item);

                    // scavange if needed
                    if (_cache.Count > _maxNumberOfItems)
                    {
                        Scavange();
                    }
                }
            }
        }

        public bool Contains(string key)
        {
            using (_rwl.ReadLock())
            {
                return _cache.ContainsKey(key);
            }
        }

        public void Flush()
        {
            using (_rwl.WriteLock())
            {
                _cache.Clear();
                _keys.Clear();
            }
        }

        public object GetData(string key)
        {
            using (_rwl.ReadLock())
            {
                CacheItem result;
                _cache.TryGetValue(key, out result);

                if (result != null)
                {
                    return result.Value;
                }
                return null;
            }
        }

        /// <summary>
        /// Removes a specified key from the cache.
        /// </summary>
        /// <param name="key">The key to be removed.</param>
        /// <remarks>
        /// When a non-existing key is passed, it simply does nothing.
        /// This is an O(1) operation.
        /// </remarks>
        public void Remove(string key)
        {
            using (_rwl.WriteLock())
            {
                CacheItem result;
                _cache.TryGetValue(key, out result);

                if (result != null)
                {
                    // mark key as removed for scavenging optimization
                    _keys[result.KeyIndex] = null;
                    _cache.Remove(key);
                }
            }
        }

        public void Dispose()
        {
            _rwl.Dispose();
        }

        /// <summary>
        /// Scavanges elements, deleting at most n entries, where n is the numberToRemove
        /// ctor parameter.
        /// </summary>
        /// <returns>The number of scavanged items.</returns>
        /// <remarks>
        /// This method has O(n) complexity. It iterates over and recreates the index list.
        /// </remarks>
        public int Scavange()
        {
            var removedSoFar = 0;
            for (var i = 0; i < _keys.Count && removedSoFar < _numberToRemove; i++)
            {
                if (_keys[i] != null)
                {
                    _cache.Remove(_keys[i].Key);
                    _keys[i] = null;
                    ++removedSoFar;
                }
            }

            var newKeys = new List<CacheItem>(_capacity);
            for (var i = 0; i < _keys.Count; i++)
            {
                if (_keys[i] != null)
                {
                    newKeys.Add(_keys[i]);
                }
            }

            // fix indexes
            for (var i = 0; i < newKeys.Count; i++)
            {
                newKeys[i].KeyIndex = i;
            }

            _keys = newKeys;

            return removedSoFar;
        }

        #region Nested type: CacheItem

        [DebuggerDisplay("Cache item: Key {Key}, KeyIndex {KeyIndex}")]
        private class CacheItem
        {
            public readonly string Key;
            public object Value;
            public int KeyIndex;

            public CacheItem(object value, int keyIndex, string key)
            {
                Key = key;
                Value = value;
                KeyIndex = keyIndex;
            }
        }

        #endregion
    }
}
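
A quick usage sketch (the cache name and the numbers are arbitrary):

using (var cache = new CacheManager("metadata", 1000, 100))
{
    cache.Add("user/1", new object());
    var value = cache.GetData("user/1");   // O(1) read under the read lock
    cache.Remove("user/1");                // O(1), marks the key slot as removed
}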

You don’t mess with Unity’s policies

Recently I ran into a problem. A quite well-designed system degraded in terms of performance after a few commits. The whole architecture is wired together with an infrastructure library built on the Unity container as a base for dependency injection. One of its features is simple proxy wrapping, for instance adding some information to an exception’s data when one is thrown. The interceptor doing it is quite simple:

public class ExceptionDataAppendingInterceptor : IInterceptor
{
	public void Intercept(IInvocation invocation)
	{
		try
		{
			invocation.Proceed();
		}
		catch (Exception ex)
		{
			ex.AppendMethodCallData(
			    invocation.Method, invocation.Arguments);
			throw;
		}
	}
}

The extension method appending the data is a bit more complex, but does not add any value to the post.

Bearing in mind that exceptions can be thrown at various points of an application, and having it run for the very first time in the test environment, I configured Unity to apply the proxy to the majority of created objects, which are resolved as interfaces (proxying their interfaces).
When I ran a profiler, it showed a major overhead created by one strategy, responsible for wrapping a created object with the use of DynamicProxy2. What the profiler showed was plenty of calls to IEnumerable extension methods. The strategy called one policy, which before the fix looked like this:

public class InterceptorSelectorPolicy : IBuilderPolicy
{
	private readonly IDictionary<Type, List<Func<IBuilderContext, IInterceptor>>> 
		_activators;
	private static readonly ProxyGenerator ProxyGenerator = new ProxyGenerator();

	public InterceptorSelectorPolicy(
		IDictionary<Type, List<Func<IBuilderContext, IInterceptor>>> activators)
	{
		_activators = activators;
	}

	/// <summary>
	/// Determines whether the specified context is applicable for proxy generation.
	/// </summary>
	public bool IsApplicable(IBuilderContext context)
	{
		return _activators.ContainsKey(BuildKey.GetType(context.OriginalBuildKey));
	}

	/// <summary>
	/// Creates the proxy using <see cref="IBuilderContext.Existing"/> as target
	/// and <see cref="IBuilderContext.OriginalBuildKey"/> as proxied interface type.
	/// </summary>
	public object CreateProxy(IBuilderContext context)
	{
		var typeToProxy = BuildKey.GetType(context.OriginalBuildKey);

		return ProxyGenerator.CreateInterfaceProxyWithTarget(
			typeToProxy,
			context.Existing,
			_activators[typeToProxy]
			.Select(a => a(context))
			.ToArray());
	}
}

It’s worth mentioning that it was called every time an object was created… After refactoring, the code lost all the enumerable _create_state_machine_for_my_iterator_ stuff and was changed to:

public class InterceptorSelectorPolicy : IBuilderPolicy
{
	private readonly Type _typeToProxy;
	private readonly Func<IBuilderContext, IInterceptor>[] _activators;
	private static readonly ProxyGenerator ProxyGenerator = new ProxyGenerator();

	public InterceptorSelectorPolicy(Type typeToProxy, 
		Func<IBuilderContext, IInterceptor>[] activators)
	{
		_typeToProxy = typeToProxy;
		_activators = activators;
	}

	/// <summary>
	/// Gets a value indicating whether this proxy should be applied.
	/// </summary>
	/// <value>
	///     <c>True</c> if this instance is applicable; otherwise, <c>false</c>.
	/// </value>
	public bool IsApplicable
	{
		get { return _activators != null && _activators.Length > 0; }
	}

	/// <summary>
	/// Creates the proxy using <see cref="IBuilderContext.Existing"/> as target
	/// and <see cref="IBuilderContext.OriginalBuildKey"/> as proxied interface type.
	/// </summary>
	/// <param name="context">The context.</param>
	/// <returns>Returns created proxy.</returns>
	public object CreateProxy(IBuilderContext context)
	{
		return ProxyGenerator.CreateInterfaceProxyWithTarget(
			_typeToProxy,
			context.Existing,
			CreateInterceptors(context));
	}

	private IInterceptor[] CreateInterceptors(IBuilderContext context)
	{
		var result = new IInterceptor[_activators.Length];
		for (var i = 0; i < _activators.Length; i++)
		{
			result[i] = _activators[i](context);
		}
		return result;
	}
}

No dictionary lookups, no iterators, no unneeded overhead. The whole idea was simplified, and now, for an object that needs no proxying, it is reduced to a simple, lock-free _is_null_ array check called by a post-build strategy. You don’t mess with Unity’s policies unless you have a nice stress test written.