The cost of scan queries in Azure Table Storage

There are multiple articles describing the performance of Azure Table Storage. You have probably read Troy Hunt's entry, Working with 154 million records on Azure Table Storage…. You may have invested your time in reading How to get most out of Windows Azure Tables as well. My question is: have you really considered the limitations of the queries, specifically scan queries, and how they can consume a major part of the Azure performance targets?

The PartitionKey and RowKey make up the primary and the only index in ATS (Azure Table Storage). Depending on how a query uses them, the following kinds can be distinguished:

  1. Point Queries, which retrieve a single entity by specifying a single PartitionKey and RowKey with equality predicates
  2. Row Range Queries, which get a set of entities sharing the same PartitionKey and a range of RowKeys
  3. Partition Range Queries, which are run with a range of PartitionKeys
  4. Full table scans, which have no predicate on PartitionKey

What are the costs and limitations of these queries? Unfortunately, every row the query touches while performing a scan is counted as a table operation; there ain’t no such thing as a free lunch. This means that if you scan your entire table (the 4th scenario), you will not process more than roughly 20,000 entities per second, which severely limits scans over large data sets. If you have to model queries across different keys, consider storing the same value twice: once under the natural PartitionKey/RowKey pair and a second time under keys matching the other query, creating an inverted index. If, in every case, you would still have to scan through the entire data set, then ATS is not the way to go and you should consider other ways of modelling your data, like asynchronously copying the data to blobs, etc.
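To illustrate the inverted-index approach, here is a minimal sketch assuming the classic Microsoft.WindowsAzure.Storage client library; the table name, entity shape and key layout are mine, chosen only for the example. The same user is written twice, so both look-ups stay cheap point queries instead of scans:

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public class UserEntity : TableEntity
{
    public UserEntity() { }

    public UserEntity(string partitionKey, string rowKey)
        : base(partitionKey, rowKey) { }

    public string UserId { get; set; }
    public string Email { get; set; }
}

public class UserStore
{
    private readonly CloudTable _table;

    public UserStore(CloudStorageAccount account)
    {
        _table = account.CreateCloudTableClient().GetTableReference("users");
        _table.CreateIfNotExists();
    }

    public void Save(string userId, string email)
    {
        // natural key: look-ups by user id are point queries
        _table.Execute(TableOperation.Insert(
            new UserEntity("user", userId) { UserId = userId, Email = email }));

        // inverted index: the same data stored once more, keyed by email,
        // so a look-up by email is also a point query, not a scan
        _table.Execute(TableOperation.Insert(
            new UserEntity("user-by-email", email) { UserId = userId, Email = email }));
    }

    public UserEntity GetByEmail(string email)
    {
        var result = _table.Execute(
            TableOperation.Retrieve<UserEntity>("user-by-email", email));
        return (UserEntity)result.Result;
    }
}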

Do we really need all these data transformations?

Applications have layers. It’s still pretty common to see an enterprise application built with layers like DAL, Business Logic (or Domain), Services, etc. Let’s not discuss this abomination itself. Let us rather consider the flow of data within the application.

SELECT * FROM
That’s where the data is stored. Let us consider the good old-fashioned SQL Server. To get the data from the database you may use ADO (oh no!) or any of the newer ORMs, including micro ORMs like Dapper or something similar. What you end up with is probably some kind of object, or a collection of objects. Here’s where you start playing with the data.

Mappings
It doesn’t matter whether you use AutoMapper or map the data on your own. For encapsulation purposes, or to get an immutable version of an object, it’s common to copy its values to a new representation. I know that strings are immutable and are copied by reference, but you still copy those references as well.

Services
So you’ve got your data mapped to the right model. Now you can return it from your service. Oops, it’s a fancy REST service, so you translate the very same data again. Now, because it’s a browser asking and you use content negotiation, the data is transformed to JSON.

In onion architectures you can meet even more transformations between layers; mappings from DTOs to DTOs are quite common. The question, not only from the architectural point of view but also from the performance-oriented angle, is the same: what are you doing? Why do you want to spend so much time writing all these mappings? Why do you want to melt the CPU in never-ending mappings? Can you not skip all of this? Why not store JSON in the database, or use a database that supports JSON blobs as a first-class citizen (RavenDB, MongoDB), and simply push the content retrieved from the database right to the output stream?

All the thoughts above have been provoked by the services I’m creating now. Long story short, they store objects serialized with Google Protocol Buffers. When an external system accesses an object, the service just copies the blob right to the output stream, without deserialization. No deserialization, no allocations, no overhead. Simple and brutally fast.
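A minimal sketch of that pass-through, using hypothetical names of my own (the store abstraction below is not the actual service code): the payload stays an opaque, already-serialized byte array from storage to response.

using System.IO;

public interface IBlobStore
{
    // Returns the protobuf-encoded representation of the object exactly as stored.
    byte[] GetSerializedBlob(string objectId);
}

public class BlobPassThroughService
{
    private readonly IBlobStore _store;

    public BlobPassThroughService(IBlobStore store)
    {
        _store = store;
    }

    // Writes the already-serialized Protocol Buffers payload straight to the
    // response stream: no deserialization, no intermediate object graph.
    public void WriteTo(string objectId, Stream output)
    {
        byte[] blob = _store.GetSerializedBlob(objectId);
        output.Write(blob, 0, blob.Length);
    }
}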

Next time you come up with an onion design or layers of transformations, ask yourself whether it is worth it and whether you can afford the price of doing all these mappings.

Disruptor with MultiProducer

I hope you’re aware of the LMAX tool for fast in-memory processing called the Disruptor. If not, it’s a must-see for today’s architects. It’s nice to see your process eating messages at a rate of ~10 million per second.
One of the problems addressed in the latest release was a fast multi-producer, allowing multiple concurrently running tasks to publish data for their consumers. I must admit that the simplicity of this robust part is astonishing. How could one handle claiming and publishing one or a few items from the ring buffer? It’s easy: claim them in the standard way, using a CAS operation to let other threads know about the claimed value, and then publish. But how to publish this kind of information? Here comes the beauty of this solution:

  1. allocate an int array of the ring buffer’s length
  2. when items are published, calculate their positions in the ring (sequence % ring.length)
  3. set those slots in the helper int array to the sequence numbers (or values derived from them)

This, at the cost of one extra int array, allows:

  1. waiting for a producer by simply checking whether the value in the int array matches the current iteration number of the ring buffer
  2. publishing items in the same order they were claimed
  3. publishing with no additional CASes

Simple, powerful and fast.
Come, take a look at it: MultiProducerSequencer
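Below is a simplified C# sketch of that mechanism, under my own names rather than the actual Disruptor code: a single CAS to claim a sequence, and a plain write to the helper array, stamped with the ring iteration number, to publish it.

using System.Threading;

public class MultiProducerSequencerSketch
{
    private readonly int _bufferSize;   // assumed to be a power of two
    private readonly int[] _available;  // helper array, one slot per ring entry
    private long _cursor = -1;          // last claimed sequence

    public MultiProducerSequencerSketch(int bufferSize)
    {
        _bufferSize = bufferSize;
        _available = new int[bufferSize];
        for (var i = 0; i < bufferSize; i++)
        {
            _available[i] = -1;         // nothing published yet
        }
    }

    // Claiming the next sequence is the only place a CAS is needed.
    // (Gating against slow consumers, present in the real sequencer, is omitted here.)
    public long Next()
    {
        while (true)
        {
            var current = Volatile.Read(ref _cursor);
            var next = current + 1;
            if (Interlocked.CompareExchange(ref _cursor, next, current) == current)
            {
                return next;
            }
        }
    }

    // Publishing is a simple write: the slot is stamped with the number of the
    // ring iteration ("round") the sequence belongs to.
    public void Publish(long sequence)
    {
        var index = (int)(sequence % _bufferSize);
        var round = (int)(sequence / _bufferSize);
        Volatile.Write(ref _available[index], round);
    }

    // Consumers wait by checking whether the slot carries the round they expect,
    // so no CAS is needed on the publishing side.
    public bool IsAvailable(long sequence)
    {
        var index = (int)(sequence % _bufferSize);
        var round = (int)(sequence / _bufferSize);
        return Volatile.Read(ref _available[index]) == round;
    }
}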

Deiphobus, no more SELECT n + 1

The previous post contained information about lazy loading of groups of properties; let’s call them families, as they are called in Cassandra. What about the following code? How many database hits would you expect by default?

using (var s = sessionFactory.Open())
{
	var user = s.Load<IUser>(5);
	foreach(var post in user.Posts)
	{
		Console.WriteLine(post.Title);
	}
}

I’ll tell you how many you’ll get. The answer is two: the first hit occurs when the collection of posts is accessed in the foreach loop, the second when a title is printed to the console. During the second hit, all the posts loaded in the session have their titles loaded. In some cases this may introduce a small overhead, but in the majority of cases it simplifies batching and working with your entities. Would anyone like to set FetchMode, like it was done in NHibernate? ;)
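For illustration, here is a rough, self-contained sketch of that batching behaviour (my own simplification, not Deiphobus’ actual implementation): the session tracks every proxy it hands out, and the first access to a not-yet-loaded property triggers a single load of that property for all tracked entities.

using System;
using System.Collections.Generic;
using System.Linq;

public class BatchingSession
{
    // Hypothetical loader: given a property name and entity ids, returns the
    // property value per id in one database round-trip.
    private readonly Func<string, int[], IDictionary<int, object>> _loadProperty;
    private readonly List<EntityProxy> _tracked = new List<EntityProxy>();

    public BatchingSession(Func<string, int[], IDictionary<int, object>> loadProperty)
    {
        _loadProperty = loadProperty;
    }

    public EntityProxy Load(int id)
    {
        var proxy = new EntityProxy(id, this);
        _tracked.Add(proxy);
        return proxy;
    }

    // Called by a proxy on the first access to a property: one hit loads the
    // property for every entity tracked by the session.
    internal void LoadPropertyForAll(string propertyName)
    {
        var ids = _tracked.Select(p => p.Id).ToArray();
        var values = _loadProperty(propertyName, ids);
        foreach (var proxy in _tracked)
        {
            proxy.SetLoaded(propertyName, values[proxy.Id]);
        }
    }
}

public class EntityProxy
{
    private readonly BatchingSession _session;
    private readonly Dictionary<string, object> _loaded = new Dictionary<string, object>();

    public EntityProxy(int id, BatchingSession session)
    {
        Id = id;
        _session = session;
    }

    public int Id { get; private set; }

    public object Get(string propertyName)
    {
        object value;
        if (!_loaded.TryGetValue(propertyName, out value))
        {
            // first access: batch-load this property for all tracked entities
            _session.LoadPropertyForAll(propertyName);
            value = _loaded[propertyName];
        }
        return value;
    }

    internal void SetLoaded(string propertyName, object value)
    {
        _loaded[propertyName] = value;
    }
}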

Cache me this, cache me that

Recently I’ve been optimizing an application. After a few rounds of profiling, analyzing data and fixing bottlenecks, I stopped when the app, written with NHibernate and intensively using the CacheManager from EntLib, was spending 5% of its time getting data from the cache. Since the performance had already been improved by an order of magnitude, the time spent within the GetData method was far too large. I reviewed the code and wrote down a few statements describing what the application requires from a cache:

  • no expiration, and
  • it is infrequently erased, so
  • it has a big read:write ratio
  • the number of cached items can be easily estimated, hence
  • the number of cached items will not frequently exceed the maximum number of items
  • when scavenging does occur (current number > max number of items), it can take a while, since hitting the upper limit of the cache size has a low probability

Following these points, the performance results of the GetData method and the need to tune the app (it was the most obvious point for optimization), I came up with the Cache Manager implementation below. The code is simple and oriented towards frequent reads (a slim reader/writer lock), with all operations having a cost of O(1). Adding, when the maximum number of cache items is exceeded, takes O(n), because scavenging takes O(n) to clean up the cache (FIFO order is used to pick the entries to remove). Removal of a key is handled by marking it as removed in the key look-up list.

The whole implementation, with tests, took me 40 minutes and, together with other small optimizations, improved performance by 10%. The main idea behind this post is to give you some thoughts about using external all-in-one solutions versus writing a small, custom one for only your needs. Hope the code example helps as well :-)

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using Trader.Core.Exceptions;
using Trader.Infrastructure.IoC.Wcf;

namespace Trader.Infrastructure.Caching
{
    /// <summary>
    /// The implementation of <see cref="ICacheManager"/>,
    /// using a FIFO strategy for scavenging too many objects in the cache.
    /// </summary>
    /// <remarks>
    /// This implementation is thread safe and its instances can be shared
    /// by different threads.
    /// </remarks>
    public class CacheManager : ICacheManager, IDisposable
    {
        private readonly Dictionary<string, CacheItem> _cache;
        private readonly int _capacity;
        private readonly int _maxNumberOfItems;
        private readonly int _numberToRemove;
        private readonly ReaderWriterLockSlim _rwl = new ReaderWriterLockSlim();
        private List<CacheItem> _keys;

        public CacheManager(string name, int maximumElementsInCacheBeforeScavenging, int numberToRemove)
        {
            if (maximumElementsInCacheBeforeScavenging < 1)
            {
                throw new ArgumentException("Cannot be less than 1", "maximumElementsInCacheBeforeScavenging");
            }
            if (numberToRemove < 1)
            {
                throw new ArgumentException("Cannot be less than 1", "numberToRemove");
            }
            if (numberToRemove > maximumElementsInCacheBeforeScavenging)
            {
                throw new ArgumentException("Scavenging more than max number of elements in cache", "numberToRemove");
            }

            Name = name;
            _maxNumberOfItems = maximumElementsInCacheBeforeScavenging;
            _numberToRemove = numberToRemove;

            _capacity = maximumElementsInCacheBeforeScavenging + 1;
            _cache = new Dictionary<string, CacheItem>(_capacity);
            _keys = new List<CacheItem>(_capacity);
        }

        public string Name { get; private set; }

        public int Count
        {
            get
            {
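                // Note: Readlock()/WriteLock() are assumed to be extension methods from the
                // project's own infrastructure that acquire the lock and return an IDisposable
                // releasing it on Dispose; they are not part of ReaderWriterLockSlim itself.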
                using (_rwl.Readlock())
                {
                    return _cache.Count;
                }
            }
        }

        /// <summary>
        /// Adds the specified key and value to the cache.
        /// </summary>
        /// <param name="key">The key of the cache entry.</param>
        /// <param name="value">The value of the cache entry.</param>
        /// <remarks>
        /// This is an O(1) operation, but when the maximum number of cache items is exceeded,
        /// it calls the <see cref="Scavange"/> method; see its remarks for more.
        /// </remarks>
        public void Add(string key, object value)
        {
            key.ThrowIfNull("key");
            using (_rwl.WriteLock())
            {
                CacheItem item;
                if (_cache.TryGetValue(key, out item))
                {
                    // if exists only update
                    item.Value = value;
                }
                else
                {
                    item = new CacheItem(value, _keys.Count, key);
                    _cache[key] = item;
                    _keys.Add(item);

                    // scavange if needed
                    if (_cache.Count > _maxNumberOfItems)
                    {
                        Scavange();
                    }
                }
            }
        }

        public bool Contains(string key)
        {
            using (_rwl.Readlock())
            {
                return _cache.ContainsKey(key);
            }
        }

        public void Flush()
        {
            using (_rwl.WriteLock())
            {
                _cache.Clear();
                _keys.Clear();
            }
        }

        public object GetData(string key)
        {
            using (_rwl.Readlock())
            {
                CacheItem result;
                _cache.TryGetValue(key, out result);

                if (result != null)
                {
                    return result.Value;
                }
                return null;
            }
        }

        /// <summary>
        /// Removes a specified key from the cache.
        /// </summary>
        /// <param name="key">The key to be removed.</param>
        /// <remarks>
        /// When a non-existing key is passed, it simply does nothing.
        /// This is an O(1) operation.
        /// </remarks>
        public void Remove(string key)
        {
            using (_rwl.WriteLock())
            {
                CacheItem result;
                _cache.TryGetValue(key, out result);

                if (result != null)
                {
                    // mark key as removed for scavenging optimization
                    _keys[result.KeyIndex] = null;
                    _cache.Remove(key);
                }
            }
        }

        public void Dispose()
        {
            _rwl.Dispose();
        }

        /// <summary>
        /// Scavanges elements, deleting at most n entries, where n is the numberToRemove
        /// constructor parameter.
        /// </summary>
        /// <returns>The number of scavanged items.</returns>
        /// <remarks>
        /// This method has O(n) complexity. It iterates over and recreates the key index list.
        /// </remarks>
        public int Scavange()
        {
            var removedSoFar = 0;
            for (var i = 0; i < _keys.Count && removedSoFar < _numberToRemove; i++)
            {
                if (_keys[i] != null)
                {
                    _cache.Remove(_keys[i].Key);
                    _keys[i] = null;
                    ++removedSoFar;
                }
            }

            var newKeys = new List<CacheItem>(_capacity);
            for (var i = 0; i < _keys.Count; i++)
            {
                if (_keys[i] != null)
                {
                    newKeys.Add(_keys[i]);
                }
            }

            // fix indexes
            for (var i = 0; i < newKeys.Count; i++)
            {
                newKeys[i].KeyIndex = i;
            }

            _keys = newKeys;

            return removedSoFar;
        }

        #region Nested type: CacheItem

        [DebuggerDisplay("Cache item: Key {Key}, KeyIndex {KeyIndex}")]
        private class CacheItem
        {
            public readonly string Key;
            public object Value;
            public int KeyIndex;

            public CacheItem(object value, int keyIndex, string key)
            {
                Key = key;
                Value = value;
                KeyIndex = keyIndex;
            }
        }

        #endregion
    }
}

You don’t mess with Unity’s policies

Recently I ran into a problem. A quite well-designed system degraded in terms of performance after a few commits. The whole architecture is wired together with an infrastructure library that uses the Unity container as the base for dependency injection. One of its features is simple proxy wrapping, for instance adding some information to an exception’s data when one is thrown. The interceptor doing it is quite simple:

public class ExceptionDataAppendingInterceptor : IInterceptor
{
	public void Intercept(IInvocation invocation)
	{
		try
		{
			invocation.Proceed();
		}
		catch (Exception ex)
		{
			ex.AppendMethodCallData(
			    invocation.Method, invocation.Arguments);
			throw;
		}
	}
}

The extension method appending the data is a bit more complex, but it does not add any value to this post.

Bearing in mind that an exception can be thrown at various points of an application, and having it run for the very first time in the test environment, I configured Unity to apply the proxy to the majority of created objects that are resolved as interfaces (proxying their interfaces).
When I ran a profiler, it showed a major overhead created by one strategy responsible for wrapping a created object with the use of DynamicProxy2. What the profiler showed was plenty of calls to IEnumerable extension methods. The strategy called one policy which, before the fix, looked like this:

public class InterceptorSelectorPolicy : IBuilderPolicy
{
	private readonly IDictionary<Type, List<Func<IBuilderContext, IInterceptor>>> 
		_activators;
	private static readonly ProxyGenerator ProxyGenerator = new ProxyGenerator();

	public InterceptorSelectorPolicy(
		IDictionary<Type, List<Func<IBuilderContext, IInterceptor>>> activators)
	{
		_activators = activators;
	}

	/// <summary>
	/// Determines whether the specified context is applicable for proxy generation.
	/// </summary>
	public bool IsApplicable(IBuilderContext context)
	{
		return _activators.ContainsKey(BuildKey.GetType(context.OriginalBuildKey));
	}

	/// <summary>
	/// Creates the proxy using <see cref="IBuilderContext.Existing"/> as target
	/// and <see cref="IBuilderContext.OriginalBuildKey"/> as proxied interface type.
	/// </summary>
	public object CreateProxy(IBuilderContext context)
	{
		var typeToProxy = BuildKey.GetType(context.OriginalBuildKey);

		return ProxyGenerator.CreateInterfaceProxyWithTarget(
			typeToProxy,
			context.Existing,
			_activators[typeToProxy]
			.Select(a => a(context))
			.ToArray());
	}
}

It’s worth mentioning that it was called every time an object was created… After the refactoring, the code lost all the enumerable create-a-state-machine-for-my-iterator overhead and was changed to:

public class InterceptorSelectorPolicy : IBuilderPolicy
{
	private readonly Type _typeToProxy;
	private readonly Func<IBuilderContext, IInterceptor>[] _activators;
	private static readonly ProxyGenerator ProxyGenerator = new ProxyGenerator();

	public InterceptorSelectorPolicy(Type typeToProxy, 
		Func<IBuilderContext, IInterceptor>[] activators)
	{
		_typeToProxy = typeToProxy;
		_activators = activators;
	}

	/// <summary>
	/// Gets a value indicating whether this proxy should be applied.
	/// </summary>
	/// <value>
	///     <c>True</c> if this instance is applicable; otherwise, <c>false</c>.
	/// </value>
	public bool IsApplicable
	{
		get { return _activators != null && _activators.Length > 0; }
	}

	/// <summary>
	/// Creates the proxy using <see cref="IBuilderContext.Existing"/> as target
	/// and <see cref="IBuilderContext.OriginalBuildKey"/> as proxied interface type.
	/// </summary>
	/// <param name="context">The context.</param>
	/// <returns>Returns created proxy.</returns>
	public object CreateProxy(IBuilderContext context)
	{
		return ProxyGenerator.CreateInterfaceProxyWithTarget(
			_typeToProxy,
			context.Existing,
			CreateInterceptors(context));
	}

	private IInterceptor[] CreateInterceptors(IBuilderContext context)
	{
		var result = new IInterceptor[_activators.Length];
		for (var i = 0; i < _activators.Length; i++)
		{
			result[i] = _activators[i](context);
		}
		return result;
	}
}

No dictionary look-ups, no iterators, no unneeded overhead. The whole idea was simplified and now, for an object that does not need proxying, it is reduced to a simple, lock-free is-null array check called by a post-build strategy. You don’t mess with Unity’s policies unless you have a nice stress test written.