Performance matters

TL;DR

This is a short follow-up post about Marten’s performance. It shows that saved allocations are not only about allocations and memory. They’re also about your CPU ticks, and hence the speed of your library.

Moaaaaar performance!

Let me present three pictures comparing performance before and after removing a lot of allocations. They were provided by Jeremy after benchmarking my PRs to Marten. My work was focused purely on allocations, but additionally, as shown below, it improved Marten’s speed of execution.

Events

The speed improvement isn’t that significant, but please take a look at the allocated bytes. It now takes much less memory than before.

events

Documents

The new insert is 10% faster and takes much less memory than before.

docs

Bulk inserts

Here, after enabling the Npgsql library to accept ArraySegment<char>, I was able to reuse the same pooled writer. The new approach not only skips allocations but also leases a pooled writer only once. Just take a look at these numbers!

bulk_loading

Summary

When working on a library or a tool, it’s good to think about performance and memory consumption. Even in a managed, garbage-collected world, using pooling for buffers or objects may not only reduce memory consumption but also improve the overall speed of your creation.
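To make the idea concrete, here is a rough sketch of buffer pooling with ArrayPool<char>; this is not Marten’s actual code, and the WriteRow method with its sink parameter is made up for the example.

using System;
using System.Buffers;

public static class PooledWriting
{
	// Copies a row into a rented buffer and hands only the used slice downstream,
	// e.g. to a writer that accepts ArraySegment<char>.
	public static void WriteRow(ReadOnlySpan<char> row, Action<ArraySegment<char>> sink)
	{
		char[] buffer = ArrayPool<char>.Shared.Rent(row.Length);
		try
		{
			row.CopyTo(buffer);
			sink(new ArraySegment<char>(buffer, 0, row.Length));
		}
		finally
		{
			// Returning the buffer lets the next call reuse it instead of allocating.
			ArrayPool<char>.Shared.Return(buffer);
		}
	}
}

The rent/return pair is what turns a per-call allocation into the reuse of a small set of buffers, which is where both the memory and the CPU savings come from.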

 

Views’ warm up for Event Sourcing

When using Event Sourcing as the foundation for your solution, the command part is a solved problem: take the aggregate version, take a command, apply it onto the state and try to append the created events to the store, checking the version again. There is a read part as well, called views, which are nothing more than aggregations of a subset of events from the system. A view works like a live query, which consumes events from the log and applies them onto the projection on and on. Considering that the number of events is constantly growing, how would you deploy a new version of the application containing a new view which needs to be built from the beginning, from the very first event? Even with a well-performing database, applying a few million events can take a while.
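As a reminder of how small the command side really is, here is a minimal sketch of that loop; the IEventStore interface and the decide delegate are hypothetical, not taken from any particular library.

using System;
using System.Collections.Generic;

public interface IEventStore
{
	(long Version, TState State) Load<TState>(Guid aggregateId);
	// Expected to fail when the aggregate has moved past expectedVersion in the meantime.
	void Append(Guid aggregateId, long expectedVersion, IReadOnlyList<object> events);
}

public static class CommandHandler
{
	public static void Handle<TState, TCommand>(
		IEventStore store, Guid aggregateId, TCommand command,
		Func<TState, TCommand, IReadOnlyList<object>> decide)
	{
		// 1. Take the aggregate version and its current state.
		var (version, state) = store.Load<TState>(aggregateId);
		// 2. Apply the command onto the state, producing new events.
		var events = decide(state, command);
		// 3. Try to append them, checking the version again (optimistic concurrency).
		store.Append(aggregateId, version, events);
	}
}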

Warm up routine

Let’s consider the following routine. Instead of calling views by their names alone, a system version is appended. Take a view ‘users’ as an example:

  1. for version 1.0.0 it’s “users-1_0_0”
  2. for version 1.2.0 it’s “users-1_2_0”

Before publishing the new version and moving all users to it, predeploy the application and run its view builder. The views will be rebuilt in the background, taking the time they need. Once the builder starts to run out of new events to fetch, the views are prebuilt and the new version can be deployed.
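A minimal sketch of the naming convention, assuming the version is a System.Version; GetViewName is a hypothetical helper, not part of any particular event store API.

using System;

public static class ViewNames
{
	// "users" + 1.2.0 => "users-1_2_0"
	public static string GetViewName(string view, Version systemVersion) =>
		$"{view}-{systemVersion.Major}_{systemVersion.Minor}_{systemVersion.Build}";
}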

Cost and optimization

Of course, rebuilding these views can be tedious and long. It can increase the costs of your app and put additional pressure on the event store. If a cost/performance optimization is needed, you can consider detecting whether a view has changed, something very similar to what Rinat did a few years back. You may come up with something more explicit as well. Whatever the mechanism, the rule would be that if a view is unchanged, your app keeps using the last existing version of that view.

Applying this warm-up routine, especially when using blue-green deployments, can improve not only the performance of your application’s start (the new-version scenario) but also provide an environment for testing the deployment before switching to the new version.

StructLayoutKind.Sequential not

If you want to write a performant multi-threaded application, which is actually the aim of RampUp, you have to deal with padding. The gains can be pretty big, considering that working with multiple threads means you need to give each of them its own space to work in.

False sharing

False sharing is nothing more than two or more threads trying to use memory that maps to a single cache line. The best case for any thread is to have its own memory space, separated from the others; by separation I mean having enough padding on the left and on the right to keep the spaces of two threads from overlapping. The easiest way is to add an additional 64 bytes (the size of a cache line) at the beginning and at the end of the struct/class to ensure that no other thread will be able to allocate memory close enough. This mechanism is called padding.

Padding

The easiest way to apply padding is by applying StructLayoutAttribute. If LayoutKind.Sequential is used, then adding 4 Guid fields at the beginning and 4 Guid fields at the end should work just fine. The size of a Guid is 16 bytes, which gives us the needed 64 bytes. A harder way of doing it is using LayoutKind.Explicit, as it requires adding FieldOffsetAttribute to every field of the structure/class, explicitly stating its offset in memory. With this approach, it’s easy to start at 64 and leave some space at the end of the class.
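A sketch of both approaches is below; the type and field names are illustrative and not taken from RampUp.

using System;
using System.Runtime.InteropServices;

// Sequential layout: 4 Guids (64 bytes) before and after the value act as padding.
[StructLayout(LayoutKind.Sequential)]
public struct SequentiallyPaddedCounter
{
	private Guid _pad1, _pad2, _pad3, _pad4;
	public long Value;
	private Guid _pad5, _pad6, _pad7, _pad8;
}

// Explicit layout: the value starts at offset 64 and the total size leaves
// another cache line free after it.
[StructLayout(LayoutKind.Explicit, Size = 192)]
public struct ExplicitlyPaddedCounter
{
	[FieldOffset(64)]
	public long Value;
}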

Problem

LayoutKind.Sequential works perfectly. Almost. Unfortunately, if any field has a type whose layout is not Sequential or Explicit, the CLR will simply ignore the sequential requirement and silently apply automatic layout, ruining the padding. This is the common case: all classes use Auto layout by default. Unfortunately, this leaves the developer with the need to apply field offsets manually.

Solution

As I need this padding behavior for RampUp, I’m creating a small Fody weaver plugin called Padded which will automatically calculate offsets (possibly with some memory overhead) for any class/struct marked with a proper attribute. Hopefully, it will be useful not only for RampUp but for other performance-oriented projects as well.

The cost of scan queries in Azure Table Storage

There are multiple articles describing the performance of Azure Table Storage. You’ve probably read Troy Hunt’s entry, Working with 154 million records on Azure Table Storage…. You may have invested your time in reading How to get most out of Windows Azure Tables as well. My question is: have you really considered the limitations of queries, specifically scan queries, and how they can consume a major part of the Azure performance targets?

The PartitionKey and RowKey create the primary and the only index in ATS (Azure Table Storage). Depending on the query the following kinds can be distinguished:

  1. Point Queries, which retrieve a single entity by specifying a single PartitionKey and RowKey, using equality as the predicate
  2. Row Range Queries, which get a set of entities sharing the same PartitionKey and a range of RowKeys
  3. Partition Range Queries, which are run with a range of PartitionKeys
  4. Full table scans, which have no predicate on PartitionKey

What are the costs and limitations of these queries? Unfortunately, every row the query accesses while performing its scan is counted as a table operation; there ain’t no such thing as a free lunch. This means that if you scan your entire table (the 4th scenario), you’ll be able to process no more than 20,000 entities per second. This limits the usefulness of scans over large data sets. If you have to model queries across different keys, then you may consider storing the same value twice: once under the natural PartitionKey/RowKey pair and a second time to match the other index, creating an inverted index. If in every case you’d have to scan through the entire data set, then ATS is not the way to go, and you should consider other ways of modelling your data, like asynchronously copying the data to blobs, etc.
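For illustration, here is how those query kinds could look with the modern Azure.Data.Tables client; the table name, keys and filter values are made up.

using Azure.Data.Tables;

var table = new TableClient("<connection-string>", "users");

// 1. Point query: a single entity, the cheapest option (one table operation).
var point = table.GetEntity<TableEntity>("partition-1", "row-42");

// 2. Row range query: one partition, a range of RowKeys.
var rowRange = table.Query<TableEntity>(
	"PartitionKey eq 'partition-1' and RowKey ge 'row-100' and RowKey lt 'row-200'");

// 3. Partition range query: a range of PartitionKeys.
var partitionRange = table.Query<TableEntity>(
	"PartitionKey ge 'partition-1' and PartitionKey lt 'partition-5'");

// 4. Full table scan: no key predicate; every accessed row counts as a table operation.
var scan = table.Query<TableEntity>(maxPerPage: 1000);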

Do we really need all these data transformations?

Applications have layers. It’s still pretty common to see an enterprise application being built with layers like DAL, Business Logic (or Domain), Services, etc. Let’s not discuss this abomination itself. Let us rather consider the flow of the data within the application.

SELECT * FROM
That’s where the data are stored. Let us consider a good old-fashioned SQL Server. To get the data from the database you may use ADO (oh no!) or any of the newer ORMs, including micro ORMs like Dapper or something similar. What you end up with is probably some kind of object, or a collection of objects. Here’s where you start playing with the data.

Mappings
It doesn’t matter whether you’re using AutoMapper or mapping the data on your own. For encapsulation purposes, or to get an immutable version of an object, it’s common to copy its values to a new representation. I know that strings are immutable and will be copied by reference, but you copy the references all the same.

Services
So you’ve got your data mapped to the right model. Now you can return them from your service. Oops, it’s a fancy REST service and you translate the very same data again. Now, because it’s a browser asking and you use content negotiation, the data are transformed to JSON.

In onion architectures, you can meet even more transformations between layers; mappings from DTOs to DTOs are quite common. The question, not only from the architecture point of view but also from the performance-oriented angle, is the same: what are you doing? Why do you want to spend so much time writing all these mappings? Why do you want to melt the CPU in never-ending mappings? Can you not skip all of this? Why not store JSON in the database, or use a database that supports JSON blobs as a first-class citizen (RavenDB, MongoDB), and simply push the content retrieved from the database right to the output stream?

All the thoughts above have been provoked by services I’m creating now. Long story short, they store objects serialized with Google Protocol Buffers. When an external system accesses an object, the service just copies the blob right to the output stream without deserializing it. No deserialization, no allocations, no overhead. Simple and brutally fast.
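Here is a sketch of that idea, assuming an ASP.NET Core endpoint; the IBlobStore abstraction and its GetSerializedAsync method are hypothetical, not the actual service code.

using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public interface IBlobStore
{
	Task<byte[]> GetSerializedAsync(string id);
}

public static class ObjectEndpoint
{
	public static async Task GetObject(HttpContext context, IBlobStore store, string id)
	{
		// The stored bytes are already in the wire format (protobuf),
		// so they go straight to the response without being deserialized.
		byte[] payload = await store.GetSerializedAsync(id);
		context.Response.ContentType = "application/x-protobuf";
		context.Response.ContentLength = payload.Length;
		await context.Response.Body.WriteAsync(payload);
	}
}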

Next time you come up with an onion design or layers of transformations, ask yourself whether it is worth it and whether you can pay the price of doing all these mappings.

Disruptor with MultiProducer

I hope you’re aware of the LMAX tool for fast in-memory processing called the Disruptor. If not, it’s a must-see for today’s architects. It’s nice to see your process eating messages at speeds of ~10 million/s.
One of the problems addressed in the latest release was a fast multi-producer, allowing one to instantiate multiple tasks publishing data for their consumers. I must admit that the simplicity of this robust part is astonishing. How could one handle claiming and publishing one or a few items from the ring buffer? It’s easy: claim them in the standard way, using a CAS operation to let other threads know about the claimed value, and then publish them. But how do you publish this kind of information? Here comes the beauty of this solution:

  1. allocate an int array of the ring buffer’s length
  2. when items are published, calculate their positions in the ring (sequence % ring.length)
  3. set the values in the helper int array to the sequence numbers (or values derived from them)

This, at the cost of an additional int array, allows:

  1. waiting for a producer by simply checking whether the value in the int array matches the current buffer iteration number
  2. publishing items in the same order they were claimed
  3. publishing with no additional CAS operations

Simple, powerful and fast.
Come, take a look at it: MultiProducerSequencer
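Here is a rough sketch of that trick; this is not the actual MultiProducerSequencer code, just the core of the publish/availability check.

using System;
using System.Threading;

public class AvailabilityBuffer
{
	private readonly int[] _available;	// one flag per ring buffer slot
	private readonly int _indexMask;
	private readonly int _indexShift;

	public AvailabilityBuffer(int bufferSize)	// bufferSize must be a power of two
	{
		_available = new int[bufferSize];
		for (var i = 0; i < bufferSize; i++) _available[i] = -1;
		_indexMask = bufferSize - 1;
		_indexShift = (int)Math.Log2(bufferSize);
	}

	// Called by a producer after it has written its claimed slot.
	public void Publish(long sequence)
	{
		var index = (int)(sequence & _indexMask);	// position in the ring
		var flag = (int)(sequence >> _indexShift);	// which lap of the ring
		Volatile.Write(ref _available[index], flag);	// no extra CAS needed
	}

	// Called by a consumer to check whether a claimed slot has been published.
	public bool IsAvailable(long sequence)
	{
		var index = (int)(sequence & _indexMask);
		var flag = (int)(sequence >> _indexShift);
		return Volatile.Read(ref _available[index]) == flag;
	}
}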

Deiphobus, no more SELECT n + 1

The previous post contained information about lazy loading of groups of properties; let’s call them families, as they are called in Cassandra. What about the following code? How many DB hits would you like to get by default?

using (var s = sessionFactory.Open())
{
	var user = s.Load<IUser>(5);
	foreach(var post in user.Posts)
	{
		Console.WriteLine(post.Title);
	}
}

I’ll tell you how many you’ll get. The answer is two: the first hit will occur when the collection of posts is accessed in the foreach loop, the second when a title is printed on the console. During the second hit, all the posts loaded in the session will have their titles loaded. In some cases this may introduce a small overhead, but it simplifies batching and working with your entities in the majority of cases. Would anyone like to set FetchMode, like it was done in NHibernate? 😉