ProtobufRaw vs protobuf-net

TL;DR

I’m currently working on SewingMachine, an OSS project of mine aimed at unleashing the ultimate performance for your stateful services written in/for Service Fabric (more posts: here). In this post I’m testing further (the previous test is here) whether it would be beneficial to write a custom unmanaged writer for protobuf-net using stackalloc’ed memory.

SewingMachine is raw, very raw

SewingMachine works with pointers. When storing data, you pass an IntPtr with a length as the value. Effectively, this means that if you use a managed structure to serialize your data, you’ll eventually need to either pin it (pinning tells the GC not to move the object around while it’s pinned) or have it pinned from the very beginning (which can be beneficial if an object is large and has a long lifetime). If you don’t want to use managed memory, you can always use stackalloc to allocate a small amount of memory on the stack, serialize into it, and then pass it as an IntPtr. This is the approach I’m testing now.
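To make the idea concrete, here’s a minimal sketch, not SewingMachine’s actual API: the IRawStore interface and the two-field payload layout are made up for illustration. It serializes a tiny, fixed-size payload into a stack-allocated buffer and hands it over as an IntPtr plus a length.

```csharp
using System;

// A stand-in for anything that accepts an unmanaged (IntPtr, length) pair.
interface IRawStore
{
    void Add(IntPtr data, int length);
}

static class StackAllocExample
{
    public static unsafe void Save(IRawStore store, long id, long value)
    {
        const int MaxSize = 16;                  // known upper bound for this payload
        byte* buffer = stackalloc byte[MaxSize]; // lives only for the duration of this call

        // write two 8-byte values directly into the stack buffer
        *(long*)buffer = id;
        *(long*)(buffer + 8) = value;

        // the callee must copy the bytes before this method returns,
        // as the stack memory is reclaimed afterwards
        store.Add((IntPtr)buffer, MaxSize);
    }
}
```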

Small, fixed-size payloads

If a payload, whether it’s an event or a message, is small and contains no fields of variable length (strings, arrays), you can estimate the maximum size it will take when serialized. Then, instead of using the regular protobuf-net serializer, you can write (or emit during a post-compilation step) a custom function to serialize a given type, like I did in this spike. Then it’s time to test it.
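As a rough illustration of what such a function could look like (this is not the code from the spike, just a sketch for a hypothetical message with two fixed64 fields), the writer emits protobuf wire-format tags and values by hand, so the maximum size is known at compile time.

```csharp
// A hand-rolled writer for one specific message shape: two fixed64 fields,
// so the serialized size is bounded and known upfront.
static class FixedSizeProtoWriter
{
    public const int MaxSize = 18; // 2 x (1 tag byte + 8 bytes of fixed64)

    public static unsafe int Write(byte* buffer, long field1, long field2)
    {
        var p = buffer;

        *p++ = 0x09;              // tag: field 1, wire type 1 (fixed64)
        *(long*)p = field1;       // fixed64 is little-endian on the wire
        p += 8;

        *p++ = 0x11;              // tag: field 2, wire type 1 (fixed64)
        *(long*)p = field2;
        p += 8;

        return (int)(p - buffer); // bytes written
    }
}
```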

Performance tests and over 10x improvement

Again, as in the previous post about memory, the unsafe stackalloc version shows that it could be beneficial to invest some more time, as the performance benefit is just amazing. The raw version is 10x faster. Wow!

[Chart: ProtobufRaw vs protobuf-net benchmark results]

Summary

Sometimes using raw unsafe code improves performance. It’s worth trying, especially in situations where the interface you communicate with is already unsafe and requires unsafe structures.

ThreadStatic vs stackalloc

TL;DR

I’m currently working on SewingMachine, an OSS project of mine aimed at unleashing the ultimate performance for your stateful services written in/for Service Fabric (more posts: here). In this post I’m testing whether it would be beneficial to write a custom unmanaged writer for protobuf-net, instead of using some kind of object pooling with ThreadLocal.

ThreadStatic and stackalloc

ThreadStatic is the old black. It was good to use before async-await was introduced. Now, when you don’t know which thread your continuation will run on, it’s not that useful. Still, if you’re on a good old-fashioned synchronous path, it can be used for object pooling, keeping one object per thread. That’s how protobuf-net caches ProtoReader objects.

One could use it to locally cache a chunk of memory for serialization. This could be a managed or unmanaged chunk, but eventually it would be used to pass data to some storage (in my case, SewingSession from SewingMachine). If the interface accepted unmanaged chunks, I could also use stackalloc for small objects whose memory footprint I know upfront. stackalloc provides a way to allocate a number of bytes from the stack frame. Yes, it’s unsafe, so keep your belts fastened.

ThreadStatic vs stackalloc

I gave it a try and wrote a simple (if it’s too simplistic, I encourage you to share your thoughts in the comments) test that writes either to a ThreadStatic-pooled object holding an array or to a stackalloc’ed buffer. You can find it in this gist.
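The gist boils down to something along these lines — a rough reconstruction, not the gist itself; the 64-byte payload and the copy loop are placeholders for the real write.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Requires <AllowUnsafeBlocks>true</AllowUnsafeBlocks> in the project file.
[MemoryDiagnoser]
public class ThreadStaticVsStackalloc
{
    [ThreadStatic]
    private static byte[] _pooled;              // one buffer per thread

    [Benchmark(Baseline = true)]
    public void ThreadStaticBuffer()
    {
        var buffer = _pooled ?? (_pooled = new byte[64]);
        for (var i = 0; i < 64; i++)
            buffer[i] = (byte)i;                // stand-in for the actual serialization
    }

    [Benchmark]
    public unsafe void StackallocBuffer()
    {
        byte* buffer = stackalloc byte[64];     // released when the method returns
        for (var i = 0; i < 64; i++)
            buffer[i] = (byte)i;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<ThreadStaticVsStackalloc>();
}
```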

How to test it? As always, to the rescue comes BenchmarkDotNet, the best benchmarking tool for any .NET dev. Let’s take a look at the summary now.

[Chart: stackalloc (local) vs ThreadStatic benchmark results]

Stackalloc wins.

There are several things that should be taken into consideration: the finally block, the real overhead of writing an object, and so on and so forth. Still, it looks like for heavily optimized code and small objects, one could use this to write them a bit faster.

Summary

Using stackalloc’ed buffers is fun and can bring some performance benefits. If I find anything unusual or worth noting with this approach, I’ll share my findings. As always, when working on performance: measure first, measure in the middle, and measure at the end.

Performance matters

TL;DR

This is a short follow-up post about Marten’s performance. It shows that saved allocations are not only about allocations and memory. It’s also about your CPU ticks, and hence the speed of your library.

Moaaaaar performance!

Let me present three pictures comparing performance before and after removing a lot of allocations. They were provided by Jeremy after benchmarking my PRs to Marten. My work was purely focused on allocations, but, as shown below, it also improved Marten’s speed of execution.

Events

The speed improvement isn’t that significant, but please take a look at the allocated bytes. It now requires much less memory than before.

[Chart: events benchmark, before and after]

Documents

The new insert is 10% faster and takes much less memory than before.

[Chart: documents benchmark, before and after]

Bulk inserts

Here, after enabling the Npgsql library to accept ArraySegment<char>, I was able to reuse the same pooled writer. The new approach not only skips allocations but also leases a pooled writer only once. Just take a look at these numbers!

[Chart: bulk loading benchmark, before and after]

Summary

When working on a library or a tool, it’s good to think about performance and memory consumption. Even in a managed, garbage-collected world, pooling buffers or objects can not only reduce memory consumption but also improve the overall speed of your creation.

 

Views’ warm up for Event Sourcing

When using Event Sourcing as a foundation for your solution, the command part is a solved problem: take the aggregate version and a command, apply the command onto the state, and try to append the created events to the store, checking the version again. There is a read part of this as well, called views, which are nothing more than aggregations of a subset of events from the system. A view works like a live query that consumes events from the log and applies them to the projection on and on. Considering that the number of events is constantly growing, how would you deploy a new version of the application containing a new view which needs to be built from the beginning, from the very first event? Even with a well-performing database, applying a few million events can take a while.
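In code, the command side boils down to something like the sketch below. The IEventStore interface is hypothetical; the point is the optimistic check against the version read when the stream was loaded.

```csharp
using System;
using System.Collections.Generic;

public interface IEventStore
{
    (IReadOnlyList<object> Events, long Version) Load(string streamId);

    // throws when the stream's current version differs from expectedVersion
    void Append(string streamId, long expectedVersion, IReadOnlyList<object> newEvents);
}

public static class CommandHandler
{
    public static void Handle(
        string streamId,
        IEventStore store,
        Func<IReadOnlyList<object>, IReadOnlyList<object>> decide)
    {
        var (history, version) = store.Load(streamId); // state as events + its version
        var newEvents = decide(history);               // apply the command to the rebuilt state
        store.Append(streamId, version, newEvents);    // optimistic check against the loaded version
    }
}
```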

Warm up routine

Let’s consider the following routine. Instead of calling views by their names, a system version is appended. Take a view ‘users’ as an example:

  1. for version 1.0.0 it’s “users-1_0_0”
  2. for version 1.2.0 it’s “users-1_2_0”

Before publishing the new version and moving all users to it, predeploy the application and run its view builder. The views will be rebuilt in the background, taking the time they need. Once the builder starts having trouble getting the latest events because there’s no more data, the views are prebuilt and the new version can be deployed.
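A sketch of the routine could look like this; the IViewBuilder interface is made up, and the version-to-name mapping follows the convention above.

```csharp
using System;

public interface IViewBuilder
{
    // returns the number of events applied; 0 means the builder has caught up with the log
    int ApplyNextBatch(string viewName);
}

public static class ViewWarmUp
{
    public static string VersionedName(string view, Version version) =>
        $"{view}-{version.Major}_{version.Minor}_{version.Build}";   // e.g. "users-1_2_0"

    public static void Run(IViewBuilder builder, string view, Version version)
    {
        var name = VersionedName(view, version);
        while (builder.ApplyNextBatch(name) > 0)
        {
            // keep rebuilding in the background until there are no more events to apply
        }
        // the view is prebuilt; the new application version can be switched on
    }
}
```

Calling `Run(builder, "users", new Version(1, 2, 0))` would then rebuild “users-1_2_0” until it catches up.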

Cost and optimization

Of course, rebuilding these views can be tedious and long. It can increase the costs of your app as well as put additional pressure on the event store. If a cost/performance optimization is needed, you can consider detecting whether a view has changed, something very similar to what Rinat did a few years back. You may come up with something more explicit as well. Whatever the mechanism, the rule would be that if a view is the same, your app uses the last existing version of the view.
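One way such detection could look is sketched below; hashing a textual definition of the projection is purely my assumption for illustration, not how Rinat’s or any particular implementation works.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ViewChangeDetection
{
    // fingerprint a textual definition of the projection (its code, SQL, or a serialized descriptor)
    public static string Fingerprint(string projectionDefinition)
    {
        using (var sha = SHA256.Create())
        {
            var hash = sha.ComputeHash(Encoding.UTF8.GetBytes(projectionDefinition));
            return BitConverter.ToString(hash, 0, 4).Replace("-", ""); // a short, stable prefix is enough here
        }
    }

    // if a view built from the same definition already exists, reuse it instead of rebuilding
    public static bool NeedsRebuild(string view, string definition, Func<string, bool> viewExists) =>
        !viewExists($"{view}-{Fingerprint(definition)}");
}
```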

Providing this warm-up routine, especially when using blue-green deployments, can not only improve the performance of your application’s start (the new-version scenario) but also provide an environment for testing the deployment before switching to the new version.

StructLayoutKind.Sequential not

If you want to write a performant multi-threaded application, which is actually an aim of RampUp, you have to deal with padding. The gains can be pretty big, considering that working with threads means you need to give each of them its own space to work in.

False sharing

False sharing is nothing more than two or more threads trying to use memory that’s mapped to a single cache line. The best case for any thread is to have its own memory space separated, and by separation I mean having enough padding on the right and on the left to keep the spaces of two threads from overlapping. The easiest way is to add an additional 64 bytes (the size of a cache line) at the beginning and at the end of the struct/class to ensure that no other thread will be able to allocate memory close enough. This mechanism is called padding.

Padding

The easiest way to apply padding is to apply the StructLayoutAttribute. If LayoutKind.Sequential is used, then adding 4 Guid fields at the beginning and 4 Guid fields at the end should work just fine. The size of a Guid is 16 bytes, which gives us the needed 64 bytes. A harder way of doing it is using LayoutKind.Explicit, as it requires adding a FieldOffsetAttribute to every field of the structure/class, explicitly stating its offset in memory. With this approach, it’s easy to start at 64 and leave some space at the end of the class.
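For example, the explicit variant could look like the sketch below; the 64-byte cache line and the single hot field are assumptions made for illustration.

```csharp
using System.Runtime.InteropServices;

// 64 bytes of padding before and after the single hot field, so no other data
// can land on the same cache line: 64 (pad) + 8 (Value) + 64 (pad) = 136 bytes total.
[StructLayout(LayoutKind.Explicit, Size = 136)]
public struct PaddedCounter
{
    [FieldOffset(64)]
    public long Value;   // the only field the owning thread actually touches
}
```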

Problem

LayoutKind.Sequential works perfectly. Almost. Unfortunately, if any field has a type whose layout is not Sequential or Explicit, the CLR will simply ignore the sequential requirement and silently apply the automatic layout, ruining the padding. This is the regular case: all classes use Auto by default. Unfortunately, it leaves the developer needing to apply the field offsets manually.
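To illustrate the pitfall with a made-up declaration (not code from RampUp): the class below asks for sequential layout and pads with Guids, but the string field refers to an auto-layout type, so the runtime silently falls back to automatic layout and may reorder the fields.

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public class SupposedlyPadded
{
    public Guid Pad1, Pad2, Pad3, Pad4;   // intended 64 bytes of leading padding

    public long Value;                    // the hot field the padding was meant to isolate

    public string Name;                   // a field of an auto-layout type: the CLR ignores Sequential here

    public Guid Pad5, Pad6, Pad7, Pad8;   // intended 64 bytes of trailing padding
}
```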

Solution

As I need this padding behavior for RampUp, I’m creating a small Fody weaver plugin called Padded, which will automatically calculate offsets (possibly with some memory overhead) for any class/struct marked with a proper attribute. Hopefully, it will be useful not only for RampUp but also for other performance-oriented projects.

The cost of scan queries in Azure Table Storage

There are multiple articles describing the performance of Azure Table Storage. You have probably read Troy Hunt’s entry, Working with 154 million records on Azure Table Storage…. You may have invested your time in reading How to get most out of Windows Azure Tables as well. My question is: have you really considered the limitations of the queries, specifically scan queries, and how they can consume a major part of the Azure performance targets?

The PartitionKey and RowKey create the primary and only index in ATS (Azure Table Storage). Depending on the query, the following kinds can be distinguished:

  1. Point Queries, which retrieve a single entity by specifying a single PartitionKey and RowKey with equality as the predicate
  2. Row Range Queries, which get a set of entities defined by the same PartitionKey and a range of RowKeys
  3. Partition Range Queries, which are run with a range of PartitionKeys
  4. Full table scans, which have no predicate for the PartitionKey

What are the costs and limitations of these queries? Unfortunately, every row accessed while performing a scan counts as a table operation; there ain’t no such thing as a free lunch. This means that if you scan your entire table (the 4th scenario), you’ll be able to process no more than 20,000 entities per second. This limits the usage of scans over large data sets. If you have to model queries across different keys, then you may consider storing the same value twice: once under the natural Partition/RowKey pair and a second time to match the other index, creating an inverted index. If, in any case, you have to scan through the entire data set, then ATS is not the way to go, and you should consider other ways of modelling your data, like asynchronously copying the data to blobs, etc.
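As a sketch of the “store it twice” idea, written against the current Azure.Data.Tables SDK rather than the client library from the time of writing, and with made-up table and property names: the same user is saved under its natural key and once more under an inverted e-mail index, so both lookups remain point queries instead of scans.

```csharp
using System.Threading.Tasks;
using Azure.Data.Tables;

public static class InvertedIndexExample
{
    public static async Task SaveUserAsync(
        TableClient users, TableClient usersByEmail, string userId, string email)
    {
        // natural key: a point query by user id
        var byId = new TableEntity("users", userId)
        {
            { "Email", email }
        };

        // inverted index: a point query by e-mail, storing only the reference back
        var byEmail = new TableEntity("users-by-email", email)
        {
            { "UserId", userId }
        };

        await users.AddEntityAsync(byId);
        await usersByEmail.AddEntityAsync(byEmail);
    }
}
```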

Do we really need all these data transformations?

Applications have layers. It’s still pretty common to see an enterprise application being built with layers like DAL, Business Logic (or Domain), Services, etc. Let’s not discuss this abomination itself. Let us rather consider the flow of the data within the application.

SELECT * FROM
That’s where the data are stored. Let us consider a good old-fashioned SQL Server. To get the data from the database you may use ADO (oh no!) or any of the newer ORMs, including micro ORMs like Dapper or something similar. What you end up with is probably some kind of object, or a collection of objects. Here’s where you start playing with the data.
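Just to make that starting point concrete, a typical micro-ORM read could look like this; the User type and the query are illustrative.

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using Dapper;

public class User
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class UserRepository
{
    public static IEnumerable<User> GetAll(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            // every row is materialized into a new User instance (Dapper buffers by default),
            // the first transformation the data goes through
            return connection.Query<User>("SELECT Id, Name FROM Users");
        }
    }
}
```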

Mappings
It doesn’t matter whether you’re using AutoMapper or mapping the data on your own. For encapsulation purposes, or to get an immutable version of an object, it’s common to copy its values into a new representation. I know that strings are immutable and are copied by reference, but you copy those references as well.
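That copying, written by hand, is roughly the sketch below; the types are illustrative, and AutoMapper would generate the equivalent work for you.

```csharp
// A mutable entity coming out of the data layer (illustrative).
public class UserEntity
{
    public int Id { get; set; }
    public string Name { get; set; }
}

// An immutable representation built by copying every value.
public sealed class UserReadModel
{
    public UserReadModel(int id, string name)
    {
        Id = id;
        Name = name;   // the string itself is not duplicated, only its reference is copied
    }

    public int Id { get; }
    public string Name { get; }
}

public static class UserMappings
{
    public static UserReadModel ToReadModel(this UserEntity entity) =>
        new UserReadModel(entity.Id, entity.Name);   // one more allocation per object, per layer
}
```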

Services
So you’ve got your data mapped to the right model. Now you can return it from your service. Oops, it’s a fancy REST service, and you translate the very same data again. Now, because it’s a browser asking and you use content negotiation, the data are transformed to JSON.

In onion architectures you can meet even more transformations between layers; mappings from DTOs to DTOs are quite common. The question, not only from the architectural point of view but also from the performance-oriented angle, is the same: what are you doing? Why do you want to spend so much time writing all these mappings? Why do you want to melt the CPU in never-ending mappings? Can you not skip all of these? Why not store JSON in the database, or use a database that supports JSON blobs as a first-class citizen (RavenDB, MongoDB), and simply push the content retrieved from the database right to the output stream?

All the thoughts above have been provoked by services I’m creating now. Long story short, they store objects serialized with Google Protocol Buffers. When an external system accesses an object, the service just copies the blob right to the output stream, without deserializing it. No deserialization, no allocations, no overhead. Simple and brutally fast.
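Conceptually, the hot path is as small as the sketch below; the IBlobStore abstraction stands in for the actual storage the services use.

```csharp
using System.IO;
using System.Threading.Tasks;

// A stand-in for whatever holds the already-serialized blobs.
public interface IBlobStore
{
    Task<Stream> OpenReadAsync(string key);
}

public static class PassThrough
{
    public static async Task CopyToResponseAsync(IBlobStore store, string key, Stream responseBody)
    {
        using (var blob = await store.OpenReadAsync(key))
        {
            // raw bytes out: nothing gets deserialized or re-serialized on the way
            await blob.CopyToAsync(responseBody);
        }
    }
}
```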

Next time you come up with an onion design or layers of transformations, ask yourself whether it’s worth it and whether you can pay the price of doing all these mappings.