Pearls: Jil, serialization of primitives

The last design pearl I covered was the implementation of the discriminated union in the protobuf-net library. Now it’s time to move to an area that is less esoteric in terms of format, but still intriguing in terms of performance. Time to take a look at the fastest JSON serializer available for .NET, Jil.

Is it fast?

Jil is crazy fast. As always, the best way to do something fast is to skip some steps. You can think about skipping some computations to effectively lower the load on the CPU. When speaking about programming in .NET, one additional thing to skip is allocations. And when speaking about allocations and JSON, the ones that should come to our minds are strings. I don’t mean strings per se, but objects that are transformed into strings just before being written to the actual output.

A unique case of GUID

GUID in .NET (or Guid) stands for Globally Unique Identifier. It’s a 16-byte-long structure that provides a unique identity (let’s not deal with comb guids and other variants for now). It’s frequently used whenever it’s the client’s responsibility to generate an id. It’s also cheap, as you don’t need to call a third party to get a globally unique id.

Let’s take a look at how Guid is serialized by another .NET library, probably the most popular one: JSON.NET.
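In essence, the allocating path boils down to something like this (a simplified sketch, not JSON.NET’s exact code; see the linked GitHub source for the real thing):

using System;
using System.IO;

// A simplified sketch of the allocating approach (not JSON.NET's exact code):
// the Guid becomes a string first, and that string dies right after the write.
static void WriteGuid(TextWriter writer, Guid value)
{
    writer.Write('"');
    writer.Write(value.ToString()); // allocates a 36-char string: 32 hex digits + 4 hyphens
    writer.Write('"');
}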

As you can see above (and you can see it on GitHub), JSON.NET first calls .ToString, allocating a 36-character string (32 hex digits plus hyphens), just to pass it to the TextWriter instance. Yes, this string will end its life shortly after being written to the writer, but still, it will be allocated. What could be done better?

Jil to the rescue

We’ve got a text writer. If we had an additional char[] buffer, we could write the characters of the Guid to the buffer and then pass it to the text writer. Unfortunately, Guid does not provide a method to write its output to a buffer. Jil provides one instead. It has an additional Guid-mirroring structure that enables access to the values of Guid’s internal fields. With this and a buffer, we can write to a text writer without allocating an actual string. The code is much too big to paste in here. Probably it must be big to be that fast ;-) Just follow this link to see the selected lines.
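The core of the trick looks roughly like this (a condensed sketch; Jil’s actual struct maps all 16 bytes):

using System;
using System.Runtime.InteropServices;

// A condensed sketch of the idea: overlay Guid with byte fields using an
// explicit layout, so the raw bytes can be read without any ToString() call.
[StructLayout(LayoutKind.Explicit)]
struct GuidStruct
{
    [FieldOffset(0)] public Guid Value;

    [FieldOffset(0)] public readonly byte B00;
    [FieldOffset(1)] public readonly byte B01;
    // ... one field per byte, up to B15

    public GuidStruct(Guid value) : this()
    {
        Value = value;
    }
}

Each byte is then turned into two hex chars written straight into a reusable char[] buffer, which is finally passed to TextWriter.Write(char[], int, int), with no intermediate string involved.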

Summary

The mentioned optimization of the Guid type is just one of many that you can find in the awesome Jil library. As always, ‘do not allocate’ is one of the top commandments when talking about performance.

Pearls: the protobuf’s discriminated union

Google Protocol Buffers is a proven protocol for serializing data efficiently. It has wide adoption, enabling serialization for almost every platform and making data easy to exchange between platforms. To store a schema, you use .proto files, which describe messages in a platform-agnostic format. You can see an example below:

message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 result_per_page = 3;
}

Oneof

Sometimes you want to define a message that will have only one of its fields initialized. This can be useful when a message wraps multiple types, or in any other case that requires it. Take a look at the example below, wrapping three messages of the following types: MessageA, MessageB and MessageC.

message Wrapper {
  oneof OnlyOneOf {
    MessageA a = 1;
    MessageB b = 2;
    MessageC c = 3;  
  }
}

Protocol Buffers use a tag to store any field. First the tag (1, 2, 3 above) and the wire type of the field are stored, then the value is written. In the case above, where only one field is assigned, only that field’s tag, wire type and value would be written. Do we need to create a class that contains all the fields, though? Do we need to waste space for storing all of them?
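For reference, the field’s key is a single varint combining both pieces of information, so for small tags the overhead is a single byte:

// Protobuf encodes a field's key as a varint: (tag << 3) | wireType.
// For MessageA assigned under tag 1 (wire type 2, length-delimited):
const int tag = 1;
const int wireType = 2;                // embedded messages are length-delimited
const int key = (tag << 3) | wireType; // 0x0A, a single byte on the wire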

Protobuf-net Discriminated Union

To fix this wasteful generation, protobuf-net, a library built by Marc Gravell, added discriminated union types capable of addressing it. Let’s take a look at the first of them.

public struct DiscriminatedUnionObject
{
    private readonly int _discriminator;

    // The value, typed as Object
    public readonly object Object;

    // Indicates whether the specified discriminator is assigned
    public bool Is(int discriminator) => _discriminator == ~discriminator;

    // Create a new discriminated union value
    public DiscriminatedUnionObject(int discriminator, object value)
    {
        _discriminator = ~discriminator; // avoids issues with default value / 0
        Object = value;
    }
}

Let’s walk through all the design choices that have been made here:

  1. DiscriminatedUnionObject is a struct. It means that if a class has a field of this type, the union will be stored inline in the object, without additional allocations (you can think of it as inlining the structure, creating a “fat object”).
  2. It has a single field, Object, for storing the value (no matter what its type is).
  3. It has a single field, _discriminator, to store the tag of the assigned field.

If you generated the Wrapper class, it’d have only one field, of the DiscriminatedUnionObject type. Once a message of a specific type is set, the discriminator and the value are written into the union. Simple and efficient.
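A sketch of what the generated class could boil down to (illustrative, not protobuf-net’s actual output):

// Illustrative sketch, not protobuf-net's actual generated code:
// a single union field serves all three oneof cases.
public class Wrapper
{
    private DiscriminatedUnionObject _onlyOneOf;

    public MessageA A
    {
        get => _onlyOneOf.Is(1) ? (MessageA)_onlyOneOf.Object : null;
        set => _onlyOneOf = new DiscriminatedUnionObject(1, value);
    }

    public MessageB B
    {
        get => _onlyOneOf.Is(2) ? (MessageB)_onlyOneOf.Object : null;
        set => _onlyOneOf = new DiscriminatedUnionObject(2, value);
    }

    // MessageC under tag 3 follows the same pattern.
}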

Summing up

Mapping a generic idea, like a discriminated union, onto a platform or a language isn’t simple. Again, once it’s done in an elegant and efficient way, I truly believe it’s worth being named a pearl.

Performance matters

TL;DR

This is a short follow-up post about Marten’s performance. It shows that saved allocations are not only about allocations and memory. They’re also about your CPU ticks, hence the speed of your library.

Moaaaaar performance!

Let me present three pictures comparing performance before and after removing a lot of allocations. They were provided by Jeremy after benchmarking my PRs to Marten. My work was purely focused on allocations, but additionally, as shown below, it improved Marten’s speed of execution.

Events

The speed improvement isn’t that significant, but please take a look at the allocated bytes. It now takes much less memory than was required before.

[benchmark results: events]

Documents

The new insert is 10% faster and takes much less memory than before.

[benchmark results: documents]

Bulk inserts

Here, after enabling the npgsql library to accept ArraySegment<char>, I was able to reuse the same pooled writer. The new approach not only skips allocations but also leases a pooled writer only once. Just take a look at these numbers!

[benchmark results: bulk loading]
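For reference, the shape of the change was roughly this (a sketch with illustrative names; IBulkImporter and SerializeTo are hypothetical, not Marten’s or Npgsql’s actual API):

using System;
using System.Buffers;
using System.Collections.Generic;

// A sketch of the pattern (illustrative names): lease one pooled buffer for
// the whole bulk operation instead of allocating per document.
static void BulkInsert(IEnumerable<object> documents, IBulkImporter importer)
{
    char[] buffer = ArrayPool<char>.Shared.Rent(4096);
    try
    {
        foreach (var document in documents)
        {
            int written = SerializeTo(buffer, document); // hypothetical serializer
            importer.Write(new ArraySegment<char>(buffer, 0, written)); // hypothetical importer API
        }
    }
    finally
    {
        ArrayPool<char>.Shared.Return(buffer);
    }
}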

Summary

When working on a library or a tool, it’s good to think about performance and memory consumption. Even in a managed, garbage-collected world, using pooling for buffers or objects might not only reduce memory consumption but also improve the overall speed of your creation.


.NET volatile write performance degradation in x86

TL;DR

This is a summary of my investigation into writing a fast and well-designed concurrent queue for akka.net, whose performance was drastically low in a 32-bit application. Take a look at the PR here. If you’re interested in writing well-performing no-alloc applications with mechanical sympathy in mind, or you’re simply interested in good .NET concurrency, this post is for you.

Akka.NET

Akka.NET is an actor system implementation for the .NET platform. It has been ported from Java. Recently I spent some time playing with it and reading through the codebase. One of the classes that I took a look into was UnboundedMailboxQueue, which uses the general-purpose ConcurrentQueue from the .NET BCL. It looked strange to me, as knowing the Envelope structure that is passed through this queue, one could implement a better queue. I did it in this PR, lowering the number of allocations by 10% and speeding up the queue by ~8%. Taking into consideration that queues are the foundation of akka actors, this result was quite promising. I used the benchmark tests provided with the platform and it looked good. Fortunately, Jeff Cyr ran some tests on x86, providing results that were disturbing. On x86 the new queue was underperforming. Finally, I closed the PR without providing this change.

The queue design

The custom queue I provided used a design similar to the original concurrent queue. The difference was using the Envelope fields (there are two: message & sender) to mark a message as published, without using the concurrent queue’s state array. Again, knowing the structure you want to pass to the other side via a concurrent queue was vital for this design. You can’t make a universal general collection. Note ‘general’, not ‘generic’.

Volatile

To make the change finally visible to a queue’s consumer, Volatile.Write was used. The only difference was the type being written. In the BCL’s ConcurrentQueue it was a bool in an array. In my case it was an object. Both used different overloads of Volatile.Write(ref …). For the sake of reference, Volatile.Write ensures a release barrier, so if a queue’s consumer reads the status with Volatile.Read (an acquire barrier), it will eventually see the written value.
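Boiled down, the publish/consume pair looks like this (a sketch, not the queue’s actual code):

using System.Threading;

// A sketch of the publish/consume pair (not the actual queue code).
class Slot
{
    private object _message; // null means "not published yet"

    public void Publish(object message)
    {
        // release barrier: writes preceding this one become visible
        // to a reader that observes the new value
        Volatile.Write(ref _message, message);
    }

    public bool TryConsume(out object message)
    {
        // acquire barrier: pairs with the release barrier in Publish
        message = Volatile.Read(ref _message);
        return message != null;
    }
}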

Some kind of reproduction

To see how .NET performs these operations, I used two types and ran a sample application both as x64 and as x86. Let’s take a look at the code first.

// Writes an int field, either as a plain store or with a release barrier.
struct VolatileInt
{
    int _value;

    public void Write(int value)
    {
        _value = value;
    }

    public void WriteVolatile(int value)
    {
        Volatile.Write(ref _value, value);
    }
}

// The same pair of writes, but for a reference-type field.
struct VolatileObject
{
    object _value;

    public void Write(object value)
    {
        _value = value;
    }

    public void WriteVolatile(object value)
    {
        Volatile.Write(ref _value, value);
    }
}

It’s really nothing fancy. These two either write the value ensuring a release fence or just write it.

Windbg for x86

The methods had been prepared using RuntimeHelpers.PrepareMethod() and a WinDbg instance was attached to the process. I loaded SOS (.loadby sos clr) and took a look at the method tables of these two types. Because the methods had been prepared, they were already JITted, so I could easily take a look at the emitted assembler. Since x64 was performing well, let’s take a look at x86.
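Preparing the methods looked roughly like this (a sketch):

using System.Runtime.CompilerServices;

// Forcing the JIT to compile both methods up front (a sketch), so their
// native code already exists when WinDbg attaches:
RuntimeHelpers.PrepareMethod(typeof(VolatileInt)
    .GetMethod(nameof(VolatileInt.WriteVolatile)).MethodHandle);
RuntimeHelpers.PrepareMethod(typeof(VolatileObject)
    .GetMethod(nameof(VolatileObject.WriteVolatile)).MethodHandle);

Now, let’s check the non-object method first, VolatileInt.WriteVolatile: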

cmp     byte ptr [ecx],al
mov     dword ptr [ecx],edx
ret

Nothing heavy here. Effectively, just a memory move and a return. Let’s take a look at writing the object with VolatileObject.WriteVolatile:

cmp     byte ptr [ecx],al
lea     edx,[ecx]
call    clr!JIT_CheckedWriteBarrierEAX

Wow! Besides moving some data, an additional method is called. The method name is JIT_CheckedWriteBarrierEAX (you can probably tell by now that there is a whole group of JIT_CheckedWriteBarrier methods). What is it, and why does it appear only on x86?

CoreCLR to the rescue

Take a look at the following snippet and compare the blocks for x86 and non-x86. What can you see? For x86 there are additional fragments, including the one mentioned before, JIT_CheckedWriteBarrierEAX. What does it do? Let’s take a look at another piece of CoreCLR here. Let’s not dive into this implementation right now and what is checked during this call; just by looking at the first instructions of this method, one can tell that it’ll cost more than the simple int operation.

cmp     edx,dword ptr [clr!g_lowest_address]
jb      clr!JIT_CheckedWriteBarrierEAX+0x35
cmp     edx,dword ptr [clr!g_highest_address]
jae     clr!JIT_CheckedWriteBarrierEAX+0x35
mov     dword ptr [edx],eax
cmp     eax,dword ptr [clr!g_ephemeral_low]
jb      clr!JIT_CheckedWriteBarrierEAX+0x37
cmp     eax,dword ptr [clr!g_ephemeral_high]
jae     clr!JIT_CheckedWriteBarrierEAX+0x37
shr     edx,0Ah

Summing up

If you want to write well-performing code and truly want to support AnyCPU, proper benchmark tests run on different architectures should be provided. Sometimes, a gain in one will cost you a lot in another. Even though this PR didn’t make it, it was an interesting journey and an extreme learning experience. There’s nothing better than answering a childish ‘why?’ on your own.

Roslyn coding conventions applied

Roslyn is a ‘compiler as a service’ provided for both VisualBasic.NET & C#. It has a thriving community of people providing new features for .NET languages. One of the most important parts of this community is the contribution guideline, which defines basic rules for coding and issuing pull requests. The most important part, not only from Roslyn’s perspective but as general .NET guidance, is the Coding Conventions section.

Avoid allocations in hot paths

This is the rule that should be close to every .NET developer’s heart, not only those who work on a compiler. It’s not about ‘premature optimization’. It’s about writing performant code that can actually sustain its performance when executing its hot paths for the majority of requests. Give it a try: when writing some code next time (today?), keep this rule in mind. Awareness of this kind may save you from profiling your production system or taking a memory dump just to learn that allocating a list for every cell of a two-dimensional array wasn’t the best approach.
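A trivial illustration of the difference (illustrative code, not Roslyn’s):

using System.Collections.Generic;

// Illustrative only: the first version allocates a fresh list on every call;
// the second reuses a caller-provided buffer and allocates nothing.
public static class HotPath
{
    public static List<int> GetEvensAllocating(int[] source)
    {
        var result = new List<int>(); // a new allocation per call, deadly in a hot path
        foreach (var x in source)
            if (x % 2 == 0) result.Add(x);
        return result;
    }

    public static int GetEvens(int[] source, int[] resultBuffer)
    {
        var count = 0;
        foreach (var x in source)
            if (x % 2 == 0) resultBuffer[count++] = x;
        return count; // the caller owns and reuses the buffer
    }
}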

What’s your hot path

That’s a good question that everyone should answer on a per-system basis. I asked this question a few months ago for my RampUp library:

what’s the hot path for a system using message passing?

The answer was surprisingly obvious: the message passing itself. EventStore, which uses a similar approach, uses classes for message passing. This, plus every other object, creates some GC pressure. Back then, I asked myself: is it possible to use structs for internal process communication and come up with a good way of passing them? My reasoning was the following: if I remove the GC pressure coming from messages, then I remove the hottest path of allocations, and this can greatly improve the stability of my system. Was it easy? No, it wasn’t, as I needed to emit a lot of code and discover some interesting properties of the CLR. Did it work? Yes, it did.
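The gist of the approach can be sketched like this (illustrative, not RampUp’s actual API):

// Illustrative sketch, not RampUp's actual API: messages are structs passed
// by reference, so no message ever becomes a heap allocation.
struct PriceChanged
{
    public int InstrumentId;
    public decimal Price;
}

interface IHandle<TMessage> where TMessage : struct
{
    void Handle(ref TMessage message);
}

class PricePrinter : IHandle<PriceChanged>
{
    public void Handle(ref PriceChanged message)
    {
        // process the message without any GC pressure
    }
}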

Next time you write a piece of code or design a system, keep the hot-path question in your mind and answer it. It’s worth it.

False sharing is dead, long live the Padded

False sharing is a common problem in multithreaded .NET applications. If you allocate objects in/for different threads, they may land on the same cache line, impacting performance and limiting gains from scaling your app on a single machine. Unfortunately, because of its multithreaded nature, the RampUp library has been suffering from the same condition. I’ve decided to address it by providing tooling, rather than going through the whole codebase and applying LayoutKind.Explicit with plenty of FieldOffsets.

Padded is born

The easiest and best way of addressing cross-cutting concerns in .NET apps I’ve found so far is Fody. It’s a post-compiler/weaver based on the mighty Mono.Cecil library. The tool has decent documentation, allowing one to create even a quite complex plugin in a few hours. Because of these advantages I’d used it already in RampUp, but I wanted to have something which could live on its own. That’s how Padded was born.

Pad me please

Padded uses a very simple technique of adding a dozen additional fields. According to the test cases provided, they are sufficient to provide enough space to prevent overlapping with another object on the same cache line. All you need to do is:

  1. install Padded (here you can find the NuGet package) in the project that requires padding,
  2. declare one attribute in your project:

    namespace Padded.Fody
    {
        public sealed class PaddedAttribute : Attribute { }
    }

  3. mark the classes that need padding with this attribute, as shown below.
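Once the attribute is declared, usage boils down to this (illustrative type):

using Padded.Fody;

// Illustrative usage: a type whose instances are hot, thread-local spots
// gets padded so that neighbors can't share its cache line.
[Padded]
public sealed class Counter
{
    public long Value; // written heavily by a single thread
}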

Summary

Marking a class/struct with one attribute is much easier than dealing with its layout using .NET attributes, especially as they were not created for this purpose. Using a custom, small tool to get the needed result is the way to go. That’s how & why Padded was provided.

LayoutKind.Sequential not

If you want to write a performant multithreaded application, which is actually an aim of RampUp, you have to deal with padding. The gains can be pretty big, considering that working with threads at all means you need to give them their own spaces to work in.

False sharing

False sharing is nothing more than two or more threads trying to use memory that’s mapped to a single cache line. The best case for any thread is to have its own memory space separated, and by separation I mean having enough padding on the right and on the left to keep the spaces of two threads from overlapping. The easiest way is to add an additional 64 bytes (the size of a cache line) at the end and at the beginning of the struct/class, to ensure that no other thread will be able to use memory close enough. This mechanism is called padding.

Padding

The easiest way to apply padding is to use the StructLayoutAttribute. If LayoutKind.Sequential is used, then adding 4 Guid fields at the beginning and 4 Guid fields at the end should work just fine. The size of a Guid is 16 bytes, which gives us the needed 64 bytes. A harder way of doing it is using LayoutKind.Explicit, as it requires adding a FieldOffsetAttribute to every field of the structure/class, explicitly stating its offset in memory. With this approach, it’s easy to start at offset 64 and leave some space at the end of the class.
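The sequential flavor could look like this (a sketch):

using System;
using System.Runtime.InteropServices;

// A sketch of the sequential flavor: 4 Guids (4 x 16 = 64 bytes) on each side
// keep the hot field a full cache line away from any neighboring object.
[StructLayout(LayoutKind.Sequential)]
public class PaddedCounter
{
    private Guid _pad1, _pad2, _pad3, _pad4; // 64 bytes of leading padding

    public long Value;                       // the contended field

    private Guid _pad5, _pad6, _pad7, _pad8; // 64 bytes of trailing padding
}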

Problem

LayoutKind.Sequential works perfectly. Almost. Unfortunately, if any field has a type whose layout is not Sequential or Explicit, the CLR will simply ignore the sequential requirement and silently apply automatic layout, ruining the padding. This is the regular case: all classes use Auto layout by default. Unfortunately, it leaves the developer with the need to apply the field offsets manually.
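The silent fallback can be sketched like this: a single reference-typed field is enough for the CLR to ignore the request (illustrative code):

using System;
using System.Runtime.InteropServices;

// Illustrative: the object field below makes the CLR fall back to automatic
// layout despite the attribute, so the padding fields may be reordered and
// the careful 64-byte margins are silently lost.
[StructLayout(LayoutKind.Sequential)]
public class BrokenPadding
{
    private Guid _pad1, _pad2, _pad3, _pad4;

    public object State; // a reference-type field: layout becomes Auto

    private Guid _pad5, _pad6, _pad7, _pad8;
}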

Solution

As I need this padding behavior for RampUp, I’m creating a small Fody weaver plugin called Padded, which will automatically calculate offsets (possibly with some memory overhead) for any class/struct marked with a proper attribute. Hopefully it will be useful not only for RampUp but for more performance-oriented projects.