Events on the Outside versus Events on the Inside

Recently I’ve been revisiting some of my Domain Driven Design, CQRS & Event Sourcing knowledge and techniques. I’ve supported the creation of systems built with these approaches, so I could revisit the experiences I had along the way as well. If you are not familiar with these topics, a good starter could be my Feed Your Head list.

Inside

So you model your domain with aggregates in mind, distilling contexts and domains. The separation between services may be clear or a bit blurry, but it looks OK and, more importantly, maps the business well. Inside a single context bubble, you can use your aggregates’ events to create views and use those views when you need data for a command execution. It doesn’t matter which database you use for storing events. It’s simple: restore the state of an aggregate, gather some data from views, execute a command. If any events are emitted, just store them. A background worker will pick them up and dispatch them to a Process Manager.
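
To make this flow concrete, here is a minimal sketch of such a command execution; all the types used here (ShipOrder, Order, IEventStore, IViewStore, StockView) are invented for the example and are not taken from any specific framework.

using System.Threading.Tasks;

public class ShipOrderHandler
{
    private readonly IEventStore _eventStore;
    private readonly IViewStore _views;

    public ShipOrderHandler(IEventStore eventStore, IViewStore views)
    {
        _eventStore = eventStore;
        _views = views;
    }

    public async Task Handle(ShipOrder command)
    {
        var history = await _eventStore.ReadStream(command.OrderId);   // restore the aggregate state
        var order = Order.Restore(history);

        var stock = await _views.Get<StockView>(command.ProductId);    // gather data from a local view

        var emitted = order.Ship(command, stock);                      // execute the command

        await _eventStore.Append(command.OrderId, emitted);            // store emitted events; a background
                                                                       // worker dispatches them later
    }
}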

Outside

What about exposing your events to other modules? If and how can another module react to an event? Should it be able to build its own view from the data held in the event? All of these could be summed up in one question: should external events match the internals of a specific module? My answer would be: it’s not easy to tell.

In some systems, this may be fine. By the system I mean not only a product, but also a team. Sometimes having a feed of events can be liberating and can enable faster growth by speeding up the initial shaping. You could agree to actually separate services from the very start and verify during the design whether the logical complexity is still low, i.e. whether there aren’t too many events shared between services, and what they contain.

This approach brings some problems as well. All the events become your API. They are public, so now they should be taken into consideration when versioning your schemas. Probably some migration guide will be needed as well. The bigger the public API, the bigger the friction of maintaining it for its consumers.

Having said this, you could consider having a smaller, totally separate set of events you want to share with external systems. This draws a visible line between the Inside & the Outside of your service, enabling you to evolve rapidly on the Inside. Maintaining a stable API is much easier then, and the system itself gains a clear separation. This addresses the questions about views as well: where should they be stored originally? The answer would be to store properly versioned, immutable views Inside the service, using identifiers to pass references to other services. When needed, the consumer can copy & transform the data locally. A separate set of events also gives you the option not to use Event Sourcing where it isn’t needed. Options of that kind (you may use it, but you don’t have to) are always good. The sketch below illustrates the split.
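
As a hedged illustration (all type names here are invented for the example), the internal event can stay rich and free to change, while the external contract stays slim, versioned and identifier-based:

using System;
using System.Collections.Generic;

// Inside the service boundary: full detail, free to evolve with the module.
public class OrderShipped
{
    public Guid OrderId;
    public Guid WarehouseId;
    public Dictionary<Guid, int> QuantitiesByProduct;
    public string CarrierTrackingNumber;
}

// Outside: a separate, stable, versioned contract passing identifiers only.
public class OrderShippedV1
{
    public Guid OrderId;            // the consumer copies & transforms details locally if needed
    public DateTime ShippedAtUtc;
}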

For some time I was an advocate of sharing events between services freely, but now I’d say: make the choice that is proper for your scenario. Consider the pros and cons, especially the schema maintenance tax and the option of not sticking to Event Sourcing.

Inspirations

The process of revisiting my assumptions was started by a few materials. One of them is a presentation by Andreas Ohlund, ‘Putting your events on a diet’, sharing a story about deconstructing an online shop into services. The second is some bits from A Decade of DDD, CQRS, Event Sourcing by Greg Young. Last but not least, Pat Helland’s Data on the Outside versus Data on the Inside.

Single producer single consumer optimizations

The producer-consumer relationship is one of the most fundamental cooperation patterns. Some components produce values or issue requests, and some consume/handle them. Depending on the number of components at each end of this dependency, it’s called a ‘single/multi producer single/multi consumer’ relationship. It’s important to make this choice explicit, because as with every explicit choice, it enables some optimizations. I’d like to share some thoughts on the optimizations taken in the single producer single consumer scenario in the RampUp library, provided by OneToOneRingBuffer.

The behavior of ring buffers in RampUp is ported from Java’s Agrona. They provide a queue that enables sequential reads on the consumer side. The reasoning behind it is that sequential reads are CPU friendly, so the consumer can process messages much faster. For ManyToOneRingBuffer the producing part is quite complex. It proceeds as follows (a rough sketch of the claim step is shown after the list):

  1. check against the consumer position whether there is enough space
  2. allocate a slot in the ring (this is done with Interlocked operations, in a loop, and may take a while)
  3. write a header in an ordered way (using volatile)
  4. put data
  5. write the header again, marking the message as published
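
Here is a rough sketch of that claim step (illustrative code, not the actual Agrona/RampUp implementation), showing why it may take a while under contention:

using System.Threading;

public class MultiProducerClaim
{
    long _head;              // consumer position, advanced by the consumer
    long _tail;              // producer position, shared by all producers
    readonly int _capacity;

    public MultiProducerClaim(int capacity) { _capacity = capacity; }

    // returns the claimed position, or -1 when there is not enough space
    public long ClaimSlot(int lengthInSlots)
    {
        while (true)
        {
            long tail = Volatile.Read(ref _tail);
            long head = Volatile.Read(ref _head);
            if (tail + lengthInSlots - head > _capacity)
                return -1;                                                   // step 1: not enough space
            if (Interlocked.CompareExchange(ref _tail, tail + lengthInSlots, tail) == tail)
                return tail;                                                 // step 2: slot allocated
            // another producer moved the tail first – loop and retry
        }
    }
}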

This brings a lot of unneeded work for a single producer. With a single producer there is nothing to compete with; the only check that needs to be made is that the producer does not overlap with the consumer. So the algorithm looks as follows:

  1. check against the consumer position whether there is enough space
  2. put data
  3. write the header, marking the message as published
  4. write the tail value for future writes

Removing the Interlocked operation and lowering the number of volatile operations improves producer performance greatly (less synchronization). A minimal sketch of this write path follows.
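
The sketch below is a simplified, self-contained illustration of the single-producer path (an invented layout, not the actual RampUp OneToOneRingBuffer): messages live in int slots, and the first slot of a message holds its length and publishes it.

using System.Threading;

public class SingleProducerRingBuffer
{
    readonly int[] _buffer;   // slot 0 of every message holds its length (in slots)
    readonly int _mask;       // capacity must be a power of two
    long _head;               // consumer position, advanced by the consumer only
    long _tail;               // producer position, advanced by this single producer only

    public SingleProducerRingBuffer(int capacityInSlots)
    {
        _buffer = new int[capacityInSlots];
        _mask = capacityInSlots - 1;
    }

    public bool Write(int[] payload)
    {
        long tail = _tail;                                  // plain read: only this thread writes it
        long head = Volatile.Read(ref _head);               // 1. check the consumer position
        int length = payload.Length + 1;
        if (tail + length - head > _buffer.Length)
            return false;                                   //    not enough space

        int offset = (int)(tail & _mask);
        for (int i = 0; i < payload.Length; i++)            // 2. put data
            _buffer[(offset + 1 + i) & _mask] = payload[i];

        Volatile.Write(ref _buffer[offset], length);        // 3. publish the header; the volatile write
                                                            //    makes the data visible to the consumer first
        _tail = tail + length;                              // 4. the tail value for future writes
        return true;
    }
}

No Interlocked operation appears anywhere on this path.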

If you wanted to compare these two on your own, here you are: ManyToOne and OneToOne.

Happy producing (and consuming).

Data has no format

I need to be able to store 1GB of JSON

I’d like to push XML 100 MB/s to this Azure blob

I need to log this data as CSV

Statements like these are sometimes true, but in the majority of cases the format is not a given; it is part of designing your architecture/application, or of redesigning it if needed. Selecting a proper format can lower the size of your data and increase the throughput of your system when a medium like a disk or a network is saturated. That’s why systems like Apache Arrow or Google’s Dremel use their own formats. That’s why you may consider using protobuf-net serialization for EventStore, disabling its built-in v8 projections and lowering the size of events at the same time. For low latency systems you can choose the new Simple Binary Encoding library. That’s why sometimes storing data in another format is simply better. I’ve written a blog post, Do we really need all these data transformations, and this doesn’t state the opposite. It’s all about making rational and proper choices about the storage format, taking into consideration its different aspects and its influence on your system. With this one decision you might improve your system’s performance greatly.
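
As an illustration of how much the format choice can matter, here is a hedged sketch comparing the size of the same event serialized as JSON (with Newtonsoft.Json) and as protobuf-net binary; the event type is invented for the example.

using System;
using System.IO;
using Newtonsoft.Json;
using ProtoBuf;

[ProtoContract]
public class ItemAddedToCart
{
    [ProtoMember(1)] public long CartId { get; set; }
    [ProtoMember(2)] public long ItemId { get; set; }
    [ProtoMember(3)] public int Quantity { get; set; }
}

public static class FormatComparison
{
    public static void Main()
    {
        var e = new ItemAddedToCart { CartId = 1234567, ItemId = 987654, Quantity = 3 };

        var json = JsonConvert.SerializeObject(e);          // human readable, repeats field names
        using (var ms = new MemoryStream())
        {
            Serializer.Serialize(ms, e);                    // compact binary, varint-encoded fields
            Console.WriteLine($"JSON: {json.Length} chars, protobuf: {ms.Length} bytes");
        }
    }
}

On a saturated disk or network, a few-fold difference in payload size translates directly into throughput.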

Shared Resources in TeamCity

It’s a common requirement that a set of your tests depends on some resources. It might be a database or an Azure Storage account. It’s possible that instead of providing TeamCity with an administrator account (giving subscription access for Azure), you’d prefer to have a limited, preexisting set of resources, like databases or Azure Storage accounts, that are leased for the build time by a particular agent. As soon as the build is finished, the resource goes back to the pool to be leased for another build.

Fortunately, TeamCity has a built-in feature for this purpose called Shared Resources. It can be defined at any project level and used as a parameter of any build configuration below it. The Shared Resources feature provides you with all the capabilities mentioned above, removing all the burden of managing a resource pool. In the same way a build leases an agent, an agent leases a shared resource. Nice, simple, easy.

A pointer to a generic method argument

Let’s consider the following method signature taken from a RampUp interface.


bool Write<TMessage>(ref Envelope envelope, 
    ref TMessage message, IRingBuffer bufferToWrite) 
    where TMessage : struct;

It’s a fairly simple signature, enabling passing a struct of any size using just a reference to it, without copying it. Now let’s consider the need to obtain a pointer to this message. Taking a pointer could be needed for various reasons. One could be getting fields by offset, another could be using memcpy to copy the value to a given address. Is it possible to get this pointer in C# code?

No pointers for generic parameters

Unfortunately, you can’t do it in C#. If you try to obtain a pointer to a generic parameter, you’ll get a compiler error. If you can’t do it in C#, is there any other .NET language one could use to get it? Yes, there is: the foundation of .NET programs, MSIL itself. And if it’s MSIL, it means emitting code dynamically.

Ref looks like a pointer

What is a reference to a struct? It looks like a pointer to me. What if we could load it and just assume that it is a pointer? Would the CLR accept this program? It turns out that it would. I won’t cover the whole implementation, which can be found here, but I want to highlight a few points.

  • The CLR uses the argument with index 0 to pass this. If you want to load a field, you need to use the following sequence of operations:
    • Ldarg_0; // load this on the stack
    • Ldfld, “Field1” // pops this, loading the value of the field named “Field1” on the stack
  • For the Write method, getting a pointer to a message is nothing more than emitting one opcode: Ldarg_2. As the struct is passed by reference, it can be treated as a pointer by the CLR, and it will be. A minimal sketch of the same trick is shown below.
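
For a taste of how this looks when emitting code, here is a hedged, self-contained sketch (not the actual RampUp implementation) that builds a tiny dynamic method loading a by-ref argument and reinterpreting it as a native pointer:

using System;
using System.Reflection.Emit;

public delegate IntPtr RefToPointer<T>(ref T value) where T : struct;

public static class PointerTricks
{
    public static RefToPointer<T> Build<T>() where T : struct
    {
        var method = new DynamicMethod("RefToPointer", typeof(IntPtr),
            new[] { typeof(T).MakeByRefType() }, restrictedSkipVisibility: true);
        var il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);  // load the managed reference (the 'ref T' argument)
        il.Emit(OpCodes.Conv_U);   // reinterpret it as a native unsigned integer, i.e. a pointer;
                                   // not verifiable IL, but the CLR executes it
        il.Emit(OpCodes.Ret);
        return (RefToPointer<T>)method.CreateDelegate(typeof(RefToPointer<T>));
    }
}

// usage (the struct lives on the stack here, so its address stays valid for the call):
//   long value = 42;
//   IntPtr ptr = PointerTricks.Build<long>()(ref value);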

I encourage you to download the RampUp codebase and play a little bit with the emitted implementation of the IMessageWriter. Maybe you’ll never need to take a pointer to a generic method parameter (I did), but it’s a good starting point to learn a little about emitting code.

Roslyn coding conventions applied

Roslyn is a ‘compiler as a service’ provided for both VisualBasic.NET & C#. It has a thriving community of people providing new features for .NET languages. One of the most important parts of this community is the guideline on how to contribute, which defines basic rules for coding and issuing pull requests. The most important part, not only from the Roslyn perspective but as general .NET guidance, is the Coding Conventions.

Avoid allocations in hot paths

This is a rule that should be close to every .NET developer’s heart, not only those who work on a compiler. It’s not about ‘premature optimization’. It’s about writing performant code that can actually sustain its performance when executing its hot paths in the majority of requests. Give it a try, and when writing some code next time (today?) keep this rule in mind. Awareness of this kind may save you from profiling your production system or taking a memory dump just to learn that allocating a list for every cell of a two-dimensional array wasn’t the best approach.

What’s your hot path

That’s a good question that everyone should answer on a per-system basis. I asked this question a few months ago for my RampUp library:

what’s the hot path for a system using message passing?

The answer was surprisingly obvious: the message passing itself. EventStore, which uses a similar approach, uses classes for message passing. This, plus every other object, creates some GC pressure. Back then, I asked myself: is it possible to use structs for internal process communication and come up with a good way of passing them? My reasoning was the following: if I remove the GC pressure caused by messages, then I remove the hottest path of allocations, and this can greatly improve the stability of my system. Was it easy? No, it wasn’t, as I needed to emit a lot of code and discover some interesting properties of the CLR. Did it work? Yes, it did. A simplified illustration of the idea follows.
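
The names below are invented for the example (this is not the RampUp API), but they show the difference: a class-based message allocates on every send and feeds the GC, while a struct passed by reference does not, so the hottest path produces no garbage.

public struct PriceChanged
{
    public long InstrumentId;
    public decimal Price;
}

public interface IBus
{
    // mirrors the shape of the Write<TMessage> signature shown earlier: the struct travels by ref
    void Publish<TMessage>(ref TMessage message) where TMessage : struct;
}

public class Ticker
{
    public void OnTick(IBus bus, long instrumentId, decimal price)
    {
        var msg = new PriceChanged { InstrumentId = instrumentId, Price = price };  // no heap allocation
        bus.Publish(ref msg);                                                       // no boxing, no GC pressure
    }
}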

Next time you write a piece of code or design a system, keep the hot path question in mind and answer it. It’s worth it.

The art of benchmarking

I’ve been told that Akka can process 50 million messages per second on a laptop. This isn’t a number you hear every day, even if you write performance-focused applications.

I’ve recently been optimizing my RampUp library and I know that it can perform well, but reaching 50 million messages per second on my 4 hardware cores? That would be a hard thing to do. Possible, maybe, if the test was designed in a way that groups cores somehow… The current official number is 10 million msg/s on my laptop, and the test uses two producers trying to flood a single consumer. It’s a multi producer single consumer scenario. But let’s go back to the Akka benchmark.

The best performance marked with the ‘single machine’ phrase is this. It actually was able to process 48 million messages on a single machine! That’s great. Let’s take a look at what kind of machine that is:

  • Processor: 48 core AMD Opteron (4 dual-socket with 6 core AMD® Opteron™ 6172 2.1 GHz Processors)
  • Memory: 128 GB ECC DDR3 1333 MHz memory, 16 DIMMs
  • OS: Ubuntu 11.10
  • JVM: OpenJDK 7, version “1.7.0_147-icedtea”, (IcedTea7 2.0) (7~b147-2.0-0ubuntu0.11.10.1)
  • JVM settings: -server -XX:+UseNUMA -XX:+UseCondCardMark -XX:-UseBiasedLocking -Xms1024M -Xmx2048M -Xss1M -XX:MaxPermSize=128m -XX:+UseParallelGC
  • Akka version: 2.0
  • Dispatcher configuration other than default: 
    parallelism 48 of fork-join-executor
    throughput as described

It’s not a laptop. It’s not a usual single machine. It’s quite a powerful server, with a special dispatcher configuration used to get this performance.

I’m not saying that it’s bad to use good hardware for your tests. I’m not trying to defend RampUp’s performance, as it does not compete with Akka – it serves different purposes. I’m just saying that providing benchmarks shouldn’t be focused on providing a number only. There is so much more information needed to give the whole background of a test. Again, the way of communicating numbers depends on whether one wants to sell something by providing numbers or to provide real results. If you want the latter, choose your words wisely.