Anomalies: Listening to your secondaries with Service Fabric

This is the second post in the series describing different anomalies you may run into using modern databases or other storage systems.

Just turn this on

This story begins much like the last one. It starts when one of the developers working on a project built with Service Fabric finds the property ListenOnSecondary and enables it. After all, if every node in my cluster can now answer queries sent by other parts of the system, that should be good, right? I mean, it’s even better than good! We’re faster now!

Replication

To answer this, we need to dive a bit deeper into how Service Fabric’s internal storage works. Service Fabric provides clustered storage. To ensure that your data is properly copied, it uses a replication protocol. At any single moment, there is only one active master: the copy that accepts all the write and read operations and replicates its data to all the secondary replicas. For various reasons, the replicas the data is copied to are not always up to date. To give an example, imagine that we sent three commands to Service Fabric, each writing a different piece of data. Let’s take a look at the state:

  • master: cmd1, cmd2, cmd3
  • replica2: cmd1, cmd2
  • replica3: cmd1, cmd2, cmd3

Eventually, replica2 will receive the missing cmd3, but depending on your hardware (disks, network), there can be a constant small lag during which some of the operations have not been replicated yet.
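The lag can be sketched with a toy model, assuming a master that replicates asynchronously to a secondary. The names and the one-command lag are purely illustrative, not Service Fabric’s actual replicator:

```csharp
using System;
using System.Collections.Generic;

class Replica
{
    public readonly List<string> Log = new List<string>();
}

class Cluster
{
    public readonly Replica Master = new Replica();
    public readonly Replica Replica2 = new Replica();

    // Writes land on the master immediately; replication to the secondary
    // is delayed, modelled here as "everything except the latest command".
    public void Write(string cmd)
    {
        Master.Log.Add(cmd);
        Replica2.Log.Clear();
        Replica2.Log.AddRange(Master.Log.GetRange(0, Master.Log.Count - 1));
    }
}
```

After writing cmd1, cmd2 and cmd3, the master holds all three commands while the secondary still misses cmd3, exactly the state shown above; a client reading from the secondary would not see the latest write.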

Now, after seeing this example of how replication works and noticing that the state on replicas might occasionally be stale, can we turn on ListenOnSecondary that easily?

It depends (TM)

There is no straight answer to this. If your user first calls an action that might result in a write and then, almost immediately, queries for the data, they might not see their own writes, which are replicated with some lag.

If your writes are not followed by reads, or you always cheat by updating the user’s view as if the data had been read from the store, then you might not run into a problem.

Unfortunately, before switching on this small flag, you should think through the concerns raised above.

Wrapping up

Unfortunately for us, we’ve been given a very powerful option, enabled with a single method call. Now we can allow reading potentially stale data to gain bigger query throughput. It’s still up to us whether we want to do it and whether we can do it, given the environment and the architecture our solution lives in.

Anomalies: Snapshot Isolation

This post starts a short series about different anomalies you may run into using modern databases or other storage systems.

Snapshot Isolation to the rescue

Imagine a system that frequently deals with database locks and transactions that run much too long because of the locks being taken. Imagine that someone applies a magic fix, simply changing the isolation level to snapshot isolation. Finally, the app is working, throwing an exception from time to time. The owners are happy, until they find that somehow users are able to write more data than they are allowed to. The investigation starts.

What are you made of, Snapshot Isolation?

If you wonder what Snapshot Isolation means, the description is quite simple. Instead of taking locks on rows and checking whether or not a row can be locked, updated, etc., every row is now versioned. To simplify, imagine that a date is added to every row whenever it is modified. Now, whenever a transaction starts, it is assigned a date that creates a visibility boundary for newer records. Consider the following example:

  1. BEGIN TX1
  2. BEGIN TX2
  3. TX2: INSERT row1 INTO tableA
  4. COMMIT TX2
  5. TX1: SELECT * from tableA

the last statement won’t return row1, as it was committed after transaction TX1 started. Simple and easy, right? You can read only rows that were committed before your transaction began. What can go wrong, then?
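The visibility rule can be sketched in a few lines. This is a simplified model with logical timestamps standing in for dates; none of the names below is a real database API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Row
{
    public string Name;
    public long CommittedAt; // logical commit timestamp, stands in for the date
}

class Snapshot
{
    readonly long startedAt;
    readonly List<Row> table;

    public Snapshot(long startedAt, List<Row> table)
    {
        this.startedAt = startedAt;
        this.table = table;
    }

    // SELECT * as seen by this transaction: only rows committed
    // strictly before the transaction started are visible.
    public List<Row> Select() =>
        table.Where(r => r.CommittedAt < startedAt).ToList();
}
```

Replaying the steps above: TX1 starts at timestamp 1, row1 is committed at timestamp 2, so TX1’s Select returns nothing.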

Write skew

Now imagine a blogging service that allows only 5 posts per user. Let’s consider a situation where a user has two employees entering posts for them. Additionally, let’s assume that there are already 4 posts.

  1. BEGIN TX1
  2. BEGIN TX2
  3. TX1: SELECT COUNT(*) FROM Posts returns 4
  4. TX2: SELECT COUNT(*) FROM Posts returns 4
  5. TX1: INSERT post5a INTO Posts
  6. TX2: INSERT post5b INTO Posts
  7. COMMIT TX1
  8. COMMIT TX2

As you can see, both transactions read the same number of posts, 4, and each was able to add one more. Unfortunately for the owners of the portal, their users now know that by issuing multiple requests at the same time, they can do much, much more without paying for additional entries.

This anomaly is called write skew.
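The reason both commits succeed is that snapshot isolation detects only write-write conflicts on the same row. A minimal sketch of that first-committer-wins rule, with all names purely illustrative:

```csharp
using System;
using System.Collections.Generic;

class RowVersions
{
    public readonly Dictionary<string, long> Versions = new Dictionary<string, long>();

    public long VersionOf(string row) =>
        Versions.TryGetValue(row, out var v) ? v : 0;
}

class WriteTx
{
    readonly Dictionary<string, long> writeSet = new Dictionary<string, long>();

    // Remember the version each written row had when we touched it.
    public void Write(RowVersions db, string row) =>
        writeSet[row] = db.VersionOf(row);

    public bool TryCommit(RowVersions db)
    {
        foreach (var pair in writeSet)
            if (db.VersionOf(pair.Key) != pair.Value)
                return false;                       // conflicting write: transaction killed

        foreach (var pair in writeSet)
            db.Versions[pair.Key] = pair.Value + 1; // publish our writes
        return true;
    }
}
```

With this rule, post5a and post5b are different rows, so both transactions commit and the write skew happens; forcing both transactions to update one shared row (such as a posts counter) turns the anomaly into a detectable conflict.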

Mitigations

The first mitigation you might think of is a simple update of the number of posts already published. Once a conflicting write is found, the database engine will kill the transaction. Another option is replacing a record with itself. This still qualifies as a conflict and, again, will kill the transaction that commits afterwards. Are there any other tools?

Yes, there are, but they are not available in every database. There’s a special isolation level called Serializable Snapshot Isolation (SSI) that is less than 10 years old. It’s capable of automatically checking whether or not two transactions overlap in a way that one could impact the other. One of the databases capable of doing this is PostgreSQL. Another one is the open-source Spanner clone called CockroachDB. Interestingly, it defaults to SSI, as described here.

Wrapping up

As always, don’t apply things automagically, especially if you deal with isolation levels. If you select one, learn how it works and what anomalies are possible. When thinking about Snapshot Isolation, consider databases that support you with Serializable Snapshot Isolation, which removes the burden of updating rows “just in case” and can actually prove the correctness of your operations.

DevConf 2017

The first, 2017 edition of DevConf has ended. It rocked, both on the social level and the content level. Also, this was the very first time I’ve ever given a presentation in English.

Top 3 talks

I haven’t seen all of the talks, as I was preparing for mine. Frankly speaking, it’s quite hard to both enjoy a talk and be stressed before yours, so I chose a third option and drank a few cups of delicious coffee (no stress, no attendance, just coffee). If I had to choose the top 3 talks (in no particular order), these would be:

What is .NET Standard? by Adam Ralph

An interesting presentation showing the mess before .NET Standard and the beauty of a common interface binding all the platforms together. Good jokes about graphs, some numbers and, last but not least, insightful journeys into the type forwarding that enabled this whole thing to work.

Kudos to Adam.

Domain Driven Design: The Good Parts by Jimmy Bogard

A very interesting presentation about removing the bad parts from DDD, making it focused on the things that matter most, which is…. (you’d better watch the presentation). It’s worth adding that it was beautifully and naturally turned into a story with real projects Jimmy was involved in.

Kudos to Jimmy.

“Cargo Cults” in Building Modern Software Systems by Sebastian Gębski

This is the presentation where some people could feel offended. Or terrified. Or both. Sebastian dissected industry standards and mechanisms, showing how rotten they are on multiple levels. It’s a really heavy topic, and before this presentation I didn’t know things like “Point of View” and others. A very eye-opening presentation.

Kudos to Sebastian.

 

Lockstitch, a mechanical stitch made by a sewing machine

TL;DR

This post sums up my work on SewingMachine and introduces a new project based on Service Fabric, called Lockstitch.

Whys, reasoning and more

Due to various reasons, including some NDA stuff that I cannot share, and after lots of thinking about the way I could push Sewing Machine further, it looks like this project won’t receive much more attention from me. Just to make it clear: this is not related to any “I don’t have time for OSS now” or “omg, nobody likes my project”. It’s simple calculations, followed by a few discussions about the directions this project could head in.

Nature abhors a vacuum, so at the same time as declaring SewingMachine almost dead, I want to bring Lockstitch to the table. It’s again about Service Fabric, it’s again about performance, it’s again about distributed systems done right. Only better, and on a different level. Lockstitch aims to work with the lowest-level fabric component, called the replicator. The overall goals, as I mentioned, stay the same.

If you want to see where it is heading, the list of issues should provide all the information you need.

It’s important to mention Tomasz Masternak, who is a co-author of Lockstitch.

Summary

Good ideas don’t die; they reappear in a different shape. This is why Lockstitch could be treated as SewingMachine++.

Await Now or Never

Intro

This post is a continuation of implementing a custom scheduler for your orchestrations. We saw that the Delay operation either is completed or results in a never-ending task that nobody ever completes. Could we make this easier and provide a better tool for the delay operation?

Complete or not, here I come

The completed Delay operation was realized by


Task.CompletedTask

This is a static, readonly instance of a task that is already completed. If you need to return a completed task because your asynchronous method’s work was done synchronously, this is the best way to do it.

For cases where we don’t want continuations to be run, we used:


new TaskCompletionSource<object>().Task

which of course allocates both the TaskCompletionSource instance and the underlying Task object. It’s not much, but since there are only two states of the continuation, now or never, maybe we could provide a smaller tool for this that does not allocate at all.

Now OR Never

You probably know that you can create your own custom awaitable objects; you don’t have to await only Tasks. Let’s take a look at the following class:


using System;
using System.Runtime.CompilerServices;

public sealed class NowOrNever : ICriticalNotifyCompletion
{
  public static readonly NowOrNever Never = new NowOrNever(false);
  public static readonly NowOrNever Now = new NowOrNever(true);

  NowOrNever(bool isCompleted)
  {
    IsCompleted = isCompleted;
  }

  // The awaitable is its own awaiter.
  public NowOrNever GetAwaiter()
  {
    return this;
  }

  public void GetResult() { }

  public bool IsCompleted { get; }

  // Never called for Now (IsCompleted is true); for Never,
  // the continuation is silently dropped.
  public void OnCompleted(Action continuation) { }

  public void UnsafeOnCompleted(Action continuation) { }
}

This class is awaitable, as it provides three elements:

  1. IsCompleted – for checking whether the operation already finished (so that, when it completed synchronously, the whole machinery for an asynchronous dispatch is not built)
  2. GetAwaiter – to obtain the awaiter that is used to create the asynchronous flow
  3. GetResult – to obtain the result of the awaited operation (here there is none, so it returns void)

Knowing what these parts are for, let’s take a look at the different values provided by the NowOrNever static fields:

NowOrNever | IsCompleted | OnCompleted/UnsafeOnCompleted
Now        | true        | no action
Never      | false       | no action

 

As you can see, the continuation is never invoked at all. For the Never case, that’s exactly what we meant. What about Now? Just take a look above: whenever IsCompleted is true, no continuation is attached and the code executes synchronously. We don’t need to preserve continuations, as there are none.
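A quick sketch of how the two values behave, assuming the NowOrNever class above (the method names here are just for illustration):

```csharp
using System;
using System.Threading.Tasks;

static class Demo
{
    // Awaiting Now completes synchronously: IsCompleted is true, so no
    // continuation is ever scheduled and the method runs straight to its end.
    public static async Task<int> RunNow()
    {
        await NowOrNever.Now;
        return 42;
    }

    // Awaiting Never stops the method forever: IsCompleted is false and the
    // continuation passed to UnsafeOnCompleted is silently dropped.
    public static async Task RunNever()
    {
        await NowOrNever.Never;
    }
}
```

RunNow().Result is available immediately, while the task returned by RunNever() never completes.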

Summary

Writing a custom awaitable class is not a day-to-day activity, but sometimes it can bring a limited benefit. In this NowOrNever case, it allowed us to skip the allocation of a single task, although, yes, the generated async state machine probably takes much more memory than a single task instance would.

Implementing a scheduler for your orchestrations

TL;DR

We’ve already seen here and here that with async-await one can easily sketch an orchestration/saga for any process that should be both robust and resilient. It’s time to take a look at how a scheduler for such a process could be implemented.

Delay with no Task

Usually, when we want to delay an action in an asynchronous flow, we use Task.Delay. This method schedules a continuation with the rest of our code, to be executed after the specified delay. The usage is as simple as:


await Task.Delay(TimeSpan.FromSeconds(1.5));

This is fine when we want to postpone an action for a few seconds, but what about processes that need to be frozen for days? How could you implement that?

First, let us rephrase the delay as a method provided by the base orchestration class (you can always have a base class, can’t you?).


await this.Delay(TimeSpan.FromSeconds(1.5));

With this assumption, we can move forward and take a look at a possible Delay implementation.

Delay for Orchestrations

The whole idea of this orchestration is based on snapshotting its changes as events and making them replayable. In other words, if a failure occurs, the orchestration process should be resurrected on another node with no changes in its flow. This makes the implementation a bit trickier, but it is needed to provide strong foundations for our processes. Let’s take a look at a possible Delay implementation.


protected Task Delay(TimeSpan delay)
{
  // Replayable "now": recorded on the first execution, replayed afterwards.
  var date = GetDateTimeUtcNow();
  var scheduleAt = date + delay;

  ScheduledAt existingDelay;
  if (TryPop(out existingDelay))
  {
    // The delay was already recorded by a previous execution.
    if (existingDelay.Value > context.DateTimeUtcNow())
    {
      // Still too early: end this execution and never run continuations.
      EndCurrentExecution();
      return new TaskCompletionSource<object>().Task;
    }

    // The scheduled moment has passed; continue synchronously.
    return Task.CompletedTask;
  }

  if (scheduleAt <= date)
  {
    // Nothing to wait for.
    return Task.CompletedTask;
  }

  // First execution: record the schedule, then end this run
  // with a never-completing task.
  Append(new ScheduledAt(scheduleAt));
  EndCurrentExecution();
  return new TaskCompletionSource<object>().Task;
}

The first line of this method calls GetDateTimeUtcNow. As you can imagine, this gets the current UTC date. It has one additional property, though. Do you remember that we need this method to be executable multiple times with the same effect? This means that the result of GetDateTimeUtcNow is recorded, and when we enter the orchestration again for any reason, such as a process kill, it provides the same value. Effectively, it is the “now” of the first execution.
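A hypothetical sketch (not the actual implementation) of how such a replayable clock can work: the first execution records the clock value in a journal, while a re-execution consumes the journal instead of asking the clock again.

```csharp
using System;
using System.Collections.Generic;

sealed class ReplayableClock
{
    readonly List<DateTime> journal = new List<DateTime>();
    int position; // how far the current execution has replayed

    // Called at the beginning of every (re)execution of the orchestration.
    public void Restart() => position = 0;

    public DateTime GetDateTimeUtcNow()
    {
        if (position < journal.Count)
            return journal[position++]; // replay: reuse the recorded value

        var now = DateTime.UtcNow;      // first execution: ask the real clock
        journal.Add(now);
        position++;
        return now;
    }
}
```

No matter how many times the orchestration is restarted, the call at a given position always returns the value recorded during the first execution.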

The next step is to calculate the date when the delay should end; this is the moment the next execution is ScheduledAt.

We TryPop a prerecorded event. If the orchestration was already active, it left a trace: an event in the history that we can pop. If there is an entry, we compare it with the current UTC time. If the orchestration should wait longer, we mark it as one that requires ending the current execution and return new TaskCompletionSource<object>().Task, which is effectively a never-ending task. This means that no continuation attached by the caller of this method, whether explicitly or implicitly via await, will ever run!

If there was no prerecorded event and the date the delay is scheduled for is, for some reason, not later than the current date, a completed task is returned. Otherwise, a ScheduledAt event is appended and the current execution is ended with the same pattern: setting a notification that execution will not proceed any further and returning a never-completing task.

Execution status

It is the caller’s responsibility to find out whether the orchestration ended or was scheduled for later execution. This is done by awaiting one of two tasks: either the task of the orchestration itself or a task whose result is set by EndCurrentExecution.


await Task.WhenAny(
  orchestration.Execute(),
  currentExecutionEndingTask);

Summary

We saw how powerful asynchronous flows can be, especially when combined with the optional invocation of scheduled continuations. With simple event recording, we were able to create orchestration tooling that is easy for the end user (a programmer) to use, but still provides interesting and powerful semantics for a time-dependent process.

Top Domain Model: the end

TL;DR

This is the end of mini series related to Top Domain Model. Let’s quickly go through all the topics we’ve covered.

Top Domain Models

  1. In I’m temporal we covered one of the most underused techniques in modelling: adding an explicit temporal dimension to the model.
  2. Reading that I’ve been pivoting all night long brings up a lot of different questions about aggregates and modelling them from different perspectives, selecting the most useful one.
  3. In behaviors, processes and reactions we observed that aggregates on their own are meaningless. Why would one capture all the events if none of them caused a reaction?
  4. The last chapter reminds us not to be so implicit, and that capturing the same value twice or more is sometimes required to build a meaningful and useful model.

Summary

I hope you enjoyed these loosely connected articles about modelling. As always, it’s not about the most realistic models, but the most useful ones.