Data has no format

I need to be able to store 1GB of JSON

I’d like to push XML at 100 MB/s to this Azure blob

I need to log this data as CSV

Statements like these are sometimes true, but in the majority of cases the format is not a given: it is a part of designing your architecture/application, or of redesigning it if needed. Selecting a proper format can lower the size of your data, increasing the throughput of your system if a medium like a disk or a network is saturated. That’s why systems like Apache Arrow or Google’s Dremel use their own formats. That’s why you may consider using protobuf-net serialization for EventStore, disabling its built-in v8 projections and lowering the size of events at the same time. For low-latency systems you can choose the new Simple Binary Encoding library. That’s why sometimes storing data in another format is simply better. I’ve written a blog post, Do we really need all these data transformations, and it doesn’t state the opposite. It’s all about making a rational and proper choice of the storage format, taking into consideration its different aspects and its influence on your system. With this one decision you might improve your system’s performance greatly.
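As a rough illustration of the size argument, here is a minimal protobuf-net contract; the event type and its fields are made up for the example. The binary output carries field numbers instead of property names, which is one reason it is typically a fraction of the size of the equivalent JSON:

using System.IO;
using ProtoBuf;

// A hypothetical event; only field numbers, not names, go over the wire.
[ProtoContract]
public class OrderPlaced
{
    [ProtoMember(1)] public int OrderId { get; set; }
    [ProtoMember(2)] public decimal Total { get; set; }
}

public static class Example
{
    public static byte[] Serialize(OrderPlaced e)
    {
        using (var stream = new MemoryStream())
        {
            Serializer.Serialize(stream, e);
            return stream.ToArray(); // a compact binary payload
        }
    }
}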

IS vs HAS relationship for your API

There is an urge to make things work automagically. For instance, when you have a DDD Aggregate, one could consider automatically publishing all its commands as the service API. As the Aggregate is a part of a model and a language which you agreed to use, that seems to be a perfect match for your API. Is it?

IS vs HAS

There is a rule of composition over inheritance. It says that instead of deriving from different components, one should compose bigger parts from already existing ones by using them, not by deriving from them. A good example might be a user and an employee. As an employee, you are given a user in the system. You might try to model this with derivation in mind. Is a user an employee as well, or the other way around? There’s no good answer to this.

You could model it in another way. There is an employee that a user has access to. When a user logs in, he/she can access the data of the employee attached to him/her. You can see where this is going. Keep things minimal, use other elements, but do not introduce a relation of being something.
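A minimal sketch of the HAS version (the names are illustrative):

// Composition: a user HAS an employee record; no derivation involved.
public class Employee
{
    public string Department { get; set; }
}

public class User
{
    public string Login { get; set; }

    // The employee data this user can access after logging in;
    // null when the user is not an employee at all.
    public Employee Employee { get; set; }
}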

A slightly abusive allegory

Now ask yourself a question. Does your API use the model, or is it the model? In the majority of cases the interfaces of your API and your model may be aligned, but they are not the same! Even if you publish operations named after the model you established, you’d like your API to use the model, just in case the domain gets remodeled. It’s good to automate and not write much code. On the other hand, it’s good to have proper abstractions separating the concerns of two different worlds.
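One way to keep that separation, sketched with made-up names: the API accepts its own request type and translates it into the model’s command, so remodeling the domain does not break the published contract.

using System;

// Published API contract: versioned, stable.
public class RenameAccountRequest
{
    public string AccountId { get; set; }
    public string NewName { get; set; }
}

// Internal model command: free to change when the domain is remodeled.
public class RenameAccount
{
    public Guid Id { get; }
    public string Name { get; }
    public RenameAccount(Guid id, string name) { Id = id; Name = name; }
}

public class AccountsApi
{
    public void Rename(RenameAccountRequest request)
    {
        // The API USES the model; it is not the model itself.
        var command = new RenameAccount(Guid.Parse(request.AccountId), request.NewName);
        // ... dispatch the command to the aggregate here
    }
}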

Producer – consumer relationship

In the last post about the RampUp library I covered one of its foundations: IRingBuffer. Now I’d like to describe the contract it fulfills.

Producer consumer

If you take a look at the IRingBuffer you’ll see Write/Read methods. These two are responsible for producing/consuming, or writing/reading, messages to the buffer in a FIFO way. What are the guarantees behind such an interface? What about concurrent access to this structure?

Multi multi

The easiest way to distinguish access patterns is to consider whether the structure can be accessed by one or more threads. If you consider producers and consumers separately, you’ll see that there are four options:

  1. SPSC – Single Producer Single Consumer – only one thread produces items, and another consumes them
  2. MPSC – Multi Producer Single Consumer – multiple threads may produce items in a safe manner; again, there’s a single consuming thread
  3. SPMC – Single Producer Multi Consumer – this could be treated as a distributor of work
  4. MPMC – Multi Producer Multi Consumer – multi/multi; ConcurrentQueue is a good example of it

Unfortunately, nothing comes for free. If you want to get multi on either end, you’ll pay the price of handling contention on that end. On the other hand, if you want to design a system where queues provide transport between different parts of the system, you’ll need to enable multiple producers for sure, as there’s going to be more than one system element. If you want to process items in the order they appeared and leave the locking issues aside, writing fast single-threaded code, a single consumer with a single worker thread is the way to go. Of course, this worker thread may access other queues and produce items for them (hence, multi producer is needed).

The ring buffer implementation in RampUp provides exactly MPSC behavior, as it’s prepared to handle items in order, by a single thread.
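To make the MPSC pattern concrete, here is a rough sketch built on ConcurrentQueue, which is safe for multiple producers: many threads enqueue, while a single worker thread drains and processes items one at a time. This is not the RampUp API itself (the ring buffer exposes a different, allocation-conscious interface); it only illustrates the access pattern.

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class MpscExample
{
    public static void Main()
    {
        var queue = new ConcurrentQueue<int>();
        using (var done = new CancellationTokenSource())
        {
            // Multiple producers: safe, the queue handles the contention.
            var producers = Task.WhenAll(
                Task.Run(() => { for (var i = 0; i < 100; i++) queue.Enqueue(i); }),
                Task.Run(() => { for (var i = 100; i < 200; i++) queue.Enqueue(i); }));

            // Single consumer: items are handled by one thread, in FIFO order,
            // so the handling code needs no locking. (Busy-spinning is fine
            // for a sketch; a real system would park or wait.)
            var consumer = Task.Run(() =>
            {
                while (!done.IsCancellationRequested || !queue.IsEmpty)
                {
                    if (queue.TryDequeue(out var item))
                        Console.WriteLine(item);
                }
            });

            producers.Wait();
            done.Cancel();
            consumer.Wait();
        }
    }
}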

Natural identifiers as subresources in RESTful services

There’s a project I’m working on which provides a great study of legacy. Besides the code, there’s a database which frequently uses complex natural keys consisting of various values. While modelling with complex natural keys may feel natural, when it comes to putting a layer of REST services on top of that structure, a question may be raised: how should the identifiers be modelled in the API? Should these natural keys leak into the API?

REST provides various ways of modelling an API. One of them is REST subresources, which are represented as URIs with additional identifiers at the end. A subresource is nothing more than an identified subpart of the resource itself. Having said that, and taking as an example a complex natural key consisting of two values, <Country, City>, how could one model access to cities? (For the sake of this example I assume that there are cities all around the world having the same name but being in different countries, and that all the cities in a given country have distinct names.) How could one provide a URI for that? Is the following the right one?

/api/country/Poland/city/Warsaw 

The API shows Warsaw as a Polish city. That’s true. This API has that nice notion of being easy to consume and navigate. Consider the following example:

/api/city/Poland,Warsaw

Now it’s a bit uglier; both the country and the city name are at the end. This is a bit different for sure and says nothing about the country accessible under /api/country/Poland. The question is: which is better?

Let me abuse the DDD Aggregate term a bit and follow its definition. Are there any operations that can be performed against the city resource/subresource that do not change the state of the country? If yes, then in my opinion modelling your API with subresources says something totally different: hey, this is a city, a part of this country; it’s a subresource and should ALWAYS be treated as a part of the country. Consider the second take. This one presents a city as a standalone resource. Yes, it is identified by a complex natural key consisting of two dimensions, but this is a mere implementation detail. Once usual identifiers like int or Guid are introduced, the API won’t change that much; or even better, the API could accept both of them, keeping the older combined id for consumers that don’t want to change their usage (easier versioning).
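A sketch of the “accept both” idea (the type and its members are made up for the example): the identifier can be parsed from either the legacy <Country,City> composite key or a surrogate Guid, so existing consumers keep their URIs.

using System;

// Identifies a city either by the legacy natural key or by a surrogate Guid;
// which one is used stays an implementation detail behind the URI.
public struct CityId
{
    public Guid? Surrogate { get; }
    public string Country { get; }
    public string City { get; }

    private CityId(Guid? surrogate, string country, string city)
    {
        Surrogate = surrogate;
        Country = country;
        City = city;
    }

    // Accepts "Poland,Warsaw" as well as a Guid, easing versioning.
    public static CityId Parse(string raw)
    {
        if (Guid.TryParse(raw, out var id))
            return new CityId(id, null, null);

        var parts = raw.Split(',');
        if (parts.Length != 2)
            throw new FormatException("Expected 'Country,City' or a Guid.");
        return new CityId(null, parts[0], parts[1]);
    }
}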

To sum up: do not leak your internal design, whether it’s a database design or an application design. Present your user a consistent view, grouping resources under the wings of transactional consistency.

Feature oriented design gone wrong

The fourth link in my Google search for ‘feature toggle’ is a link to this Building Real Software post. It’s not about feature toggles as described by Martin Fowler. It’s about feature toggles gone wrong.

If you consider toggling features with flags and apply it literally, what you get is a lot of branching. That’s all. Some tests have to be written twice to handle the positive and the negative scenario of each branch. The reason for this is a design not prepared to handle toggling properly. In the majority of cases, it’s a design which is not feature-based on its own.

A feature-based design is created on the basis of closed components which handle a given domain aspect. Some of them may be big, like ‘basket’; some may be much smaller, like ‘notifications’ reacting to various changes and displaying needed information. The important thing is to design the features as closed components. Once you have done it this way, it’s easier to think about the page without notifications or ads. Again, disabling a feature is not a mere flag thrown into different pieces of code. It’s disabling or replacing the whole feature.

One of my favorite architecture styles, event-driven architecture, helps in a great manner to build this kind of toggle. It’s quite easy to simply… not handle an event at all. If you consider the notifications: when they are disabled, they simply do not react to events like ‘order-processed’, etc. Not creating cycles of dependencies is a separate story; still, if you consider the reactive nature of connections between features, that’s a great enabler for introducing toggling, with all the advantages one can derive from it with A/B tests and canary releases in mind.
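A minimal sketch of the idea (the toy bus and the event name are made up): toggling the notifications feature off means its handler is simply never subscribed, so no branching leaks into the ordering code.

using System;
using System.Collections.Generic;

// A toy in-process event bus; a real system would use a proper message bus.
public class EventBus
{
    private readonly List<Action<string>> _handlers = new List<Action<string>>();
    public void Subscribe(Action<string> handler) => _handlers.Add(handler);
    public void Publish(string @event) => _handlers.ForEach(h => h(@event));
}

public static class Composition
{
    public static EventBus Build(bool notificationsEnabled)
    {
        var bus = new EventBus();
        // Toggling off = not wiring the feature at all; no if-s elsewhere.
        if (notificationsEnabled)
            bus.Subscribe(e => Console.WriteLine($"notify: {e}"));
        return bus;
    }
}

// Composition.Build(false).Publish("order-processed"); // nothing reacts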

I’m not a fanboy of feature toggling, though I consider it an important tool in an architect’s arsenal.


One deployment, one assembly, one project

Currently, I’m working with some pieces of legacy code. There are good old-fashioned DAL and BLL layers which reside in separate projects. Additionally, there is a common project with all the interfaces one could need elsewhere. The whole solution is deployed as one solid piece, without any of the projects being used anywhere else. What is your opinion of this structure?

To my mind, splitting one solid piece into non-functional projects is not the best option you can get. Another approach which fits this scenario is feature orientation, with one project in the solution to rule them all. An old rule, ‘the deeper you get in a namespace, the more internal you become’, is the way to approach feature cross-referencing. So how could one design such a project?

  • /Project
    • /Admin
      • /Impl
        • PermissionService.cs
        • InternalUtils.cs
      • Admin.cs (entity)
      • IPermissionService.cs
    • /Notifications
      • /Email
        • EmailPublisher.cs
      • /Sms
        • SmsPublisher.cs
      • IPublisher.cs
    • /Registration
I see the following advantages:

  • If any of the features requires a reference to another, it’s an easy thing to add.
  • There’s no need to think about where to put an interface if it is going to be used in another project of the solution.
  • You don’t onionate all the things. Instead, there are top-to-bottom pillars which one could later transform into separate services if needed.
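A sketch of the namespace convention, using names from the tree above: the feature exposes its contract at the top of its namespace, while the implementation sits in Impl and is kept internal to the assembly.

namespace Project.Admin
{
    // The feature's public surface lives at the top of its namespace.
    public interface IPermissionService
    {
        bool IsAllowed(string user, string action);
    }
}

namespace Project.Admin.Impl
{
    // The deeper the namespace, the more internal the type:
    // other features reference IPermissionService, never this class.
    internal class PermissionService : Project.Admin.IPermissionService
    {
        public bool IsAllowed(string user, string action) => true; // stub
    }
}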

To sum up: you can deal with features oriented toward the business, or with layers oriented toward programming concerns. What would you choose?

Do we really need all these data transformations?

Applications have layers. It’s still pretty common to see an enterprise application being built with layers like DAL, Business Logic (or Domain), Services, etc. Let’s not discuss this abomination itself. Let us rather consider the flow of the data within the application.

SELECT * FROM
That’s where the data are stored. Let us consider a good old-fashioned SQL Server. To get the data from the database you may use ADO (oh no!) or any of the newer ORMs, including micro ORMs like Dapper or something similar. What you end up with is probably some kind of an object, or an object collection. Here’s where you start playing with data.
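For instance, with Dapper the round trip might look like the sketch below (the table, the type and the connection string are invented for the example); already at this point you hold freshly allocated objects:

using System.Collections.Generic;
using System.Data.SqlClient;
using Dapper;

public class City
{
    public string Country { get; set; }
    public string Name { get; set; }
}

public static class CityRepository
{
    public static IEnumerable<City> LoadAll(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            // Dapper materializes each row into a new City instance.
            return connection.Query<City>("SELECT Country, Name FROM Cities");
        }
    }
}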

Mappings
It doesn’t matter whether you’re using Automapper or map the data on your own. For encapsulation purposes, or to get an immutable version of an object, it’s common to copy its values to a new representation. I know that strings are immutable and will be copied by reference, but you still copy those references as well.
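The copying the paragraph above talks about boils down to something like this, hand-rolled for clarity and reusing the City type from the previous sketch; Automapper produces an equivalent field-by-field copy:

// Every layer hop allocates a new object and copies the values over.
public class CityDto
{
    public string Country { get; set; }
    public string Name { get; set; }
}

public static class CityMappings
{
    public static CityDto ToDto(City entity) => new CityDto
    {
        Country = entity.Country, // the string reference is copied, not the chars
        Name = entity.Name
    };
}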

Services
So you’ve got your data mapped to the right model. Now you can return it from your service. Oops, it’s a fancy REST service, and you translate the very same data again. Now, because it’s a browser asking and you use content negotiation, the data are transformed to JSON.

In onion architectures you can meet even more transformations between layers; mappings from DTOs to DTOs are quite common. The question, not only from the architecture point of view but also from the performance-oriented angle, is the same: what are you doing? Why do you want to spend plenty of time writing all these mappings? Why do you want to melt the CPU in never-ending mappings? Can you not skip all of these? Why not store JSON in the database, or use a database that supports JSON blobs as a first-class citizen (RavenDB, MongoDB), and simply push the content retrieved from the database right to the output stream?

All the thoughts above have been provoked by services I’m creating now. Long story short, they store objects serialized with Google Protocol Buffers. When an external system accesses an object, the service just copies the blob, without deserialization, right to the output stream. No deserialization, no allocations, no overhead. Simple and brutally fast.
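A rough sketch of that pass-through (how the blob is fetched is elided; the essence is a single Stream.CopyToAsync): the stored, already-serialized bytes go straight to the response, and no object graph is ever rehydrated.

using System.IO;
using System.Threading.Tasks;

public static class BlobPassThrough
{
    // Copies the stored payload directly to the output stream.
    // No deserialization, so no allocations beyond the copy buffer.
    public static async Task WriteAsync(Stream storedBlob, Stream response)
    {
        await storedBlob.CopyToAsync(response);
    }
}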

Next time you come up with an onion design or layers of transformations, ask yourself whether it is worth it and whether you can pay the price of doing all these mappings.