Web API caching done wrong

I read/watch a lot of the material published on InfoQ. In the majority of cases I enjoy it and find it valuable. Recently I read an article about Web APIs and the Select N+1 problem, and it lacks the very basic information one should provide when writing about the web and HTTP performance.
The post discusses structuring your Web API so that resources provide links/identifiers to other resources the client should query to get the full information. It’s easy to imagine that returning a collection of identifiers, for example the ids of the books belonging to a given category, can bring many more requests to your server. A client iterating over the books will hit your app one request at a time, performing anything from an accidental load test to a fully developed DoS. The article’s answer to this problem is given in the following points:

  • Denormalize and build read models
  • Parallelising calls
  • Using Async patterns
  • Optimising threading model and network throttles

What is missing is the basic HTTP mechanism provided by the specification: cache headers and ETags. There’s no mention of properly tagging your responses so that you can return 304 Not Modified when a client asks for data that hasn’t changed. HTTP caching and its expiration aren’t mentioned either. Recently Greg Young posted a great article about leveraging HTTP caching. The quote that best sums up his whole take on it:

This is often a hard lesson to learn for developers. More often than not you should not try to scale your own software but instead prefer to scale commoditized things. Building performant and scalable things is hard, the smaller the surface area the better. Which is a more complex problem a basic reverse proxy or your business domain?

Before getting into fancy caching systems, understand your responses: cache forever what isn’t changing, and ETag with a version the things that may change. Then, when you actually hit a performance issue, turn to more complex solutions.
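A minimal sketch of that advice in an ASP.NET Web API controller might look like the following. The BooksController, the Book type and its Version property are made up for illustration; the point is only the If-None-Match check, the 304 response and the ETag/Cache-Control headers.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Web.Http;

public class BooksController : ApiController
{
    public HttpResponseMessage GetBook(int id)
    {
        var book = FindBook(id); // hypothetical data access, replace with your own

        // version the representation; any value that changes with the resource will do
        var etag = new EntityTagHeaderValue("\"" + book.Version + "\"");

        // the client already has this version: answer with 304 and no body
        if (Request.Headers.IfNoneMatch.Contains(etag))
            return Request.CreateResponse(HttpStatusCode.NotModified);

        var response = Request.CreateResponse(HttpStatusCode.OK, book);
        response.Headers.ETag = etag;

        // let clients and intermediaries reuse the response for a while
        response.Headers.CacheControl = new CacheControlHeaderValue
        {
            Public = true,
            MaxAge = TimeSpan.FromMinutes(5)
        };
        return response;
    }

    private static Book FindBook(int id)
    {
        return new Book { Id = id, Title = "Sample", Version = 7 };
    }
}

public class Book
{
    public int Id { get; set; }
    public string Title { get; set; }
    public int Version { get; set; }
}
```

With headers like these a reverse proxy or the client’s own cache can answer most of the repeated requests, which is exactly the “scale commoditized things” point from the quote above.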

UPDATE:
For the sake of reference, the author of the InfoQ post responded to my tweet here.

The expert’s real drama

You’re surely familiar with The Expert. It’s popular among developers. Many of my colleagues find it funny to watch the misunderstanding between an illogical business and a logical developer. But there is a real drama hiding in the dark corners of this short: the expert being sold by his own manager/leader as a universal, toolish expert who will deliver. If a manager/leader needs a yes-man, it’s easy to get one. They are cheap. But insisting that a person skilled in a technical field, with some intellect on board, will OK or yes everything put on the table is a serious misunderstanding. The expert has the following choices:

  1. say no, which results in being persecuted
  2. say yes and live with the dissonance
  3. resign

If you ever find yourself in the expert’s position from the movie, I advise the third option. I made the wrong choice a few years ago and this short brought back some memories.

I should’ve used EventStore

One of the most important features of EventStore is the ability to ask questions as if they had been asked in the past. You don’t have to manually rerun all the stored information to repartition it or to aggregate it. All you’ve got to do is write a new projection which will run from the very beginning (almost always: take a look at scavenging) till now.
There’s no question you should have asked earlier that cannot be added later, with no mental overhead of manually rerouting data through the pipeline once again. Nice :)
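EventStore’s own projections are defined in their JavaScript-based projection API, but the idea is easy to show in plain C#. The sketch below is hypothetical (the event types and the stream are made up): a new question is just a new fold over the unmodified history, run from event 0 till now.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// hypothetical events standing in for a real stream
public abstract class Event
{
    public DateTime At { get; set; }
}

public class BookAddedToCategory : Event
{
    public string Category { get; set; }
}

public static class Projections
{
    // a "projection" is a fold over the full, never-rewritten history;
    // a brand new question is answered by running it from the very beginning
    public static Dictionary<string, int> BooksPerCategory(IEnumerable<Event> history)
    {
        var booksPerCategory = new Dictionary<string, int>();
        foreach (var e in history.OfType<BookAddedToCategory>())
        {
            int count;
            booksPerCategory.TryGetValue(e.Category, out count);
            booksPerCategory[e.Category] = count + 1;
        }
        return booksPerCategory;
    }
}
```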

When the business owner does not own

One of the most worrying scenarios of having a business owner is a business owner who does not care. This may manifest in various ways:

  • a delay in communication – you receive responses to your emails after days or weeks, when the context is already gone
  • pressure (even positive) to deliver that varies over time
  • knowledge about the project has to be refreshed, as the basic use cases are being forgotten

The result is a semi-finished, almost-released product. Even if the team delivering the product cares, having no business verification of their ideas may ruin the project. This state may be the result of personal reasons, like not caring at all. More likely, it is rooted in organizational issues which result in switching long- and mid-term goals without pushing down the reasons behind the decisions, keeping people in the zone of the unknown. Whatever the reason, doing a job whose value depreciates over time, with no final DONE, isn’t good for your team’s morale. That’s for sure.

Event sourcing and failure handling

Currently I’m working with a project using event sourcing as its primary source of truth and as its log at the same time (a standard advantage). There are some commands which may throw an exception if a given condition is not satisfied. The exception propagates to the service and, after transformation, is displayed to the user. The fact that an exception was thrown is not recorded as an event. From the point of view of consistency that’s good: no event is appended and the state does not change when an exception occurs. What is lost is the notion of failure.
A simple proposal is to think a bit more before throwing an exception and ending a command with nothing changed. One may append a ThisCriticalCommandFailedEvent containing nothing but the standard event headers (like time, the user performing the command, etc.), or something with a better name, and return a result equivalent to the exception thrown. The event can be used later, when you want to analyze the failures of executed commands.
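A hypothetical sketch of the idea in C#: the names (ReserveSeatsFailed, SeatsReserved) are made up and a List<object> stands in for the real store, but it shows the shape of a handler that records the failure instead of only throwing.

```csharp
using System;
using System.Collections.Generic;

// hypothetical failure event: nothing but the standard headers and a reason
public class ReserveSeatsFailed
{
    public DateTime At { get; set; }
    public string UserName { get; set; }
    public string Reason { get; set; }
}

public class SeatsReserved
{
    public DateTime At { get; set; }
    public string UserName { get; set; }
    public int Count { get; set; }
}

public class ReservationHandler
{
    // an in-memory list standing in for the real event stream
    private readonly List<object> _stream = new List<object>();
    private int _freeSeats = 10;

    // instead of throwing and leaving no trace, append a failure event and
    // return a result the service layer can translate for the user
    public object Handle(string userName, int requested)
    {
        if (requested > _freeSeats)
        {
            var failed = new ReserveSeatsFailed
            {
                At = DateTime.UtcNow,
                UserName = userName,
                Reason = "not enough free seats"
            };
            _stream.Add(failed);
            return failed;
        }

        _freeSeats -= requested;
        var reserved = new SeatsReserved
        {
            At = DateTime.UtcNow,
            UserName = userName,
            Count = requested
        };
        _stream.Add(reserved);
        return reserved;
    }
}
```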

Embracing the domain leads a solution towards event oriented design

One of the most powerful aspects of git is its simplicity. One can easily read the objects chapter of the Git book in one afternoon and learn that git stores nothing more than snapshots of the current state of the items added to the repository. If an item is repeated in multiple commits with no changes, it’s referenced under the same SHA-1 value and it doesn’t have to be stored twice. This decision is well explained by Linus here. The major point of the explanation is that an algorithm for finding changes which is narrow enough and well described (method declaration moved, etc.) can be hard and costly. Storing a snapshot of the state is much easier and lets you run your algorithms much later. It allows the algorithms to evolve while working on an unmodified version of the state at any time.
What git does is store every state you commit. The commit object contains a reference to a tree object, which in turn consists of other objects. This results in the state of every commit in the entire repository history being stored. It means that git never overrides state; all it does is add more and more, with pointers to the parent states/commits. This allows you to run any tool/algorithm through the entire history of any branch.
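A toy content-addressable store shows the mechanism (the class below is illustrative, not git’s actual implementation): identical content always hashes to the same SHA-1, so an unchanged item is stored once and merely referenced by later snapshots.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// a toy content-addressable store: unchanged content hashes to the same key,
// so repeating an item across many "commits" costs nothing extra
public class ObjectStore
{
    private readonly Dictionary<string, byte[]> _objects = new Dictionary<string, byte[]>();

    public string Put(string content)
    {
        var bytes = Encoding.UTF8.GetBytes(content);
        string key;
        using (var sha1 = SHA1.Create())
        {
            key = BitConverter.ToString(sha1.ComputeHash(bytes)).Replace("-", "").ToLowerInvariant();
        }

        if (!_objects.ContainsKey(key))   // stored once, referenced many times
        {
            _objects[key] = bytes;
        }
        return key;
    }

    public string Get(string key)
    {
        return Encoding.UTF8.GetString(_objects[key]);
    }
}
```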
These considerations, rooted in the git repository design, suggest the following paradigms for modelling.

State driven modelling
It’s simple to store state. All you’ve got to do is serialize all the data and put it into the store, but… how many times have you written a system which performs updates? What about the earlier state? Is it preserved? Or is it overridden? I can hope that the previous values are stored in some kind of audit log, but an audit log is not your previous state, is it? It’s not the same. Nathan Marz discusses the fragility of updates in his talk here. Maybe storing a new state with a link to the previous one (no audit log, just the old value of the state) isn’t that bad after all.
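One possible shape of that idea, sketched with a made-up AccountState type: every “update” creates a new immutable snapshot that keeps a reference to its predecessor, so the old value is never lost.

```csharp
using System;

// an immutable state version pointing back at its predecessor:
// nothing is updated in place, the old value stays reachable
public class AccountState
{
    public decimal Balance { get; private set; }
    public AccountState Previous { get; private set; }
    public DateTime At { get; private set; }

    public AccountState(decimal balance, AccountState previous)
    {
        Balance = balance;
        Previous = previous;
        At = DateTime.UtcNow;
    }

    // "updating" produces a new snapshot linked to the old one
    public AccountState WithBalance(decimal newBalance)
    {
        return new AccountState(newBalance, this);
    }
}
```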

Changed events
The second take on modelling would be embracing changes with ___Changed events. You know a property/getter changed its value and it’s good to audit it. Unfortunately, this is what one often meets in solutions which require audit logs. Storing a common ‘name – old value – new value‘ tuple is easy. It may not be that simple to deal with real domain changes, or to run a new algorithm over the state of a given entity through its entire history, but it’s easy to start with. I’d consider it a poor man’s solution for a person who doesn’t want to invest his/her time in learning the domain. One can audit anything with this kind of paradigm. It’s all text after all, isn’t it?
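For contrast, the generic tuple described above might be nothing more than the following (a hypothetical record, just to show how little domain meaning it carries):

```csharp
using System;

// the domain-free audit record: easy to write, hard to reason about later
public class PropertyChanged
{
    public DateTime At { get; set; }
    public string EntityId { get; set; }
    public string Name { get; set; }      // which property changed
    public string OldValue { get; set; }  // "it's all text after all"
    public string NewValue { get; set; }
}
```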

Event sourcing
The last take is event sourcing: capturing the business events which, applied onto the previous state, lead to the next one. This is also mentioned by Linus when he talks about clever algorithms calculating perfect deltas. To get there, to get the event and the transition/change it performs when applied, takes a lot of investment in understanding the domain. I can imagine that perfect events for a git repository of a C# project would contain events like:

  • method renamed
  • method moved
  • functionality added

Of course it may be impossible to provide this kind of information retrieval, but it shows the direction: a domain that is well understood and enriched with the right event types can be described minimalistically with a set of events. A sketch of what such events might look like follows below.
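The events and the fold below are purely hypothetical (no tool emits them today); they only illustrate how domain-level events, applied one by one, carry the state from one version to the next.

```csharp
using System.Collections.Generic;

// hypothetical domain events for the "git of a C# project" example
public abstract class CodeEvent { }

public class MethodRenamed : CodeEvent
{
    public string File { get; set; }
    public string From { get; set; }
    public string To { get; set; }
}

public class MethodMoved : CodeEvent
{
    public string Method { get; set; }
    public string FromFile { get; set; }
    public string ToFile { get; set; }
}

// the current state is nothing more than the previous state with an event applied
public class CodebaseState
{
    public Dictionary<string, List<string>> MethodsPerFile { get; } =
        new Dictionary<string, List<string>>();

    public void Apply(CodeEvent e)
    {
        if (e is MethodRenamed renamed)
        {
            var methods = MethodsPerFile[renamed.File];
            methods[methods.IndexOf(renamed.From)] = renamed.To;
        }
        else if (e is MethodMoved moved)
        {
            MethodsPerFile[moved.FromFile].Remove(moved.Method);
            if (!MethodsPerFile.ContainsKey(moved.ToFile))
            {
                MethodsPerFile[moved.ToFile] = new List<string>();
            }
            MethodsPerFile[moved.ToFile].Add(moved.Method);
        }
    }
}
```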

Git repository as a graph

One of the greatest misunderstandings of git is trying to map it 1-to-1 to SVN. It’s easy to fall into this fallacy when one starts migrating from SVN and what he/she fears the most is losing the precious branches and all the rituals connected with them. Let’s state it from the beginning:

Don’t be stupid. Learn Git. Stop trying to map all the SVN stuff onto the new environment.

A simpler abstraction of a git repo starts from a graph of commits. Consider each git commit object (see here if you have no idea what kinds of objects are present in git) connected with its parents by directed graph edges. A git repository can then be considered a graph. Under normal circumstances that graph is a directed acyclic graph:

  • directed – the parent of a commit is the head of an edge and the commit itself is its tail
  • acyclic – as there are no returns to earlier states, there are no cycles in the history

What is a branch then? It’s a named pointer to a given commit. It’s not a folder nor any existing chain of commits. It’s a pointer sliding in time towards the child commits. Do not attach more meaning to it.

What is a fast forward merge then? It’s a merge that is possible only when one of the two commits being merged can be reached from the other by walking along the graph edges, so the branch pointer can simply slide forward to the second one.
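A toy model of the graph makes both statements concrete. The classes and the reachability check below are illustrative only (git’s real implementation differs): a branch is just a name plus a commit reference, and a fast-forward merge is just sliding that reference forward when the current tip is an ancestor of the target.

```csharp
using System.Collections.Generic;

// a toy model of the commit graph: each commit only knows its parents
public class Commit
{
    public string Id { get; set; }
    public List<Commit> Parents { get; } = new List<Commit>();
}

// a branch is nothing more than a named pointer to one commit
public class Branch
{
    public string Name { get; set; }
    public Commit Tip { get; set; }
}

public static class Repo
{
    // a fast-forward of 'branch' to 'target' is possible only when the current
    // tip is reachable from 'target' by walking parent edges
    public static bool CanFastForward(Branch branch, Commit target)
    {
        var toVisit = new Stack<Commit>();
        var seen = new HashSet<Commit>();
        toVisit.Push(target);
        while (toVisit.Count > 0)
        {
            var commit = toVisit.Pop();
            if (commit == branch.Tip) return true;
            if (!seen.Add(commit)) continue;
            foreach (var parent in commit.Parents) toVisit.Push(parent);
        }
        return false;
    }

    public static void FastForward(Branch branch, Commit target)
    {
        if (CanFastForward(branch, target))
        {
            branch.Tip = target;   // merging here is just sliding the pointer
        }
    }
}
```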

One can try to preserve a linear history on a single branch, but since a directed acyclic graph defines only a partial order, that may be impossible. Remember to tag the important moments in the history of a repository. Once the branch pointer has been moved away, there may be no turning back – a merge commit doesn’t have distinguishable parents.

To sum up: don’t use improper abstractions over git. A repository should be considered a directed acyclic graph. Use tags as breadcrumbs to mark the important events in history.
And use git, for Linus’ sake!