One of the most powerful aspects of git is it’s simplicity. One can easily read the object chapter of the Git book during one afternoon and learn, that git stores nothing more than snapshots of the current state of the items added to the repository. If an item is repeated in multiple commits with no changes, it’s referred under the same SHA1 value and it doesn’t have to be stored twice. This decision is well explained by Linus in here. The major point of this explanation is that having an algorithm finding changes, narrow enough and well described like method declaration moved, etc. can be hard and costly. To store a snapshot of the state is much easier and let’s you run your algorithms much later. This allows algorithms evolutions working on the unmodified version of the state at any time.
What git does is storing the every state you commit. The commit object contains a reference to a tree object, which consists of other objects. This results in storing state of any commit in the entire repository history. This means that git never overrides the state. All it does is adding more and more with pointers to the parent states/commits. This allows you to run any tool/algorithm through the entire history of any branch.
This considerations rooted in the git repository design can imply following paradigms to modelling.
State driven modelling
It’s simple to store state. All you’ve got to do is to serialize all the data and put into the store, but… how many times you’ve written a system which performs updates? What about the earlier state? Is it preserved? Or maybe it’s overridden? I can hope that previous values are stored in a some kind of an audit log, but it’s an audit log, not your previous state, isn’t it? It’s not the same. Nathan Marz is discussing the fragility of updates in his talk here. Maybe storing a new state with a link to the previous one (no audit log, just the old value of the state) isn’t that bad after all.
The second take on modelling would be embracing the changes with ___Changed events. You know a property/getter changed it’s value and it’s good to audit it. Unfortunately it can be met in solutions which requires audit logs. Storing a common ‘name – old value – new value‘ tuple is easy. It may be not that simple to deal with any domain changes, or get your new algorithm run through the state of a given entity through entire history but it’s easy. I’d consider it a poor man solution for a person which doesn’t want to invest his/her time in learning the domain. One can audit anything with this kind of paradigm. It’s all text after all, isn’t it?
The last take is event sourcing capturing the business events, which applied onto the previous state lead to the next one. This is also mentioned by Linus, when he talks about clever algorithms calculating perfect deltas. To get one, to get the event and the transition/change it emits when applied, it takes a lot of investment into understanding the domain. I can imagine that perfect events for git repository of C# project would contain events like:
- method renamed
- method moved
- functionality added
Of course it is/may be impossible to provide this kind of information retrieval but it shows the way, that well understood and enriched with right events’ types domain can be minimalistically described with a set of events.
Recently I’ve been moving around the topic of a deployment. Imagine a situation you’re being given a set of scripts, or script like objects used to deploy a set of applications. The so-called scripts are from the very basic like create-directory to complex, rooted in an organization infrastructure and tooling. Additionally, some of them are defined as groups of other scrips. For example, installing an application service starts with creation of a directory, then binaries are copied and finally, the service is registered.
The scripts are not covered with tests, but are hardened by years of successful usage. One could consider to rewrite them totally, and provide a full blown set of tests. This may be hard, as you throw away all the knowledge hidden behind scripts. Remember, that there were big companies that are no longer here, take Netscape as example.
I’ve spent quite a while about considering chef, PowerShell, Pupper even the msbuild with its tasks. What helped me to make up my mind was the famous Blue Book. Why not to consider a set of scrips as a bounded context? Just take a look at the picture provided by Martin Fowler here. Wrap all the older scripts in a context bubble providing mapping, mostly intellectual, to all the terms that are needed to be known outside. It’s more than wrapping all old scrips with an interface. There is a need of a real mapping with a glossary to let people which do not want to leave the bubble now exist in it for a while. What tool will be used for the new bounded context communicating with the other? That’s an open question. I’ll try to choose the best tool, with good enough test support. The only real requirement is the ability to provide the mapping to the old deployment tools’ context.
If you want to learn more, just take a loot at the great Eric Evan’s videos under this link: http://dddcommunity.org/?s=four+strategies
In the previous posts a simple mechanism of storing information needed for operation idempotence was introduced. A simple hash table, which state is transactionally saved with the state of object onto which the send operation was applied. How about receiving operations out of order? What if infrastructure (for instance, messaging system) will pass one operation earlier than the second, which in reality occurred earlier?
It’s time to make it explicit and start calling elements in the DDD manner. So for sake of reference, the object considered as the subject of an operation is an aggregate root. The operation is of course a message. The modeling assumes using the event sourcing as a storage for aggregates’ states.
Assume, that the aggregate, which the command is sent to, has a property called Version, incremented with each event applied on. Assume then, each command contains a version number, which is supposed to be equal to the aggregate’s version. If, during dispatch, these two values are different, an exception is thrown and command do not change the state of the aggregate. It’s a simple optimistic concurrency implementation, allowing discarding out of order commands sent to an object.
To make it more interesting, consider a sharded system, where specific aggregates are stored by different nodes (but for each aggregate there is one node where it is stored). An aggregate’s events (state changes) have to be propagated across all the nodes/shards in the same idempotent manner as commands are sent to aggregates. It’s easy to apply hashtable for each node and with using the very same key: aggregateId with version but it would mean storing all the pairs of aggregate identifiers with their versions, which could possibly bring down each of your nodes (or make you use GBs of memory). Can the trivial fact, that version is increased with every event on the aggregate, could be used for some optimization? You’ll see in the next entry.