Git repository as a graph

One of the greatest misunderstandings of git is trying to map it 1-1 to SVN. It’s easy to fall into this fallacy when one starts migrating from SVN and fears, above all, losing the precious branches and all the rituals connected with them. Let’s state it from the beginning:

Don’t be stupid. Learn Git. Stop trying to map all the SVN stuff to the new environment.

A simpler abstraction of a git repository starts from a graph of commits. Consider each git commit object (see here if you have no idea what kinds of objects are present in git) connected with its parents by directed graph edges. A git repository can then be considered a graph. Under normal circumstances that graph is a directed acyclic graph (DAG):

  • directed – a parent of a commit is the head of an edge and the commit itself is its tail
  • acyclic – there are no returns to earlier states, so there are no cycles in the history
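
You can ask git to draw this graph for you. A minimal session (output shortened; the hashes and branch names are illustrative):

    $ git log --graph --oneline --all
    *   9fceb02 Merge branch 'feature'
    |\
    | * a1b2c3d Add feature
    * | d4e5f6a Fix bug on master
    |/
    * 3ac7d01 Initial commit

Each * is a commit (a node of the graph) and the lines between them are the parent edges.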

What is a branch, then? It’s a named pointer to a given commit. It’s not a folder, nor any existing chain of commits. It’s a pointer that slides in time towards the child commits. Do not attach more meaning to it.
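
You can check this yourself: a branch is literally a file containing a commit hash (a sketch assuming a branch named master; the hash is illustrative, and the file may be absent if the ref has been packed into .git/packed-refs):

    $ git rev-parse master
    3ac7d013c9ae7a2b6e0f1d2c3b4a5968e7f60123
    $ cat .git/refs/heads/master
    3ac7d013c9ae7a2b6e0f1d2c3b4a5968e7f60123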

What is a fast forward merge, then? It’s a merge that is possible only if one of the two commits being merged can be reached from the other along a graph path; the branch pointer is simply moved along that path and no merge commit is created.
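
You can even insist on this behaviour (branch names are illustrative):

    $ git checkout master
    $ git merge --ff-only feature
    # succeeds only when master can simply slide forward to feature;
    # otherwise git refuses instead of creating a merge commit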

One can try to preserve a linear history on a single branch, but as a directed acyclic graph defines only a partial order, that may be impossible. Remember to tag important moments in the history of a repository. Once the branch pointer has moved away, there may be no turning back – a merge commit does not mark which of its parents was the important one.
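
Tagging is cheap, so do it before the pointer moves on (the tag name and message are illustrative):

    $ git tag -a v1.0 -m "Release 1.0"
    $ git push origin v1.0

An annotated tag is a git object of its own, so it keeps pointing at the same commit no matter where the branches slide.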

To sum up: don’t use improper abstractions over git. A repository should be considered a directed acyclic graph. Use tags as breadcrumbs to mark important events in history.
And use git, for Linus’ sake!

Merge request policy

If you’re in a company with multiple teams in the IT department, you may have been considering GitHub Enterprise or its free replacement, GitLab. Besides providing a Git hosting experience, both can support you with one aspect previously unknown to your organization: pull/merge requests.

A pull request in GitHub and a merge request in GitLab provide the same functionality. They let other users of the service propose changes in a way that makes them easy to apply onto the original repository. The advantages of this kind of change management over sending diffs by email, or other ways of applying fixes to somebody else’s codebase, are:

  1. Traceability – a request has its own URL, is linkable and is public
  2. Permission granularity – everyone can be given permission to read and fork a repo, but not to write to it. This lets the owners stay owners and deal with issued requests rather than with a codebase broken by others’ mistakes (for particular flows, read here)
  3. True ownership – the owner is released from digging in the dirt and moves into the role of an acceptor
  4. No more emails – email requests are no longer sent. The basic way of asking for a given change is… doing the change (see the sketch after this list)
  5. Learning over abusing – the requester is given an opportunity to work with another codebase. It can cost a bit more, that’s for sure, but it spreads practices and knowledge. The most important thing for an owner is to help others create pull requests, not to apply the changes for them.
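
For the requester the whole flow boils down to a few commands plus one click in the web UI. A sketch, assuming a GitLab instance at gitlab.example.com and a project you have already forked (all names here are hypothetical):

    $ git clone git@gitlab.example.com:you/project.git
    $ cd project
    $ git checkout -b fix-null-check
    # ...edit and test...
    $ git commit -am "Fix null check in the parser"
    $ git push origin fix-null-check
    # then open a merge request from you/project:fix-null-check
    # to owner/project:master in the web UI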

This kind of change can be painful for lazy people who want to delegate, or rather push away, all their work with no insight into the requested changes. But it can be a big learning opportunity as well. Adopting OSS rules like a merge request policy in your day-to-day job can increase your awareness and make you a happier developer.

It’s time to issue some pull requests!

Multi-datacenter Cassandra cluster with a slow cross-DC connection

I’d like to discuss a particular failure scenario for a multi-datacenter Cassandra cluster.
The setup to reproduce it is the following:

  • Two Cassandra data centers
    • DC1: n nodes
    • DC2: m nodes
  • TestKeyspace using NetworkTopologyStrategy with replication factors (see the CQL sketch after this list):
    • DC1: n (each key on each node)
    • DC2: m (each key on each node)
  • Tables in TestKeyspace are created with default settings
  • hinted handoff enabled
  • read repair enabled

Writes and reads go to DC1. What can go wrong when the whole DC2 goes down (or you get a network split)?
It turns out that read repair is governed not by one but by two probabilities:
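
The two per-table options in question (the values shown are merely illustrative):

    read_repair_chance = 0.1
    dclocal_read_repair_chance = 0.0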

What’s the difference between them? The first one sets the probability of a read repair across the whole cluster, the second – of a read repair within the same DC. If you have an occasionally failing, or simply slow, cross-DC connection, using the first one can bring you trouble: a read that triggers a global read repair involves replicas in the remote DC as well. If you plan a multi-DC cluster and can live with periodic runs of nodetool repair instead of having some of your LOCAL_QUORUM reads fail from time to time, switch to the DC-local read repair and disable the global one.
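
A sketch of that switch for a single table (the table name is hypothetical; the values are a common choice, not a universal recommendation):

    cqlsh> ALTER TABLE TestKeyspace.events
       ... WITH read_repair_chance = 0.0
       ... AND dclocal_read_repair_chance = 0.1;

Then schedule something like nodetool repair TestKeyspace to run periodically, e.g. from cron.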

For curious readers: the class responsible for performing reads, together with read repairs, is AbstractReadExecutor.