I’d like to discuss a particular failure scenario for a multi datacenter Cassandra cluster.
The setup to reproduce is following:
- Two Cassandra data centers
- DC1: n nodes
- DC2: m nodes
- NetworkTopologyStrategy with replication factors:
- DC1: n (each key on each node)
- DC2: m (each key on each node)
- Tables in TestKeyspace are created with default settings
- hinted hand-off enabled
- read repair enabled
The writes and reads goes to the DC1. What can go wrong when whole DC2 goes down (or you get a network split)?
It occurs that read_repair is defined not by one but two probabilities:
What’s the difference between them? The first one shows probability of read repair across whole cluster, the second – rr across the same DC. If you have an occasionally failing connection, or a slow one using the first can bring you some troubles. If you plan for multi DC cluster and you can live with periodical runs nodetool repair instead of failing some of your LOCAL_QUORUM reads from time to time, switch to dc read repair and disable the global one.
For curious readers the class responsible for performing reads with read-repairs as well is AbstractReadExecutor
Recently I’ve been moving around the topic of a deployment. Imagine a situation you’re being given a set of scripts, or script like objects used to deploy a set of applications. The so-called scripts are from the very basic like create-directory to complex, rooted in an organization infrastructure and tooling. Additionally, some of them are defined as groups of other scrips. For example, installing an application service starts with creation of a directory, then binaries are copied and finally, the service is registered.
The scripts are not covered with tests, but are hardened by years of successful usage. One could consider to rewrite them totally, and provide a full blown set of tests. This may be hard, as you throw away all the knowledge hidden behind scripts. Remember, that there were big companies that are no longer here, take Netscape as example.
I’ve spent quite a while about considering chef, PowerShell, Pupper even the msbuild with its tasks. What helped me to make up my mind was the famous Blue Book. Why not to consider a set of scrips as a bounded context? Just take a look at the picture provided by Martin Fowler here. Wrap all the older scripts in a context bubble providing mapping, mostly intellectual, to all the terms that are needed to be known outside. It’s more than wrapping all old scrips with an interface. There is a need of a real mapping with a glossary to let people which do not want to leave the bubble now exist in it for a while. What tool will be used for the new bounded context communicating with the other? That’s an open question. I’ll try to choose the best tool, with good enough test support. The only real requirement is the ability to provide the mapping to the old deployment tools’ context.
If you want to learn more, just take a loot at the great Eric Evan’s videos under this link: http://dddcommunity.org/?s=four+strategies
The project AzureDirectory provides an Azure implementation of the Lucene.NET abstraction – Directory. It targets in providing ability to store Lucene index in the Azure storage services. The code can be found in here AzureDirectory . The packages can be found on nuget here. There aren’t marked as prereleases.
The solutions consists of two projects:
The first provides implementation of the Lucene abstractions. The are a few classes, only needed for the feature implementation (implementations of Lucene abstractions). Additionally some utils class are introduced.
The code is structured with regions, which I personally dislike. Names of regions like: CTORS, internal methods, DIRECTORY METHODS shows the way the code is molded, with no classes holding common wrapped in region functionality. The lengthy methods and ctors are another disadvantage of this code base.
The spacing, using directives, fields that may be readonly are messy. Something which may be cleared with a ReSharper Clean Code is left for a reader to deal with.
You can find in there usages of Lucene obsolete API (like IndexInput.Close in disposal), as well as informative comments like:
// sometimes we get access denied on the 2nd stream…but not always. I haven’t tracked it down yet
// but this covers our tail until I do
It’s good and informative for author but leaving the project in an immature state.
The second project is not a test project but a sample app using the lib. No tests at all.
Summing up, after consideration, I wouldn’t use this implementation for my production Azure app. The code is badly composed, with no tests and left with comments pointing at situations where authors are aware of the unsolved-yet problem.