Google’s Percolator

Next great article from Google: Percolator. The whitepaper describes a new indexing engine which no longer uses the massive MapReduce algorithm to compute the web page index. As usual, Google built on a few already existing tools, like BigTable, and created an index updater which, unlike MapReduce, updates small chunks of the index repository, drastically reducing :-) the time between a page being crawled and being indexed. Worth reading, if only for the transaction scheme implemented on top of the non-transactional BigTable.

SolrNet NHibernate integration

Every developer who creates a scalable application with a high read/write ratio eventually has to move to some query-oriented storage which allows almost instant, advanced querying. Solr is one of the frequently used solutions, providing good performance with advanced indexing/querying. Talking to Solr requires an API, and for .NET you can use SolrNet. As described on the project page, it provides a nice and simple way to integrate with NHibernate, simply by calling the following code:

NHibernate.Cfg.Configuration cfg = SetupNHibernate();
var cfgHelper = new NHibernate.SolrNet.CfgHelper();
cfgHelper.Configure(cfg, true); // true -> autocommit Solr after every operation (not really recommended)

The integration with NH lets you use an extended session to query the Solr server. Additionally, all your updates, inserts, and deletes will automatically be propagated to Solr. This feature is provided by NH event listeners registered by the configuration helper described above. Querying through the extended session looks roughly like this:
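var sessionFactory = cfg.BuildSessionFactory();

// Roughly per the project docs (names may differ between SolrNet versions);
// Post is a hypothetical mapped entity.
using (var session = new NHibernate.SolrNet.SolrSession(sessionFactory.OpenSession()))
{
    var posts = session.CreateSolrQuery("some full-text query").List<Post>();
}

Let’s take a look and audit a bit of the listener code: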

public Configuration Configure(Configuration config, bool autoCommit)
{
    foreach (Type type in this.mapper.GetRegisteredTypes())
    {
        Type type2 = typeof(SolrNetListener<>).MakeGenericType(new Type[] { type });
        Type serviceType = typeof(ISolrOperations<>).MakeGenericType(new Type[] { type });
        object service = this.provider.GetService(serviceType);
        ICommitSetting listener = (ICommitSetting)Activator.CreateInstance(type2, new object[] { service });
        listener.Commit = autoCommit;
        this.SetListener(config, listener);
    }
    return config;
}

For each of the mapped types a new listener is created. Is that OK? Why not use one listener (with no generics at all) handling all the mapped types?
Additionally, doesn’t the SetListener method clear all the previously registered listeners? If so, what about the types handled previously?

A more interesting question arises when looking through the SolrNetListener<T> code. All the listeners in NH are singletons: they are initialized during the start-up phase and used until the application ends. Hence, listeners should be stateless, or use some kind of current-context resolver passed into their constructor. SolrNetListener uses WeakHashtable fields to store the entities which should be flushed to Solr. Effectively global state (because the listener is a singleton)? What about race conditions, locking, etc.? Let’s take a look:

The Delete method, called when a Solr delete should be deferred (the session uses a transaction), shows that the listener GLOBALLY locks the execution of all threads:

private void Delete(ITransaction s, T entity)
{
    // one lock shared by every session and every thread
    lock (this.entitiesToDelete.SyncRoot)
    {
        if (!this.entitiesToDelete.Contains(s))
        {
            this.entitiesToDelete[s] = new List<T>();
        }
        // note: adds to entitiesToAdd although the guard above checked entitiesToDelete
        ((IList<T>)this.entitiesToAdd[s]).Add(entity);
    }
}

Furthermore, data can be committed to the Solr server but not committed to the SQL database! This can happen when flushing occurs: the override of the OnFlush method saves the data to Solr before flushing the DB changes.
What if optimistic locking rolls back the whole transaction after the data has already been stored in Solr? I’d rather have no data in Solr and use some background worker to upsert data into Solr than have copies of nonexistent entities.

The right behavior, with no global locks and with changes flushed to Solr only after the SQL DB transaction commits, can be achieved with an interceptor and a rewritten event listener. I’ll describe it in the next post; the core idea is sketched below.
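A rough sketch of the post-commit part only (not the next post’s code: PostCommitSolrListener and its internals are hypothetical, while IPostUpdateEventListener, ISynchronization and ISolrOperations<T> are the real NHibernate/SolrNet types):

using NHibernate.Event;
using NHibernate.Transaction;
using SolrNet;

public class PostCommitSolrListener<T> : IPostUpdateEventListener
    where T : class
{
    private readonly ISolrOperations<T> solr;

    public PostCommitSolrListener(ISolrOperations<T> solr)
    {
        this.solr = solr;
    }

    public void OnPostUpdate(PostUpdateEvent @event)
    {
        var entity = @event.Entity as T;
        if (entity == null)
            return;

        // No shared per-listener state: the entity travels inside the
        // synchronization object, scoped to this single transaction.
        @event.Session.Transaction.RegisterSynchronization(
            new SolrCommitSynchronization(this.solr, entity));
    }

    private class SolrCommitSynchronization : ISynchronization
    {
        private readonly ISolrOperations<T> solr;
        private readonly T entity;

        public SolrCommitSynchronization(ISolrOperations<T> solr, T entity)
        {
            this.solr = solr;
            this.entity = entity;
        }

        public void BeforeCompletion() { }

        public void AfterCompletion(bool success)
        {
            // Solr is touched only when the SQL transaction really committed,
            // so a rollback can no longer leave ghost documents in the index.
            if (!success)
                return;
            this.solr.Add(this.entity);
            this.solr.Commit();
        }
    }
}

This addresses both problems at once: nothing is shared between threads, so no global lock is needed, and a rolled-back transaction never reaches Solr.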

Regards

Deiphobus design, pt. 1

It’s the right time to write about the Deiphobus design. I’ll start with an example of usage, then move to the configuration, serialization, and the model created from the config. In the next post the event-based design of the session implementation, the identity map and its usage will be explained, as well as the lifetime of a session and the query API.

The configuration provided with Deiphobus is a fluent, “just code” configuration. Each entity is described with a generic EntityClassMap<T>:

/// <summary>
/// The mapping of a class, marking it as an entity stored under a separate key.
/// </summary>
/// <typeparam name="T">The type to be mapped.</typeparam>
public abstract class EntityClassMap<T>
    where T : class
{
    // method bodies elided
    protected void Id(Expression<Func<T, Guid>> id);
    protected IndexPart IndexBy(Expression<Func<T, object>> memberExpression);
    protected void SetSerializer<TSerializer>()
        where TSerializer : ISerializer;
}

The class interface was designed to be similar to Fluent NHibernate (a sample mapping is sketched after the list):

  • a class deriving from this type marks T as a mapped entity class
  • the Id method marks the given property as the identifier. It’s worth noticing that only Guid identifiers are available
  • the second method, IndexBy, is used for marking a property to be indexed with an inverted index. Only the properties marked with this method can be queried in Deiphobus queries; running a query on a non-indexed property will throw an exception
  • the very last method allows setting a custom serializer type for the mapped entity type
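A minimal, hypothetical mapping against the API above (Person and ProtoBufSerializer are invented for illustration; only EntityClassMap<T> and its three methods come from Deiphobus):

public class Person
{
    public Guid Id { get; set; }
    public string LastName { get; set; }
}

public class PersonMap : EntityClassMap<Person>
{
    public PersonMap()
    {
        Id(p => p.Id);                        // only Guid identifiers are supported
        IndexBy(p => p.LastName);             // indexed with an inverted index, hence queryable
        SetSerializer<ProtoBufSerializer>();  // hypothetical entity-specific serializer
    }
}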

All the mappings are consumed by a mapping container, which registers all the entity class maps in itself. The maps are translated into EntityClassModel objects describing the properties of the specific entity. This process takes place when the session factory is created. On the basis of each model class, an object implementing the IEntityPersister interface is created. The persister implementation provides methods like GetPropertyValue or GetIndexedPropertyValues with emitted IL code, to overcome the reflection overhead. This class will be described later; the EntityClassModel’s members can be seen below:

/// <summary>
/// The class representing a model of a mapped entity.
/// </summary>
public class EntityClassModel
{
    public EntityClassModel(Type classType, PropertyInfo id, object idUnsavedValue, IEnumerable<IndexedProperty> indexedProperties)
    {
        ClassType = classType;
        Id = id;
        IdUnsavedValue = idUnsavedValue;
        IndexedProperties = indexedProperties.ToList().AsReadOnly();
    }

    public Type ClassType { get; private set; }
    public PropertyInfo Id { get; private set; }
    public object IdUnsavedValue { get; private set; }
    public IEnumerable<IndexedProperty> IndexedProperties { get; private set; }
    public Type SerializerType { get; set; }
}
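The persister itself is out of scope of this post, but the “no reflection on every call” trick can be sketched with compiled expression trees (Deiphobus emits raw IL; PropertyAccessorFactory below is an equivalent illustration, not its actual code):

using System;
using System.Linq.Expressions;
using System.Reflection;

public static class PropertyAccessorFactory
{
    // Builds (object instance) => (object)((TDeclaring)instance).Property once
    // per mapped property; afterwards every access costs a delegate call
    // instead of PropertyInfo.GetValue.
    public static Func<object, object> CreateGetter(PropertyInfo property)
    {
        var instance = Expression.Parameter(typeof(object), "instance");
        var body = Expression.Convert(
            Expression.Property(
                Expression.Convert(instance, property.DeclaringType),
                property),
            typeof(object));
        return Expression.Lambda<Func<object, object>>(body, instance).Compile();
    }
}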

The very last part of this entry covers serialization in Deiphobus. Because Cassandra is used as the store, each entity is stored under one key, in one column family, in one column. The entity is serialized at the moment of storing. The serialized entity is stored in Cassandra along with its inverted indexes, which are based on values retrieved just before saving the entity in the database. At the moment, two levels of serializers can be set up:

  • the default, used by all the classes not having their own
  • entity-class specific

All other types are always serialized using the default serializer. This behavior may change in the future.
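The two-level lookup can be pictured as a simple registry (a sketch only; SerializerRegistry is hypothetical and the actual Deiphobus internals may differ):

using System;
using System.Collections.Generic;

public class SerializerRegistry
{
    private readonly IDictionary<Type, ISerializer> entitySpecific =
        new Dictionary<Type, ISerializer>();
    private readonly ISerializer defaultSerializer;

    public SerializerRegistry(ISerializer defaultSerializer)
    {
        this.defaultSerializer = defaultSerializer;
    }

    public void Register(Type entityType, ISerializer serializer)
    {
        entitySpecific[entityType] = serializer;
    }

    public ISerializer For(Type type)
    {
        ISerializer serializer;
        return entitySpecific.TryGetValue(type, out serializer)
            ? serializer          // entity-class specific
            : defaultSerializer;  // everything else falls back to the default
    }
}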

Dremel

I’ve just finished reading the Dremel whitepaper. It seems that Google has one more time brought to life something which may change the IT world. Imagine queries running against trillions of rows and returning results in a few seconds; imagine a fault-tolerant DB that scales linearly and still allows you to query it in very advanced ways (for instance, using grouping). Yeah, I’m aware of VoltDB, but Dremel’s description was just astonishing.

On the other hand: have you ever had the chance to test your scalable app on 2,900 servers? :)