Sunday, July 5, 2009

Semantic Graph Databases

Databases focused on semantic graph query workloads are an interesting problem space for a few different reasons. I was recently introduced to it by multiple clients dangling very large carrots for platforms that can smash the current limitations. I have been tangentially aware of the area for many years, but never had cause to really study it and knew very little about it until now.

Both the limitations and the carrots are rather stunning to the naive observer. First, the limitations are spectacularly, well, limited. The limit of tractable performance for a large system is on the order of hundreds of thousands of assertions (i.e. records). 1980 called and they want their database back. That is not large enough to be useful for a mid-sized enterprise, never mind building a semantic web for the entire Internet as many people would like to do. Second, if you could smash these limits and push that up to, say, a very speedy billion assertions, it would be worth tens of millions of dollars in short-term revenue, never mind the long-term potential.

Before anyone goes running off to grab the golden ring, it is worth noting that at the core of this poor performance is a very difficult unsolved algorithm problem in computer science. The basic query workload revolves around one of the most unscalable operations of traditional database architectures: the hierarchical self-join. Worse, your typical query involves an orgy of multiple self-joins that compound the pathologies of the individual case. It is the kryptonite of database performance and scalability.
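To make that join structure concrete, here is a minimal sketch in Python. The data and names (the triples list, the join helper, alice/bob/carol) are purely illustrative and not taken from any real system; the point is that a graph stored as a single subject-predicate-object table turns even a short path query into a chain of joins against that same table.

# Illustrative sketch only: a semantic graph stored as one big table of
# (subject, predicate, object) triples, so a short path query becomes a
# chain of self-joins over that single table.

triples = [
    ("alice", "manages", "bob"),
    ("bob", "manages", "carol"),
    ("carol", "worksOn", "projectX"),
]

def join(bindings, pattern):
    """Join current variable bindings against one triple pattern.

    Each pattern term is either a literal or a '?variable'; every pattern
    evaluated this way is another pass over the same table, i.e. another
    self-join.
    """
    out = []
    for env in bindings:
        for s, p, o in triples:
            new_env = dict(env)
            ok = True
            for term, value in ((pattern[0], s), (pattern[1], p), (pattern[2], o)):
                if term.startswith("?"):
                    if term in new_env and new_env[term] != value:
                        ok = False
                        break
                    new_env[term] = value
                elif term != value:
                    ok = False
                    break
            if ok:
                out.append(new_env)
    return out

# "Who works on a project managed, two levels down, by alice?"
# Three patterns over one table = three self-joins.
results = [{}]
for pattern in [("alice", "manages", "?x"),
                ("?x", "manages", "?y"),
                ("?y", "worksOn", "?proj")]:
    results = join(results, pattern)

print(results)  # [{'?x': 'bob', '?y': 'carol', '?proj': 'projectX'}]

Every additional hop in the query is another pass over the same table, which is why a realistic query compounds the cost of the single self-join case.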

There are two avenues to a solution. The first is to use a massively parallel computer system that supports very fast random access to memory using a data structure with poor locality. This does not describe your typical Intel system, though some more exotic supercomputing hardware approximates it. The second avenue is to develop a very scalable access method that supports parallel spatial joins in the general case, which has proven to be a very stubborn problem in the algorithms and data structures literature.

A common mistake is to assume that prodigious caching is a suitable workaround, overlooking that the extremely poor performance mentioned above already refers to in-memory data structures. It is the non-locality of the cache-obviating data access pattern that is responsible for much of the poor performance in the first place, compounded by inefficient access methods. If trivial code tweaks for better cache efficiency and locality could solve this, it is reasonable to assume they would already have been tried.
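As a toy illustration of that access pattern (an assumed example, not a benchmark of any real system), consider pointer-chasing through a graph that fits comfortably in RAM: the address of each read depends on the value of the previous read, so hardware prefetchers and caches get little traction once the structure outgrows the cache.

# Illustrative sketch of the access pattern only (Python overheads dominate
# any timing here; the point is the data-dependent, non-local reads).

import random

random.seed(0)
N = 2_000_000                       # far more nodes than fit in CPU cache
successor = [random.randrange(N) for _ in range(N)]  # one random edge per node

def chase(start, hops):
    """Follow edges pointer-chasing style: the location of the next read is
    unknown until the current read completes, so there is no locality for a
    cache or prefetcher to exploit."""
    node = start
    for _ in range(hops):
        node = successor[node]      # data-dependent random access
    return node

print(chase(0, 1_000_000))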
