Thursday, July 9, 2009

XML: When All You Have Is A Hammer...

I have never been a huge fan of XML, but for many purposes it is "good enough" that my technical and aesthetic objections are rendered moot by the ubiquitous deployment of the standard. Unfortunately, as with many software technologies, it has become popular to the point where people who know nothing else use it for applications where its suitability is dubious at best. All too often, the momentum of the mob makes it difficult for sane engineering and design choices to prevail.

The widespread use of XML for encoding network protocols is a perfect example of this. Many people understand that XML is quite slow as a wire protocol but they misunderstand why this is true. The conventional assumption is that it is because XML documents are bulky and verbose; the performance problem can therefore be mitigated with data compression, reducing the bandwidth requirements. While it is true that XML-based protocols suffer from bandwidth bloat, the performance and scalability problems persist even when data compression is used, so clearly there are other factors involved. This tends to perplex software engineers who are not familiar with the nuances of wire protocol design.

A well-designed wire protocol has a very important property: it says what it is going to do before it does it. By prefacing every action with metadata, it lets the receiving system know what to expect next on the wire and infer the resource cost of reading it. This allows the receiver to do an enormous amount of optimization and resource conservation; if it knows nothing about what may be coming over the wire, it has to allow for everything, and resources are allocated accordingly. An XML stream has the design flaw that you must scan an unbounded string of characters for the markers that terminate an atomic piece of data, which encourages both aggressive over-commitment of resources and suboptimal data structures. If a malicious or defective XML stream comes over the wire, you may not be able to detect that fact until hard-coded resource limits are reached. Since resource management is generally the second most expensive operation in software after disk I/O, this inefficiency adds up quickly.

In older, more civilized wire protocols, the design pattern often used is known as "TLV" ("Tag-Length-Value"). You send a metadata header saying what you are about to do (the "Tag"), the number of bytes required to send the relevant data over the wire (the "Length"), and then the data itself (the "Value"). This allows the receiving system to precisely predict the minimum resources required to process the transaction and to trap many error conditions before pulling any bytes off the wire. In real-world implementations, a good TLV protocol will often outperform an equivalent XML protocol by an order of magnitude. That is not a small difference.
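To make this concrete, here is a minimal sketch of a TLV reader in Python, assuming an invented framing of a one-byte tag followed by a four-byte big-endian length (real standards such as ASN.1 BER use more elaborate encodings). The point is that the declared length lets the receiver validate the frame and budget for the value before buffering a single byte of it.

    import io
    import struct

    MAX_VALUE_LEN = 1 << 20  # arbitrary sanity limit for this sketch

    def read_tlv_records(stream):
        # Each record: 1-byte tag, 4-byte big-endian length, then 'length' bytes of value.
        while True:
            header = stream.read(5)
            if not header:
                return                  # clean end of stream
            if len(header) < 5:
                raise ValueError("truncated header")
            tag, length = struct.unpack(">BI", header)
            # The declared length lets us reject a bad frame before buffering its payload.
            if length > MAX_VALUE_LEN:
                raise ValueError("declared length %d exceeds limit" % length)
            value = stream.read(length)
            if len(value) != length:
                raise ValueError("truncated value")
            yield tag, value

    # One record: tag 0x01 with a 5-byte payload.
    frame = struct.pack(">BI", 0x01, 5) + b"hello"
    for tag, value in read_tlv_records(io.BytesIO(frame)):
        print(tag, value)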

This fact is not entirely lost on the standards bodies. The ASN.1 XML Encoding Rules ("XER") bridge XML and the venerable (and complicated) ASN.1 encoding standards, which generally follow a TLV model, and the newer binary XML standards are designed in part to address this oversight by providing an efficient wire encoding for XML. I am satisfied with that, but I hope I never see another "scalable" and "high-performance" XML-encoded wire protocol. We should know better by now.

Wednesday, July 8, 2009

Categorizing Access Methods for Metric Spaces

All real-world data types in computer science are elements of finite metric spaces, a consequence of all data being encoded as bounded bit strings. Indexed access methods fall into one of three categories depending on which intrinsic properties of the metric space they preserve. What one can do with an access method depends on these properties, and good software engineers choose the method that supports the minimum set of operations required.

Cardinality-Preserving Methods: Otherwise known as hash tables. These methods preserve neither the metric nor the relative order of values. This works very well for many data sets, because in a lot of applications there is no meaningful relationship between values, and these methods are easy to implement for very large systems. The access method is required to distinguish when two values are identical, and little more. Unfortunately, this also means that analysis of the relationships between values is unsupported, limiting their utility for many kinds of analytical applications.
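As a toy illustration, a plain dictionary keyed on the value is the category in miniature: exact-match lookups are cheap, but no question about how values relate to one another can even be posed to the structure.

    # Cardinality-preserving access in miniature: a hash table can tell you whether an
    # exact value is present (and where), but nothing about how values relate to each other.
    index = {}
    for row_id, value in [(1, 42), (2, 7), (3, 1999)]:
        index.setdefault(value, []).append(row_id)

    print(index.get(7))    # exact-match lookup: [2]
    print(index.get(8))    # "nearest to 8" or "between 5 and 50" cannot even be asked: None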

Order-Preserving Methods: The classic B-tree data structure is canonical. In addition to preserving the cardinality of the values, it also preserves their relative order in the metric space, making it possible to efficiently implement range queries. Given a value, these methods only tell you in which direction to traverse the data structure to find a second value, not the traversal distance, so you cannot skip over the intervening values. Order-preserving methods have an additional limitation in that they are poorly suited to multidimensional metric spaces. After all, what does "order" mean in multiple simultaneous dimensions? Access methods based on space-filling curves (Hilbert or Morton in most cases) are interesting answers to that question, but have not proven particularly successful.
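A minimal one-dimensional sketch, with a sorted Python list standing in for the B-tree: the range query is efficient, but it works by visiting every value inside the range.

    import bisect

    # A sorted list stands in for a B-tree here: order is preserved, so a range query is
    # cheap, but answering it still means touching every value inside the range.
    values = sorted([3, 8, 15, 23, 42, 77, 101])

    def range_query(lo, hi):
        i = bisect.bisect_left(values, lo)
        j = bisect.bisect_right(values, hi)
        return values[i:j]             # every value in [lo, hi] is visited

    print(range_query(10, 80))         # [15, 23, 42, 77]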

Space-Preserving Methods: Few of these exist in the literature, and virtually all have significant pathologies as algorithms. These methods fully preserve the properties of the metric space: not only is the order of values preserved, but so is the distance between values, such that it is not necessary to scan intervening values to find a second value. The access method can expose a fast path between any two values in the space by relative addressing. For scaling operations like joins, these kinds of access methods are very useful.
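For intuition only: in the degenerate one-dimensional case of dense keys with a fixed stride, an ordinary array already behaves this way, because the metric distance between two keys maps directly to a storage offset and one value can be reached from another by relative addressing. The hard, unsolved part is getting that property in general metric spaces; this toy sketch only illustrates what the property buys you.

    # Toy one-dimensional case only: keys are dense integers with a fixed stride, so the
    # metric distance between two keys maps directly to a storage offset.
    base, stride = 1000, 10
    records = ["rec-%d" % k for k in range(base, base + 200 * stride, stride)]

    def fetch(key):
        # Constant-time jump by relative addressing; no intervening keys are scanned.
        return records[(key - base) // stride]

    print(fetch(1000), fetch(2550))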

It is rarely as clean-cut as the above. For example, what category of access method is the R-tree? It is a spatial access method, but the space is preserved only locally (with caveats) rather than globally. One could also observe that this is why R-trees do not have the theoretical properties one would expect of a true space-preserving access method.

In software engineering today, we often use order-preserving methods as a mediocre workaround for applications where space-preserving methods are what is really needed. For range queries, an order-preserving method limits you to a brute-force scan of a subset of the values in the structure; a cardinality-preserving method cannot even restrict the scan to a subset. For some operations such as joins, the order-preserving workaround is poorly suited to the task, though it is still more efficient than the brute-force alternative. A general-purpose space-preserving access method is one of the most important unsolved algorithm problems in computer science today.
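A toy sketch of the workaround, with synthetic points and a one-dimensional index on x standing in for a B-tree: the index narrows the search to an x-stripe, but everything inside the stripe still has to be filtered on y by brute force.

    import bisect

    # Workaround for a 2-D box query using a 1-D order-preserving index on x: the index
    # narrows the scan to an x-stripe, but every point in the stripe must still be
    # filtered on y by brute force. Points are synthetic.
    points = sorted((x, (x * 7919) % 100) for x in range(1000))   # (x, y) pairs, sorted by x
    xs = [p[0] for p in points]

    def box_query(x_lo, x_hi, y_lo, y_hi):
        i = bisect.bisect_left(xs, x_lo)
        j = bisect.bisect_right(xs, x_hi)
        stripe = points[i:j]                                # candidates from the x index
        hits = [p for p in stripe if y_lo <= p[1] <= y_hi]  # brute-force y filter
        return hits, len(stripe)

    hits, scanned = box_query(100, 300, 10, 20)
    print("%d hits after scanning %d candidates" % (len(hits), scanned))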

Tuesday, July 7, 2009

Real-Time Search, A Computer Science Tutorial

The recent elevation of real-time search to buzzword prominence has made plain just how little the software and tech community understands about these types of systems. Databases focused on real-time search have been around a long time, usually called "stream" or "adaptive dataflow" systems. Here is a very short tutorial on the relationship between databases for real-time search (we will call them "stream databases") and more traditional relational database systems.

In a traditional database, you dynamically apply a constraint ("query") to a mostly static set of data. In a stream database, you reverse the relationship: data is dynamically applied to a mostly static set of constraints. In this way, the queries become the "data" in the database. Consider it a database analogue of the old computer science adage about the equivalence of "data" and "code".
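A minimal sketch of the inversion, with invented query names and tuple schema: the standing queries become the stored objects, and each arriving tuple is evaluated against all of them.

    # The inversion in miniature: the standing queries are the stored objects, and each
    # arriving tuple is evaluated against all of them. Query names and the tuple schema
    # are invented for illustration.
    standing_queries = {
        "cheap_aapl": lambda t: t["symbol"] == "AAPL" and t["price"] < 100,
        "big_trades": lambda t: t["size"] > 10000,
    }

    def on_tuple(t):
        # Naive matching: every incoming tuple is tested against every registered query.
        return [name for name, predicate in standing_queries.items() if predicate(t)]

    print(on_tuple({"symbol": "AAPL", "price": 95.0, "size": 50000}))
    # ['cheap_aapl', 'big_trades']

Note that this naive matcher does work proportional to the number of registered queries for every arriving tuple; the rest of this post is about why doing meaningfully better than that is hard.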

The implementation challenge this poses usually escapes immediate notice. Constraints are a spatial data type. As has been observed on this blog before, traditional access methods offer poor performance and scaling when applied to spatial data types. In other words, we can expect real-time search to scale about as well as spatial data types scale under relational access methods, which is to say far too poorly for Internet-scale applications. Stream databases for "real-time search" have been sold by the major database vendors for many years, and the reason most people have never heard of them is that it is very hard to make them scale well enough for most applications.

So when I evaluate the recent barrage of real-time search startups, the first question I always ask is very simple: how will it scale? Surprisingly, very little thought has usually been given to this question. There seems to be an assumption that real-time search has never really been tried, rather than that it has been studied for decades by companies like IBM with little progress toward making it scale. Given the computer science history of real-time search, I expect these startups will discover these problems very quickly, but I wonder how much investor money will be spent before then.

Monday, July 6, 2009

NoSQL: Missing the Point

I recently saw this ComputerWorld article about the "NoSQL" software movement. To summarize what NoSQL is: take a relational database model and strip out every feature that cannot be efficiently implemented in a hash table, which is to say most of the more powerful capabilities that SQL can express against a good relational database.

The NoSQL crowd makes one good point: hash tables currently scale much better than fully relational databases. But discarding relational features from database engines is hardly revolutionary, or even wise outside of narrow cases. The Achilles' heel of NoSQL database engines is that in giving up the ability to efficiently do operations such as multi-attribute indexing and joins, they become very poor platforms for analytical applications.
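To make the trade-off concrete, here is a toy example with an invented schema: with nothing but get/put by key, even a simple join-and-aggregate that a relational engine expresses in one SQL statement has to be reassembled by hand in application code, one lookup at a time.

    # With only get/put by primary key, even a simple join-and-aggregate has to be
    # reassembled by hand in application code. Schema and data are invented.
    users  = {"u1": {"name": "Alice"}, "u2": {"name": "Bob"}}
    orders = {"o1": {"user_id": "u1", "total": 30},
              "o2": {"user_id": "u2", "total": 45},
              "o3": {"user_id": "u1", "total": 12}}

    # Hand-rolled equivalent of:
    #   SELECT u.name, SUM(o.total) FROM users u JOIN orders o ON o.user_id = u.id GROUP BY u.name
    totals = {}
    for order in orders.values():
        name = users[order["user_id"]]["name"]           # one key lookup per order row
        totals[name] = totals.get(name, 0) + order["total"]

    print(totals)   # {'Alice': 42, 'Bob': 45}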

There are two trends in applications: extremely large scaling requirements, and real-time, contextual content that requires increasingly sophisticated analytical processing. To sacrifice the latter for the former is short-sighted. Instead of discarding the analytical features of the relational model, we should be improving the scalability of relational model implementations. This is easier said than done, as it will require replacing the order-preserving access methods of traditional relational implementations with space-preserving access methods, the latter being an unsolved algorithm problem in computer science.

Sunday, July 5, 2009

Semantic Graph Databases

Databases focused on semantic graph query workloads are an interesting problem space for a few different reasons. I was recently introduced to this algorithm space by multiple clients dangling very large carrots for platforms that can smash the current limitations. I have been tangentially aware of this for many years, but never had cause to really study it and knew very little about it until now.

Both the limitations and the carrots are rather stunning to the naive observer. First, the limitations are spectacularly, well, limited. The limit of tractable performance for a large system is on the order of hundreds of thousands of assertions (i.e. records). 1980 called and they want their database back. That is not large enough to be useful for a mid-sized enterprise, never mind for building a semantic web for the entire Internet as many people would like to do. Second, if you could smash these limits and get up to, say, a very speedy billion assertions, it would be worth tens of millions of dollars in short-term revenue, never mind the long-term potential.

Before anyone goes running off to grab the golden ring, it is worth noting that at the core of this poor performance is a very difficult unsolved algorithm problem in computer science. The basic query workload revolves around one of the least scalable operations in traditional database architectures: the hierarchical self-join. Worse, a typical query involves an orgy of multiple self-joins that compound the pathologies of the individual case. It is the kryptonite of database performance and scalability.
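To show the shape of the problem, here is a toy sketch with invented triples: the graph is stored as a single table of (subject, predicate, object) assertions, so even a two-hop question becomes a chain of self-joins on that one table, and every additional hop in the query adds another self-join.

    # A semantic graph is typically stored as one table of (subject, predicate, object)
    # assertions, so a multi-hop query is a chain of self-joins on that single table.
    # The triples and the query are invented for illustration.
    triples = [
        ("alice", "manages", "bob"),
        ("bob",   "manages", "carol"),
        ("carol", "worksOn", "projectX"),
    ]

    # SQL shape of the same question ("who indirectly manages someone on a project?"):
    #   SELECT t1.subject, t3.object
    #     FROM triples t1
    #     JOIN triples t2 ON t2.subject = t1.object     -- first self-join
    #     JOIN triples t3 ON t3.subject = t2.object     -- second self-join
    #    WHERE t1.predicate = 'manages' AND t2.predicate = 'manages' AND t3.predicate = 'worksOn'
    def two_hop_query():
        results = []
        for s1, p1, o1 in triples:
            if p1 != "manages":
                continue
            for s2, p2, o2 in triples:                  # first self-join
                if p2 != "manages" or s2 != o1:
                    continue
                for s3, p3, o3 in triples:              # second self-join
                    if p3 == "worksOn" and s3 == o2:
                        results.append((s1, o3))
        return results

    print(two_hop_query())   # [('alice', 'projectX')]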

There are two avenues to a solution. The first is to use a massively parallel computer system that supports very fast random access to memory, so that a data structure with poor locality is tolerable. This does not describe your typical Intel system, though some more exotic supercomputing hardware approximates it. The second avenue is to develop a very scalable access method that supports parallel spatial joins in the general case, which has proven to be a very stubborn problem in the algorithms and data structures literature.

A common mistake is to assume that prodigious caching is a suitable workaround, misunderstanding that the extremely poor performance mentioned above already refers to in-memory data structures. It is the non-locality of the cache-defeating data access pattern that is responsible for much of the poor performance in the first place, compounded by inefficient access methods. If trivial code tweaks for better cache efficiency and locality could solve this, it is reasonable to assume they would already have been tried.