Wednesday, July 8, 2009

Categorizing Access Methods for Metric Spaces

All real-world data types in computer science elements in finite metric spaces, a consequence of all data being encoded as bounded bit strings. Indexed access methods fall into one of three categories depending on which intrinsic properties of metric spaces are not directly exposed in the access method. What one can do with the access method is dependent on these properties, and good software engineers choose the method that supports the minimum set of operations required.

Cardinality-Preserving Methods: Otherwise known as hash tables. These methods preserve neither the metric nor the relative order of values. This works very well for many data sets because for a lot of applications there is no meaningful relationship between values, and they are easy to implement for very large systems. The access method is required to distinguish when two values are identical and little more. Unfortunately, this also means that analysis of the relationship between values is unsupported, limiting utility for many kinds of analytical applications.

Order-Preserving Methods: The classic B-tree data structure is canonical. In addition to preserving the cardinality of the values, it also preserves the relative order of the values in the metric space making it possible to efficiently implement range queries. Given a value, these types of methods only tell you in which direction to traverse the data structure to find a second value, not the traversal distance such that one could skip all intervening values. Order-preserving methods have an additional limitation in that they are poorly suited for multidimensional metric spaces. After all, what does "order" mean in multiple simultaneous dimensions? Access methods based on space-filling curves (Hilbert or Morton in most cases) are interesting answers to that question, but have not proven particularly successful.

Space-Preserving Methods: Few of these exist in literature, and virtually all have significant pathologies as algorithms. These methods fully preserve the properties of the metric space, notably that not only is the order of values preserved, but the distance between values is preserved such that it is not necessary to scan intervening values to find a second value. The access method can expose a fast path between any two values in the space by relative addressing. For scaling operations like joins, these kinds of access methods are very useful.

It is rarely as clean-cut as the above. For example, what category of access method is R-tree? It is a spatial access method, but the space is preserved locally (with caveats) rather than globally. One could also observe that this is the reason R-trees do not have theoretical properties one would expect of a true space-preserving access method.

In software engineering today, we often use order-preserving methods as a mediocre workaround for applications where space-preserving methods are recommended. For range queries, an order-preserving query limits you to doing a brute-force scan of a subset of the values in the structure, unlike cardinality-preserving methods. For some operations such as joins, an order-preserving workaround is poorly suited to the task though it is still more efficient than the brute-force alternative. A general-purpose space-preserving access method is one of the most important unsolved algorithm problems in computer science today.

No comments:

Post a Comment