Saturday, April 11, 2009

Indexing and Data Set Degeneracy

A term unfamiliar to many people when discussing the scalability characteristics of indexing algorithms for large-scale databases is "degenerate" data. Most literature on indexing only considers specific data types when evaluating algorithms, such as text or spatial rectangles, and ignores the relationship between data set degeneracy and algorithm scalability. 

The definition of the term "degenerate" was borrowed from topology theory, and has a similar meaning when one considers an indexed data structure to represent a vector space and data as objects/vectors in that space.  A dimension in which the magnitude of a vector is zero is "degenerate". As a simple example in an Euclidean 2-space and assuming all geometries have sides parallel to an axis, a rectangle will be non-degenerate, a line will be semi-degenerate, and a point will be fully degenerate.  Or in other words, all of these geometries are rectangles with varying degrees of degeneracy in 2-space, though we generally do not think of them this way.

This distinction is important because most indexing algorithms exploit the partitionability of a data set to maximize scalability, and partitionability in theory is a function of data degeneracy. Some algorithms, such as quad-trees, have practical scalability that varies dramatically as a function of the average degeneracy of the data set indexed using it. Other algorithms, such as high-concurrency B-trees (which represent an Euclidean 1-space), employ scaling optimizations that restrict them to only handling degenerate data types. 

One of the most important unsolved problems in database systems is the lack of a scalable, general algorithm for indexing n-dimensional vector spaces. Currently the literature only describes algorithms that function on Internet scales for fully degenerate data sets in low dimensionality vector spaces. Unfortunately, more and more important data types are explicitly non-degenerate and high-dimensionality, such as geospatial geometries or video, or implicitly non-degenerate and high-dimensionality, such as many relational operations common to analytical processing. Understanding how data set degeneracy interacts with the indexed access methods found in database literature is crucial to understanding scaling behavior in very large databases, particularly if one is attempting to index data sets that may not be degenerate to some degree or if one is attempting large-scale analytics that are implicitly high-dimensional. 


No comments:

Post a Comment