Tuesday, April 21, 2009

The Tractability of Market Equilibria

In the last few years there has been a growing number of papers in game theory, such as this one, that undermine what I believe is a widespread and tacit assumption: that markets can and will find a predictable long-term equilibrium if left to their own devices.

To quote Kamal Jain: "If your laptop can't find it, neither can the market."

The proven computational intractability of a rapidly growing number of simple, game-theoretic equilibrium problems implies that most real markets are, for all practical purposes, inherently unstable. Furthermore, it suggests that using regulation to placate risk aversion will almost certainly have adverse, unintended consequences that limit the credibility of any claims of improved stability. It also provides yet another reason why so many hedge funds and self-styled quants fail at the game of applying predictive analytics to real markets.
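To make the flavor of the problem concrete, here is a minimal Python sketch (my own illustrative example, not taken from any of the papers) that brute-forces the pure-strategy Nash equilibria of a two-player game. The pure-strategy case is the easy one; the intractability results concern mixed strategies and many players, where the analogous search over strategy supports grows exponentially and a pure equilibrium may not exist at all.

```python
from itertools import product

def pure_nash_equilibria(payoff_a, payoff_b):
    """Brute-force search for pure-strategy Nash equilibria of a
    two-player bimatrix game. Exhaustive but cheap here; the analogous
    search over mixed-strategy supports is exponential in the number
    of strategies."""
    m = len(payoff_a)          # row player's strategy count
    n = len(payoff_a[0])       # column player's strategy count
    equilibria = []
    for i, j in product(range(m), range(n)):
        # (i, j) is an equilibrium iff each strategy is a best
        # response to the other player's strategy.
        row_best = all(payoff_a[i][j] >= payoff_a[k][j] for k in range(m))
        col_best = all(payoff_b[i][j] >= payoff_b[i][k] for k in range(n))
        if row_best and col_best:
            equilibria.append((i, j))
    return equilibria

# Prisoner's dilemma: strategy 0 = cooperate, 1 = defect.
A = [[3, 0], [5, 1]]   # row player's payoffs
B = [[3, 5], [0, 1]]   # column player's payoffs
print(pure_nash_equilibria(A, B))  # the unique pure equilibrium is (1, 1)
```

Note that for a game like matching pennies this search returns nothing at all: the only equilibrium is mixed, and finding mixed equilibria in general is exactly where the hardness results bite.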

Markets are essential tools for price discovery when optimizing resource allocation, but that does not imply a market will have a stable equilibrium. Unfortunately for those averse to risk (and risk does carry a real cost in the markets), regulation generally has a limited ability to guarantee stability while imposing detrimental costs on efficient and accurate price discovery.

Psychologically and sociologically, we have a pervasive preference for stability, so it is something of an inconvenient truth that most things in life not only are not stable but cannot be made stable, including many human socioeconomic interactions. As with many things, it will be easier to convince ourselves that we can control complex economic systems than to accept the fact that we probably cannot.

Monday, April 13, 2009

Schooner Leaves Stealth Mode

Schooner Information Technology has announced a MySQL appliance and a memcached appliance, apparently based on specially optimized hardware and software. This is a crowded startup space, and the appliances are nice looking (they had better be, given the $45,000 price tag) but unremarkable in most other respects.

The proliferation of specialized appliances for MySQL applications raises some interesting questions when viewed in light of the rest of the database market. Many of these appliances do not solve a general database problem; they use better hardware to cover for deficiencies in MySQL's software design. For example, it is well established that for OLTP workloads the other major open source database, PostgreSQL, will run rings around MySQL in throughput on a given piece of server hardware. Spending a lot of money coaxing extra performance out of MySQL that could be had by installing free alternatives is not a recipe for long-term success or viability, and one of the reasons there are so many MySQL accelerator startups is that it is so easy to find deficiencies that can be addressed with modest effort.

Specialized appliances and accelerated hardware have their place (in the database world, I would argue that high-end analytics is one such place), but I have a hard time coming up with an economic argument for fixing software deficiencies with hardware that would lead me to believe such companies are good venture investments.

Saturday, April 11, 2009

Indexing and Data Set Degeneracy

A term unfamiliar to many people discussing the scalability characteristics of indexing algorithms for large-scale databases is "degenerate" data. Most literature on indexing considers only specific data types when evaluating algorithms, such as text or spatial rectangles, and ignores the relationship between data set degeneracy and algorithm scalability.

The definition of the term "degenerate" is borrowed from topology, and it has a similar meaning when one considers an indexed data structure to represent a vector space and the data as objects/vectors in that space. A dimension in which the magnitude of a vector is zero is "degenerate". As a simple example, in a Euclidean 2-space and assuming all geometries have axis-parallel sides, a rectangle is non-degenerate, a line is semi-degenerate, and a point is fully degenerate. In other words, all of these geometries are rectangles with varying degrees of degeneracy in 2-space, though we generally do not think of them this way.
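The idea is simple enough to state in a few lines of Python (an illustrative sketch of my own, not anyone's published definition): represent an axis-aligned geometry as a list of per-dimension (lo, hi) intervals, and the degree of degeneracy is just the number of zero-extent dimensions.

```python
def degeneracy(box):
    """Count the degenerate dimensions of an axis-aligned geometry,
    given as a list of (lo, hi) intervals, one per dimension. A
    dimension with zero extent is degenerate: a rectangle has none,
    an axis-parallel line has one, and a point is fully degenerate."""
    return sum(1 for lo, hi in box if lo == hi)

rect  = [(0.0, 2.0), (0.0, 1.0)]   # 2-D rectangle: 0 degenerate dims
line  = [(0.0, 2.0), (1.0, 1.0)]   # axis-parallel line: 1
point = [(2.0, 2.0), (1.0, 1.0)]   # point: 2 (fully degenerate)
```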

This distinction is important because most indexing algorithms exploit the partitionability of a data set to maximize scalability, and partitionability is, in theory, a function of data degeneracy. Some algorithms, such as quad-trees, have practical scalability that varies dramatically with the average degeneracy of the data set they index. Other algorithms, such as high-concurrency B-trees (which represent a Euclidean 1-space), employ scaling optimizations that restrict them to handling only degenerate data types.
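A toy region quadtree makes the quad-tree behavior concrete (this is a bare-bones sketch of my own for illustration, not production code). Fully degenerate entries, i.e. points, always fall entirely within one quadrant and can be pushed down as far as the tree goes; non-degenerate rectangles that straddle a split line cannot be assigned to any child and get stuck at internal nodes, which is one way data set degeneracy shows up as a practical partitioning and scalability issue.

```python
class QuadTree:
    """Toy region quadtree over axis-aligned boxes ((x0, x1), (y0, y1)).
    Entries that fit entirely inside a child quadrant are pushed down;
    entries straddling a split line stay at the internal node."""

    def __init__(self, bounds, capacity=1, depth=0, max_depth=8):
        self.bounds = bounds          # ((x0, x1), (y0, y1))
        self.capacity = capacity
        self.depth = depth
        self.max_depth = max_depth
        self.items = []               # entries stored at this node
        self.children = None

    def _quadrants(self):
        (x0, x1), (y0, y1) = self.bounds
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        return [((x0, xm), (y0, ym)), ((xm, x1), (y0, ym)),
                ((x0, xm), (ym, y1)), ((xm, x1), (ym, y1))]

    @staticmethod
    def _contains(outer, box):
        # True if `box` lies entirely within `outer` in every dimension.
        return all(o0 <= b0 and b1 <= o1
                   for (o0, o1), (b0, b1) in zip(outer, box))

    def insert(self, box):
        if self.children is None:
            self.items.append(box)
            if len(self.items) > self.capacity and self.depth < self.max_depth:
                self._split()
            return
        for child in self.children:
            if self._contains(child.bounds, box):
                child.insert(box)     # fits one quadrant: push down
                return
        self.items.append(box)        # straddles a split line: stuck here

    def _split(self):
        self.children = [QuadTree(q, self.capacity, self.depth + 1,
                                  self.max_depth) for q in self._quadrants()]
        pending, self.items = self.items, []
        for box in pending:
            self.insert(box)
```

Inserting four well-separated points forces a clean four-way split, while a rectangle spanning the center of the space remains pinned to the root no matter how deep the tree grows; a data set dominated by such straddling, non-degenerate entries partitions poorly.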

One of the most important unsolved problems in database systems is the lack of a scalable, general algorithm for indexing n-dimensional vector spaces. Currently the literature only describes algorithms that function at Internet scale for fully degenerate data sets in low-dimensionality vector spaces. Unfortunately, more and more important data types are explicitly non-degenerate and high-dimensional, such as geospatial geometries or video, or implicitly non-degenerate and high-dimensional, such as many relational operations common to analytical processing. Understanding how data set degeneracy interacts with the indexed access methods found in the database literature is crucial to understanding scaling behavior in very large databases, particularly if one is attempting to index data sets that may not be fully degenerate or to run large-scale analytics that are implicitly high-dimensional.

Friday, April 10, 2009

A Renaissance? Not Exactly.

Gary Orenstein posted an article on GigaOM this morning titled "The Data Mining Renaissance" that elicited a comment from me about the proper characterization of the evolution of the data mining market. While I agree that inexpensive data mining platforms based on Google-like technologies are proliferating, the renaissance has been in the economics of data mining rather than in the capabilities of such systems.

This is an important distinction because much of the interest driving the uptake of large-scale data mining technology is in precisely the kinds of applications that are poorly served by the existing technology. It is in many ways a self-limiting resurgence. Much of the interest in data mining is in contextual targeting and preference modeling, which, to do well at large scale, require overcoming the Achilles' heels of current data mining technology: poor real-time analytic performance and inadequate support for high-dimensionality data models.

In other words, the "renaissance" is that expensive data mining technologies are becoming less expensive over time, which is not exactly a remarkable observation in the technology industry. Unfortunately, what is really needed is data mining technology that is better. I fully expect we will get these better technologies sooner rather than later (after all, that is my business), but when that happens it will look more like a revolution than a renaissance, as it will enable capabilities that are currently unavailable at any price.

An Inaugural Post

Traditional data mining operates on simple data stripped of most of its real-world context, analyzed long after the event being measured has passed. The use of complex data types, data fusion, and real-time analysis at large scale is virtually unknown despite obvious, demonstrable value. Real-time, deeply contextual, large-scale data analysis -- "reality mining" -- is a very different kind of technology that can be used for applications far beyond the ken of conventional data mining technologies.

The absence of reality mining platforms is not a market oversight but a technological limit, the consequence of a known unsolved problem in theoretical computer science. Reality mining captures the tantalizing essence of what is just beyond our grasp: the ability to analyze the history of the mundane as it unfolds in exceptional detail. Not only does this enable the obvious ability to fuse the real with the virtual, it allows the possibility of creating unprecedented predictive and behavioral models of individuals, societies, and the world at large that can be exploited in real time. The implications are frankly astounding, and as with all powerful technologies, it comes with the potential for great benefits and great hazards.

As someone intimately involved in many aspects of the nascent field of reality mining, I am dedicating this blog broadly to the topic of measuring, modeling, and analyzing reality. This includes the practical implementation aspects, the theoretical computer science aspects, the sociocultural aspects, and the policy aspects of these types of technologies. Additionally, separating fiction from fact with respect to claims made in the technology industry about related technologies, such as Internet-scale databases and predictive analytics, will also see significant coverage.