Reflections of a Reality Miner: 2009

Tuesday, September 1, 2009

PAPER: A Computational View Of Market Efficiency

An interesting paper showed up on arXiv, "A Computational View Of Market Efficiency". I find it interesting for two reasons. First, it is one of the rare papers in finance that observes that seemingly complex market behaviors have elegant and relatively simple descriptions in the context of computational information theory, a significant oversight in finance literature generally. Second, the paper makes no references to the existing work in algorithmic information theory that would show their result to be trivial and obvious, which demonstrates that a big cross-disciplinary gap in understanding still exists.

From the abstract:

We propose to study market efficiency from a computational viewpoint. Borrowing from theoretical computer science, we define a market to be efficient with respect to resources S (e.g., time, memory) if no strategy using resources S can make a profit. As a first step, we consider memory-m strategies whose action at time t depends only on the m previous observations at times t-m, ..., t-1. We introduce and study a simple model of market evolution, where strategies impact the market by their decision to buy or sell. We show that the effect of optimal strategies using memory m can lead to “market conditions” that were not present initially, such as (1) market bubbles and (2) the possibility for a strategy using memory m' > m to make a bigger profit than was initially possible. We suggest ours as a framework to rationalize the technological arms race of quantitative trading firms.

Tuesday, August 11, 2009

Software Patents and the Church-Turing thesis

One of the most interesting arguments against computer science algorithm patents is that the Church-Turing thesis makes them mathematics and therefore not patentable subject matter.

If this was an argument against all patents and intellectual property constructs then there would be a certain consistency to it, but most people that apply it appear to be applying it selectively without understanding it. A simple corollary of the Church-Turing argument reduces all possible material to mathematics.

In short, algorithmic information theory does not allow that theoretically meaningful distinctions among patentable materials exist. There might be utilitarian arguments that do not try to assert fundamental differences, but those arguments are rarely made. This suggests that the arguments both for and against are generally ignorant at best and disingenuous at worst, hardly the basis for sound policy in any case.

I think there are valid benefits from both sides of the argument, but the discourse is rarely intellgent nor does it acknowledge that we are making tradeoffs between shades of gray. I find the gross inconsistencies inherent in most peoples' positions on this topic to be more unreasonable than any particular position in the abstract.

Thursday, July 9, 2009

XML: When All You Have Is A Hammer...

I have never been a huge fan of XML, but for many purposes it is "good enough" such that my technical and aesthetic objections are rendered moot by the ubiquitous deployment of the standard. Unfortunately, like with many software technologies it has become popular to the point where people that know nothing else use it for applications where the suitability is specious. All too often, the momentum of the mob makes it difficult for sane engineering and design choices to prevail.

The widespread use of XML for encoding network protocols is a perfect example of this. Many people understand that XML is quite slow as a wire protocol but they misunderstand why this is true. The conventional assumption is that it is because XML documents are bulky and verbose; the performance problem can therefore be mitigated with data compression, reducing the bandwidth requirements. While it is true that XML-based protocols suffer from bandwidth bloat, the performance and scalability problems persist even when data compression is used, so clearly there are other factors involved. This tends to perplex software engineers who are not familiar with the nuances of wire protocol design.

A well-designed wire protocol has a very important property: it says what it is going to do before it does it. By prefacing all actions with metadata, it allows computers reading data off the wire to both know what to expect over the wire next and to infer the resource cost of reading that data off the wire. This allows the receiving system to do an enormous amount of optimization and resource conservation; if the receiving system does not know anything about what may be coming over the wire, it has to allow for everything and resources are allocated accordingly. An XML stream has the design flaw that you have to scan an unbounded string of characters for markers that terminate an atomic piece of data, encouraging both aggressive over-commitment of resources and suboptimal data structures where performance is concerned. If a malicious or defective XML stream goes over the wire, you may not be able to ascertain that fact until hard-coded resource limits are reached. Since resource management is generally the second most expensive operation in software after disk I/O, the very inefficient resource management adds up quickly where performance is concerned.

In older, more civilized wire protocols, the design pattern often used is known as "TLV" protocols ("Tag-Length-Value"). You send a metadata header saying what you are about to do ("Tag"), the number of bytes required to send the relevant data over the wire ("Length"), and then the data itself ("Value"). This allows the receiving system to precisely predict the minimum resources required to successfully process the transaction and to trap many error conditions before pulling bytes over the wire. In real-world protocol implementations, a good TLV protocol will often outperform an equivalent XML protocol by an order of magnitude. That is not a small difference.

This fact is not entirely lost on the standards bodies. There are ASN.1 encoding rules ("XER") for translating XML into the venerable (and complicated) ASN.1 wire protocol encoding standards, which generally use a TLV model. Furthermore, the new binary XML standards are designed in part to address this oversight by providing an efficient wire protocol encoding standard for XML. I am satisfied with this, but I hope I never see another "scalable" and "high-performance" XML-encoded wire protocol. We should know better by now.

Wednesday, July 8, 2009

Categorizing Access Methods for Metric Spaces

All real-world data types in computer science elements in finite metric spaces, a consequence of all data being encoded as bounded bit strings. Indexed access methods fall into one of three categories depending on which intrinsic properties of metric spaces are not directly exposed in the access method. What one can do with the access method is dependent on these properties, and good software engineers choose the method that supports the minimum set of operations required.

Cardinality-Preserving Methods: Otherwise known as hash tables. These methods preserve neither the metric nor the relative order of values. This works very well for many data sets because for a lot of applications there is no meaningful relationship between values, and they are easy to implement for very large systems. The access method is required to distinguish when two values are identical and little more. Unfortunately, this also means that analysis of the relationship between values is unsupported, limiting utility for many kinds of analytical applications.

Order-Preserving Methods: The classic B-tree data structure is canonical. In addition to preserving the cardinality of the values, it also preserves the relative order of the values in the metric space making it possible to efficiently implement range queries. Given a value, these types of methods only tell you in which direction to traverse the data structure to find a second value, not the traversal distance such that one could skip all intervening values. Order-preserving methods have an additional limitation in that they are poorly suited for multidimensional metric spaces. After all, what does "order" mean in multiple simultaneous dimensions? Access methods based on space-filling curves (Hilbert or Morton in most cases) are interesting answers to that question, but have not proven particularly successful.

Space-Preserving Methods: Few of these exist in literature, and virtually all have significant pathologies as algorithms. These methods fully preserve the properties of the metric space, notably that not only is the order of values preserved, but the distance between values is preserved such that it is not necessary to scan intervening values to find a second value. The access method can expose a fast path between any two values in the space by relative addressing. For scaling operations like joins, these kinds of access methods are very useful.

It is rarely as clean-cut as the above. For example, what category of access method is R-tree? It is a spatial access method, but the space is preserved locally (with caveats) rather than globally. One could also observe that this is the reason R-trees do not have theoretical properties one would expect of a true space-preserving access method.

In software engineering today, we often use order-preserving methods as a mediocre workaround for applications where space-preserving methods are recommended. For range queries, an order-preserving query limits you to doing a brute-force scan of a subset of the values in the structure, unlike cardinality-preserving methods. For some operations such as joins, an order-preserving workaround is poorly suited to the task though it is still more efficient than the brute-force alternative. A general-purpose space-preserving access method is one of the most important unsolved algorithm problems in computer science today.

Tuesday, July 7, 2009

Real-Time Search, A Computer Science Tutorial

The recent elevation of real-time search to buzzword prominence has made plain just how little the software and tech community understands about these types of systems. Databases focused on real-time search have been around a long time, usually called "stream" or "adaptive dataflow" systems. Here is a very short tutorial on the relationship between databases for real-time search (we will call them "stream databases") and more traditional relational database systems.

In a traditional database, you dynamically apply a constraint ("query") to a mostly static set of data. For a stream database, you reverse the relationship; data is dynamically applied to a mostly static set of constraints. In this way, the queries become the "data" in the database. Consider this a database analogue to the old computer science adage about equivalency of "data" and "code".

The implementation challenge this poses usually escapes immediate notice. Constraints are a spatial data type. As has been observed on this blog before, traditional access methods offer poor performance and scaling when applied to spatial data types. In other words, we can expect real-time search to scale about as well as spatial data types scale using relational access methods, which is to say far too poorly for Internet scale applications. Stream databases for "real-time search" have been sold by the major database vendors for many years, and the reason most people have never heard of them is that it is very hard to make them scale well enough for most applications.

So when I evaluate the recent barrage of real-time search startups, the first question that I always ask is very simple: How will it scale? Surprisingly, very little thought has usually been given to this question, and it seems that there is an assumption that real-time search has never really been tried rather than that it has been studied for decades by companies like IBM with little progress toward making it scale. Given the computer science history of real-time search I expect these startups will discover these problems very quickly, but I wonder how much money will be spent by investors before then.

Monday, July 6, 2009

NoSQL: Missing the Point

I recently saw this ComputerWorld article about the "NoSQL" software movement. To summarize what NoSQL is, take a relational database model and strip out every feature that cannot be efficiently implemented in a hash table. Which is to say, most of the more powerful capabilities that can be implemented in SQL using a good relational database.

The NoSQL crowd makes one good point: hash tables scale much better than fully relational databases currently. But discarding relational features from database engines is hardly revolutionary or even wise outside of narrow cases. The Achilles' Heel of NoSQL database engines is that in giving up the ability to efficiently do operations such as multi-attribute indexing and joins they become very poor for analytical applications.

There are two trends in applications: extremely large scaling requirements and real-time, contextual content that requires increasingly sophisticated analytical processing. To sacrifice the latter for the former is short-sighted. Instead of discarding the analytical features of relational models, we should be improving the scalability of relational model implementations. This is easier stated than done, as it will require replacing the current order-preserving access methods of traditional relational model implementations with space-preserving access methods, the latter being an unsolved algorithm problem in computer science.

Sunday, July 5, 2009

Semantic Graph Databases

Databases focused on semantic graph query workloads are an interesting problem space for a few different reasons. I was recently introduced to this algorithm space by multiple clients dangling very large carrots for platforms that can smash the current limitations. I have been tangentially aware of this for many years, but never had cause to really study it and knew very little about it until now.

Both the limitations and the carrots are rather stunning to the naive observer. First, the limitations are spectacularly, well, limited. The limits of tractable performance for a large system is on the order of hundreds of thousands of assertions (i.e. records). 1980 called and they want their database back. That is not large enough to be useful for a mid-sized enterprise, never mind building a semantic web for the entire Internet as many people would like to do. Second, if you could smash these limits and get it up to, say, a very speedy billion assertions, it would be worth tens of millions of dollars in short-term revenue, never mind the long-term potential.

Before any goes running off to grab the golden ring, it is worth noting that at the core of this poor performance is a very difficult unsolved algorithm problem in computer science. The basic query workload revolves around one of the most unscalable operations of traditional database architectures: the hierarchical self-join. Worse, your typical query involves an orgy of multiple self-joins that compound the pathologies of the individual case. It is the kryptonite of database performance and scalability.

There are two avenues to a solution. The first is to use a massively parallel computer system that supports very fast random access to memory using a data structure with poor locality. This does not describe your typical Intel system, though some more exotic supercomputing hardware approximates it. The second avenue is to develop a very scalable access method that supports parallel spatial joins in the general case, which has proven to be a very stubborn problem in algorithms and data structures literature.

A common mistake is to assume that prodigious caching is a suitable workaround, misunderstanding that the extremely poor performance mentioned above already refers to in-memory data structures. It is the non-locality of the cache-obviating data access pattern that is responsible for much of the poor performance in the first place, compounded by inefficient access methods. If trivial code tweaks for better cache efficiency and locality could solve this, it seems reasonable to assume that it has already been tried.

Tuesday, April 21, 2009

The Tractability of Market Equilibria

In the last few years there has been a growing number of papers in Game Theory, such as this one, that belie what I believe is a widespread and tacit assumption that markets can and will find a predictable long-term equilibrium if left to their own devices.

To quote Kamal Jain: "If your laptop can't find it, neither can the market."

The provable lack of tractable computability for a rapidly increasing number of simple, game-theoretic equilibrium problems carries with it the implication that most real markets are inherently unstable for all practical purposes. Furthermore, it suggests that the use of regulation to placate risk aversion will almost certainly have adverse, unintended consequences that limit the credibility of any claims of improved stability. It also provides yet another reason why so many hedge funds and self-styled quants fail at the game of attempting to apply predictive analytics to real markets.

Markets are essential tools for price discovery when optimizing resource allocation, but in some cases this does not imply that the market will have a stable equilibrium. Unfortunately for those averse to risk, and risk does carry a real cost in the markets, regulation generally has a limited ability to guarantee stability while imposing detrimental costs with respect to efficient and accurate price discovery.

Psychologically and sociologically we have a pervasive preference for stability, and it therefore is something of an inconvenient truth that not only are most things in life not stable but they cannot be made stable, even when discussing many human socioeconomic interactions. As with many things, it will be easier to convince ourselves that we can control complex economic systems than to accept the fact that we probably cannot.

Monday, April 13, 2009

Schooner Leaves Stealth Mode

Schooner Information Technology has announced a MySQL appliance and a memcached appliance apparently based on specially optimized hardware and software. This is a crowded startup space and the appliances are nice looking -- they better be given the $45,000 price tag -- but unremarkable in most other respects.

The proliferation of specialized appliances for MySQL applications raises some interesting questions when viewed in light of the rest of the database market. Many of these appliances do not solve a general database problem, they are using better hardware to cover for deficiencies in the MySQL software design. For example, it is well-established that for OLTP workloads the other major Open Source database PostgreSQL will run rings around MySQL for throughput on a given piece of server hardware. Spending a lot of money coaxing extra performance out of MySQL that could be had by installing free alternatives is not a recipe for long-term success or viability, and one of the reasons there are so many MySQL accelerator startups out there is that it is so easy to find deficiencies that can be addressed with modest effort.

Specialized appliances and accelerated hardware have a place, in the database world I would argue that high-end analytics is one such place, but I have a hard time coming up with an economic argument for fixing software deficiencies with hardware solutions that would lead me to believe such companies are good venture investments.

Saturday, April 11, 2009

Indexing and Data Set Degeneracy

A term unfamiliar to many people when discussing the scalability characteristics of indexing algorithms for large-scale databases is "degenerate" data. Most literature on indexing only considers specific data types when evaluating algorithms, such as text or spatial rectangles, and ignores the relationship between data set degeneracy and algorithm scalability.

The definition of the term "degenerate" was borrowed from topology theory, and has a similar meaning when one considers an indexed data structure to represent a vector space and data as objects/vectors in that space. A dimension in which the magnitude of a vector is zero is "degenerate". As a simple example in an Euclidean 2-space and assuming all geometries have sides parallel to an axis, a rectangle will be non-degenerate, a line will be semi-degenerate, and a point will be fully degenerate. Or in other words, all of these geometries are rectangles with varying degrees of degeneracy in 2-space, though we generally do not think of them this way.

This distinction is important because most indexing algorithms exploit the partitionability of a data set to maximize scalability, and partitionability in theory is a function of data degeneracy. Some algorithms, such as quad-trees, have practical scalability that varies dramatically as a function of the average degeneracy of the data set indexed using it. Other algorithms, such as high-concurrency B-trees (which represent an Euclidean 1-space), employ scaling optimizations that restrict them to only handling degenerate data types.

One of the most important unsolved problems in database systems is the lack of a scalable, general algorithm for indexing n-dimensional vector spaces. Currently the literature only describes algorithms that function on Internet scales for fully degenerate data sets in low dimensionality vector spaces. Unfortunately, more and more important data types are explicitly non-degenerate and high-dimensionality, such as geospatial geometries or video, or implicitly non-degenerate and high-dimensionality, such as many relational operations common to analytical processing. Understanding how data set degeneracy interacts with the indexed access methods found in database literature is crucial to understanding scaling behavior in very large databases, particularly if one is attempting to index data sets that may not be degenerate to some degree or if one is attempting large-scale analytics that are implicitly high-dimensional.

Friday, April 10, 2009

A Renaissance? Not Exactly.

Gary Orenstein posted an article on GigaOM this morning titled "The Data Mining Renaissance" that elicited a comment from me about the proper characterization of the evolution of the data mining market. While I agree that inexpensive data mining platforms based Google-like technologies are proliferating, the renaissance has been in the economics of data mining rather than the capabilities of such systems.

This is an important distinction because much of the interest that has been driving the uptake of large-scale data mining technology is in many of the same kinds of applications that are poorly served by the existing technology. It is in many ways a self-limiting resurgence. Much of the interest in data mining technologies is contextual targeting and preference modeling, which to do well on a large scale require overcoming the Achille's Heels of current data mining technology, poor real-time analytic performance and inadequate support for high-dimensionality data models.

In other words, the "renaissance" is that expensive data mining technologies are becoming less expensive over time, which is not exactly a remarkable observation in the technology industry. Unfortunately, what is really needed is data mining technologies that are better. I fully expect we will get these better technologies sooner rather than later, after all that is my business, but when that happens it will look more like a revolution than a renaissance as it will enable capabilities that are currently unavailable at any price.

An Inaugural Post

Traditional data mining operates on a simple data stripped of most of its real-world context, analyzed long after the event that was being measured has passed. Use of complex data types, data fusion, and real-time analysis on a large scale is virtually unknown despite obvious, demonstrable value. Real-time, deeply contextual, large-scale data analysis -- "reality mining" -- is a very different kind of technology that can be used for applications far beyond the ken of conventional data mining technologies.

The absence of reality mining platforms is not a market oversight but a technological limit, the consequence of a known unsolved problem in theoretical computer science. Reality mining captures the tantalizing essence of what is just beyond our grasp, the ability to analyze the history of the mundane as it unfolds in exceptional detail. Not only does this enable the obvious ability to fuse the real with the virtual, it allows the possibility of creating unprecedented predictive and behavioral models of individuals, societies, and the world at large that can be exploited in real-time. The implications are frankly astounding and like with all powerful technologies it comes with the potential for great benefits and great hazards.

As someone who is intimately involved in many aspects of the nascent field of reality mining, this blog is broadly dedicated to the topic of measuring, modeling, and analyzing reality. This includes the practical implementation aspects, the theoretical computer science aspects, the sociocultural aspects, and the political policy aspects that these types of technologies have. Additionally, separating the fiction from the fact with respect to the claims made in the technology industry regarding related technologies like Internet-scale databases and predictive analytics will also see significant coverage.

Enjoy!

Reflections of a Reality Miner