Data mining

The following was published on 6/1/2004 at




Investing Strategies


Models that work well on data about the past may not work in the future.

Check methods for weak points, like overfitting or ignoring illiquidity or business relationships.

Keep in mind some practical considerations when testing a theory.


Other Areas of Data-Mining

In 1992-1993, there were a number of bright investors who had “picked the lock” of the residential mortgage-backed securities market. Many of them had estimated complex multifactor relationships that allowed them to estimate the likely amount of mortgage prepayment within mortgage pools.

Armed with that knowledge, they bought some of the riskiest securities backed by portions of the cash flows from the pools. They probably estimated the past relationships properly, but the models failed when no-cost prepayment became common, and failed again when the Federal Reserve raised rates aggressively in 1994. The failures were astounding: David Askin’s hedge funds, Orange County, the funds at Piper Jaffray that Worth Bruntjen managed, some small life insurers, etc. If that wasn’t enough, there were many major financial institutions that dropped billions on this trade without failing.

What’s the lesson? Models that worked well in the past might not work so well in the future, particularly at high degrees of leverage. Small deviations from what made the relationship work in the past can be amplified by leverage into huge disasters.
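
The leverage arithmetic behind that lesson can be sketched in a few lines. This is only an illustration under simple assumptions: the function name `equity_return`, the 5% asset loss, and the leverage levels are all made up for the example, and funding costs are set to zero unless stated.

```python
def equity_return(asset_return, leverage, funding_cost=0.0):
    """Return on the investor's own capital for a levered position.

    leverage is total assets divided by equity, so each unit of equity
    carries (leverage - 1) units of borrowed money. All inputs are
    hypothetical; this is not a model of any fund named in the text.
    """
    return leverage * asset_return - (leverage - 1) * funding_cost


# The same small miss on the assets, seen at different leverage levels.
asset_loss = -0.05  # assumed: the model misses and the assets lose 5%
for leverage in (1, 5, 10, 20):
    result = equity_return(asset_loss, leverage)
    print(f"leverage {leverage:>2}x -> return on equity {result:.0%}")
```

At 1x the 5% miss is an annoyance. At 20x, under these assumptions, it consumes all of the equity.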

I recommend Victor Niederhoffer and Laurel Kenner’s book, Practical Speculation, because the first half of the book is very good at debunking data-mining. But it also mines data on occasion. In Chapter 9, for example, the authors test methods to improve on buying and holding the index over long periods by adjusting position sizes based on the results of prior years. So many variations were tested that at least one of them was likely to show something that would have worked in the past. My guess is that the significant results there are a statistical fluke and may not work in the future. The results did not work in the recent 2000-2002 downturn.
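
To see why testing many variations invites a fluke, here is a rough sketch. It is not a reconstruction of the tests in the book: the function names, the 5% significance level, the 15% volatility, and the count of 50 rules are all assumptions chosen for illustration, and the formula treats the tests as independent, which real trading rules rarely are.

```python
import random
import statistics


def chance_of_false_hit(n_tests, significance=0.05):
    """Probability that at least one of n independent tests looks
    'significant' by luck alone, when none has a real edge."""
    return 1.0 - (1.0 - significance) ** n_tests


def best_t_stat(n_rules, n_years=30, seed=0):
    """Simulate n_rules worthless rules and return the best t-statistic.

    Each rule's yearly excess return is pure noise with a true mean of
    zero (15% standard deviation is an assumed figure).
    """
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(n_rules):
        excess = [rng.gauss(0.0, 0.15) for _ in range(n_years)]
        std_error = statistics.stdev(excess) / n_years ** 0.5
        best = max(best, statistics.mean(excess) / std_error)
    return best


# With 20 independent tests at the 5% level, the chance of at least one
# false "discovery" is roughly 0.64.
print(round(chance_of_false_hit(20), 2))

# The best of many worthless rules usually looks better than any one alone.
print(best_t_stat(1), best_t_stat(50))
```

The more rules are tried on the same history, the better the best one tends to look, even when none of them has any edge at all.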

As an aside, one of the reasons Niederhoffer’s hedge fund blew up is that he placed too much trust in the idea that the data could tell him what events could not happen. The market has a funny way of doing what everyone “knows” it can’t, particularly when a majority of market participants rely on an event not happening. In this case, Niederhoffer knew that when U.S. banks fall by 90% in price and survive, typically they are a good value. Applying that same insight to banks in Thailand demanded too much of the data, and was fatal to his funds.

What to Watch Out for

Investors who are aware of data-mining and its dangers can spot trouble when they review quantitative analyses by looking for these seven signals:

1. Small changes in method lead to big changes in results. In these cases, the method has likely been too highly optimized. It may have achieved good results in the past by overfitting the model, which, to return to the earlier analogy, treats some of the noise of the past as a signal.

2. Good modeling takes into account the illiquidity of certain sectors of the market. Any method that comes out with a result that indicates you should invest a large percentage of money in a small asset class or small stock should be questioned. Illiquid or esoteric assets should be modeled with a liquidity penalty for investment. They can’t be traded, except at a high cost.

3. Be careful of models that force frequent trading, particularly if they ignore commission costs, bid/ask spreads, and, if you are large enough relative to the market, market impact costs. These factors make up a large portion of what is called implementation shortfall. In general, implementation shortfall often eats up half of the excess returns predicted by back-testing, even when back-testing is done with an eye to avoiding data-mining.

For a full description of the pitfalls of implementation shortfall, read Investing by the Numbers, by Jarrod X. Wilcox. Chapter 10 discusses this issue in detail. This is the best single book I know of on quantitative methods in investing.

4. Be careful when a method uses a huge number of screens in order to come down to a tiny number of stocks and then, with little or no further analysis, says these are the ones to buy or sell. Though the method may have worked very well in the past, accounting data are, by their very nature, approximate and manipulable; they require further processing in order to be useful. Screening only winnows down the universe of stocks to a number small enough for security analysis to begin. It can never be a substitute for security analysis.

5. Avoid using quantitative methods that lack a rational business explanation. Effective quantitative methods usually come from processes that mimic the actions of intelligent businessmen. Never confuse correlation with causation. Sometimes two economic variables with little obvious financial relationship to each other will show a statistically significant relationship in the past. The fact that two financial variables were correlated in the past does not mean that they will be so in the future. This is particularly true when there is no business reason that relates them.

6. Look for the use of a control. A control is a portion of the data series not used to estimate the relationship. It’s left to the side to test the relationship after the “best” model is chosen. Often, the control will indicate that the “best” method isn’t all that good. And beware of methods that use the control data multiple times in order to test the best methods. That defeats the purpose of a control by data-mining the control sample.

7. One of the trends in accounting is to make increasingly detailed rules in a wrongheaded attempt to fit each individual company more precisely. The problem is that this makes many ratios difficult to compare across companies and industries without extra massaging to make the data comparable. That, in turn, makes screening less useful as a tool for thinning out a stock universe. For quantitative analysis to succeed, the data need to represent the same thing across different firms.
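
The control described in the sixth signal can be sketched as follows. This is a minimal illustration, not a full back-testing procedure: the function names, the chronological split, the use of mean return as the score, and the toy numbers are all assumptions made for the example.

```python
import statistics


def split_chronologically(returns, control_fraction=0.3):
    """Split a series (oldest first) into an estimation sample and a
    control sample made of the most recent observations."""
    cut = len(returns) - round(len(returns) * control_fraction)
    return returns[:cut], returns[cut:]


def pick_and_check(rule_returns, control_fraction=0.3):
    """Choose the 'best' rule using the estimation sample only, then
    look at the control sample once, for that rule alone.

    rule_returns maps a hypothetical rule name to its periodic returns.
    """
    in_sample = {}
    for name, returns in rule_returns.items():
        estimation, _ = split_chronologically(returns, control_fraction)
        in_sample[name] = statistics.mean(estimation)
    best = max(in_sample, key=in_sample.get)
    _, control = split_chronologically(rule_returns[best], control_fraction)
    return best, in_sample[best], statistics.mean(control)


# Toy data: rule "a" looks best in the estimation half, then falls apart.
toy = {
    "a": [0.10, 0.10, -0.20, -0.20],
    "b": [0.01, 0.01, 0.01, 0.01],
}
print(pick_and_check(toy, control_fraction=0.5))
```

The control sample is read exactly once here, after the choice is made. Looping back to pick a different rule based on the control result would turn the control into just another part of the estimation data.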

Practical Recommendations

There are many pitfalls in quantitative analysis. But three simple considerations will help protect investors from the dangers of data-mining.

1. Paper trade any new quantitative method that you consider using. Be sure to charge yourself reasonable commissions, and take into account the bid/ask spread. Take into account market impact costs if you are trading in a particularly illiquid market. Even after all this, remember that your real-world results often will underperform the model.

2. Think in terms of sustainable competitive advantage. What are you bringing to the process that is not easily replicable? How does the method allow you to use your business judgment? Is the method so commonly used that even if it is a good
