All that Glitters Is Not Gold: Comparing Backtest And Out-Of-Sample Performance On A Large Cohort Of Trading Algorithms

Thomas Wiecki
Quantopian Inc

Andrew Campbell
Quantopian Inc

Justin Lent
Quantopian Inc

Jessica Stauth
Quantopian Inc

March 9, 2016

Abstract:

When automated trading strategies are developed and evaluated using backtests on historical pricing data, there is a tendency to overfit to the past. Using a unique dataset of 888 algorithmic trading strategies developed and backtested on the Quantopian platform with at least 6 months of out-of-sample performance, we study the prevalence and impact of backtest overfitting. Specifically, we find that commonly reported backtest evaluation metrics like the Sharpe ratio offer little value in predicting out-of-sample performance (R² < 0.025). In contrast, higher order moments, like volatility and maximum drawdown, as well as portfolio construction features, like hedging, show significant predictive value of relevance to quantitative finance practitioners. Moreover, in line with prior theoretical considerations, we find empirical evidence of overfitting: the more backtesting a quant has done for a strategy, the larger the discrepancy between backtest and out-of-sample performance. Finally, we show that by training non-linear machine learning classifiers on a variety of features that describe backtest behavior, out-of-sample performance can be predicted with much higher accuracy (R² = 0.17) on hold-out data than with linear, univariate features. A portfolio constructed from predictions on hold-out data performed significantly better out-of-sample than one constructed from the algorithms with the highest backtest Sharpe ratios.

Introduction:

When developing automated trading strategies, it is common practice to test algorithms on historical data, a procedure known as backtesting. Backtest results are often used as a proxy for the expected future performance of a strategy. Thus, in an effort to optimize expected out-of-sample (OOS) performance, quants often spend considerable time tuning algorithm parameters to produce optimal backtest performance on in-sample (IS) data. Several authors have pointed out how this practice of backtest “overfitting” can lead to strategies that fit specific noise patterns in the historical data rather than the signal that was meant to be exploited (Lopez de Prado [2013]; Bailey et al. [2014a]; Bailey et al. [2014b]). When deployed into out-of-sample trading, the expected returns of overfit strategies have been hypothesized to be random at best and consistently negative at worst.

The question of how predictive a backtest is of future performance is as critical as it is ubiquitous to quantitative asset managers, who often rely, at least partly, on backtest performance in their hiring and allocation decisions. In order to quantify backtest and out-of-sample performance, a large number of performance metrics have been proposed. While the Sharpe ratio (Sharpe [1966]) is the most widely known, it is probably also the most widely criticized (Spurgin [2001]; Lin and Chou [2003]; Lo [2009]; Bailey and Lopez de Prado [2014]). A large number of supposedly improved metrics, such as the information ratio or the Calmar ratio (Young [1991]), have been proposed, but it is unclear what predictive value each metric carries.

Backtest overfitting also appears to be a problem in the academic literature on quantitative finance, where trading strategies are frequently published whose impressive backtest performance does not seem to be matched by their OOS performance (for studies on overfitting see e.g. Schorfheide and Wolpin [2012]; McLean and Pontiff [2012]; Lopez de Prado [2013]; Bailey et al. [2014a]; Bailey et al. [2014b]; Beaudan [2013]; Burns [2006]; Harvey et al. [2014]; Harvey, Liu, & Zhu [2016]). A recent simulation study by Bailey et al. [2013] demonstrates how easy it is to achieve stellar backtest performance with a strategy that in reality has no edge in the market. Specifically, the authors simulate return paths with an expected Sharpe ratio of 0 and derive the probabilities of achieving Sharpe ratios well above 1 after trying a few strategy variations over a limited backtest time-frame. When no compensatory effects are present in the market, selecting such a strategy based on its in-sample Sharpe ratio will lead to a disappointing out-of-sample Sharpe ratio of 0. However, when compensatory market forces such as overcrowded investment opportunities are assumed to be at play, selecting strategies with a high in-sample Sharpe ratio would even lead to a negative out-of-sample Sharpe ratio. As these results are purely theoretical, it is not clear which of these two relationships between IS and OOS performance, zero or negative correlation, exists in reality.
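The selection effect described above can be illustrated with a small simulation. The following is only a minimal sketch, not the procedure used by Bailey et al.: it assumes i.i.d. normally distributed daily returns with zero mean, a zero risk-free rate, and independent in-sample and out-of-sample periods (the case without compensatory effects). All function names and parameter values (50 trials, 252 days, 1% daily volatility) are illustrative assumptions.

```python
import numpy as np

def annualized_sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio along the last axis (risk-free rate assumed 0)."""
    mean = daily_returns.mean(axis=-1)
    std = daily_returns.std(axis=-1, ddof=1)
    return np.sqrt(periods_per_year) * mean / std

def best_of_n_trials(n_trials=50, n_is_days=252, n_oos_days=252,
                     daily_vol=0.01, seed=0):
    """Simulate n_trials zero-edge strategy variations, pick the one with the
    best in-sample Sharpe ratio, and report its IS and OOS Sharpe ratios."""
    rng = np.random.default_rng(seed)
    # True expected return is 0 for every variation: no edge in the market.
    is_returns = rng.normal(0.0, daily_vol, size=(n_trials, n_is_days))
    oos_returns = rng.normal(0.0, daily_vol, size=(n_trials, n_oos_days))
    is_sharpe = annualized_sharpe(is_returns)
    oos_sharpe = annualized_sharpe(oos_returns)
    best = int(np.argmax(is_sharpe))  # selection by backtest performance
    return is_sharpe[best], oos_sharpe[best]

# Repeat the selection experiment to average out sampling noise.
results = np.array([best_of_n_trials(seed=s) for s in range(200)])
mean_is, mean_oos = results.mean(axis=0)
print(f"mean IS Sharpe of selected strategy:  {mean_is:.2f}")
print(f"mean OOS Sharpe of selected strategy: {mean_oos:.2f}")
```

Under these assumptions the selected strategy's in-sample Sharpe ratio is well above 1 on average, while its out-of-sample Sharpe ratio averages close to 0. Modeling compensatory market forces would require an additional assumption about how IS and OOS returns are coupled, which this sketch deliberately omits.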

In this study, we aim to provide an empirical answer to the question of how IS and OOS performance are related, based on a large data set, and to compare various performance metrics that have been proposed in the literature. Towards this goal, we have assembled a data set of 888 unique US equities trading algorithms developed on the Quantopian platform and backtested from 2010 through 2015 with at least 6 to 12 months of true OOS performance. Quantopian provides a web-based platform to research, develop, backtest and deploy trading algorithms. To date, users of the platform have run over 800,000 backtests. While the site terms-of-use strictly prohibit direct investigation of algorithm source code, we are granted access to the detailed data exhaust, returns, positions, and transactions that an algorithm generates when backtested over arbitrary date ranges. As the encrypted algorithm code is time-stamped in our database, we can easily determine exactly what historical market data the author had access to during development. We call this time prior to algorithm deployment the in-sample period. The simulated performance accumulated since an algorithm’s deployment date represents true out-of-sample data.
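The in-sample/out-of-sample split described above can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: the daily returns are synthetic, the deployment date is made up, and the helper names (`split_at_deployment`, `sharpe_ratio`, `max_drawdown`) are hypothetical. The Sharpe ratio is annualized assuming 252 trading days and a zero risk-free rate.

```python
import numpy as np
import pandas as pd

def split_at_deployment(returns, deployed_on):
    """Split a daily return series into the in-sample part (before the
    deployment date) and the out-of-sample part (deployment date onwards)."""
    deployed_on = pd.Timestamp(deployed_on)
    in_sample = returns[returns.index < deployed_on]
    out_of_sample = returns[returns.index >= deployed_on]
    return in_sample, out_of_sample

def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio (risk-free rate assumed 0)."""
    return np.sqrt(periods_per_year) * returns.mean() / returns.std()

def max_drawdown(returns):
    """Largest peak-to-trough loss of the cumulative return curve (<= 0)."""
    cumulative = (1.0 + returns).cumprod()
    running_peak = cumulative.cummax()
    return (cumulative / running_peak - 1.0).min()

# Synthetic daily returns standing in for one algorithm's simulated history.
rng = np.random.default_rng(42)
dates = pd.bdate_range("2014-01-01", "2015-12-31")
returns = pd.Series(rng.normal(0.0005, 0.01, size=len(dates)), index=dates)

# Hypothetical time stamp at which the algorithm code was deployed.
is_returns, oos_returns = split_at_deployment(returns, "2015-06-01")

for label, part in [("IS", is_returns), ("OOS", oos_returns)]:
    print(f"{label}: {len(part)} days, "
          f"Sharpe={sharpe_ratio(part):.2f}, "
          f"max drawdown={max_drawdown(part):.2%}")
```

Computing the same metrics on both sides of the deployment date yields one (IS, OOS) pair per algorithm and metric, which is the kind of pairing a comparison across a cohort of algorithms would rely on.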

As we will show below, single backtest performance metrics have very weak correlations with their out-of-sample equivalents (with some exceptions). This result by itself might lead to the conclusion that backtests carry very little predictive information about future performance. However, by applying machine learning algorithms to a variety of features designed to describe algorithm behavior, we show that OOS performance can indeed be predicted.