Limitations Of Quantitative Claims About Trading Strategy Evaluation

July 15, 2016

Abstract:

One of the key assumptions of quantitative trading strategy evaluation is that Type II errors (missed discoveries) are preferable to Type I errors (false discoveries). However, practitioners have long known that the statistical properties of some genuine trading strategies are often indistinguishable from those of random trading strategies. Therefore, any adjustment of statistics to guard against p-hacking increases Type II error unless the power of the test is high. At the same time, the power of the test is limited by insufficient samples and changing market conditions. Furthermore, genuine strategies with statistical properties similar to those of random strategies may be overfitted to favorable market conditions and fail when those conditions change. These facts severely limit the effectiveness of quantitative claims about trading strategy evaluation. Practitioners have instead resorted to Monte Carlo simulations and stochastic modeling in an effort to increase the chances of identifying robust trading strategies, but these methods also have severe limitations due to changing market conditions, selection bias and data snooping. In this paper we present two examples that demonstrate the limitations of quantitative evaluation of trading strategies, and we claim that the most effective way of guarding against overfitting and selection bias is to limit the application of backtesting to a class of strategies that employ similar but simple predictors of price. We also claim that determining when market conditions change is in many cases fundamentally more important than any quantitative claim about trading strategy evaluation.

Limitations Of Quantitative Claims About Trading Strategy Evaluation – Introduction

Traders do not always have the luxury of testing trading strategies on a forward sample with real money, because this takes time and is costly if the strategy fails. Therefore, traders look for ex-ante measures of the robustness of trading strategies and resort to quantitative analysis to obtain them. Usually, a trading strategy is developed on in-sample data and validated on out-of-sample data. Although the academic community has been aware of the limited effectiveness of out-of-sample validation when multiple trials are involved, the practitioner community has been slow to recognize these problems.

Three papers from the academic community, among several others, have recently increased the awareness of the practitioner community about backtest overfitting and multiple trials when developing and evaluating trading strategies. However, the results in these papers deal only with the part of the problem related to overfitting and selection bias but not with the more important problem of the effect of changing market conditions on genuine strategies that overfit on persisting market conditions.

Methods for adjusting the Sharpe ratio to account for multiple trials, yielding what is called the haircut Sharpe ratio, are discussed in Harvey and Liu (2015). As we shall demonstrate with an example, these adjustments cannot guard against Type I error (false discoveries) when a genuine strategy is used but market conditions change. In Bailey et al. (2014), a different method is presented for determining the minimum backtest length required to assess the risk of overfitting, as a function of the number of trials involved in the development of a strategy. This method also fails to address the important problem of changing market conditions, as further acknowledged in Bailey et al. (2015). Neither method deals with the major cause of failure of genuine strategies, namely that they are overfitted to favorable market conditions and then fail when those conditions change, although both are of value in the case of strategies developed via machine learning.
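To make the idea of a Sharpe ratio haircut concrete, the following is a minimal sketch and not necessarily the exact procedure of Harvey and Liu (2015). It assumes a Bonferroni adjustment (one simple multiple-testing correction), i.i.d. and approximately normal returns, and independent trials. The function name `haircut_sharpe` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def haircut_sharpe(sharpe_annual: float, years: float, n_trials: int) -> float:
    """Bonferroni-adjusted magnitude of an annualized Sharpe ratio.

    Assumptions (illustrative, not taken from the source): returns are i.i.d.
    and roughly normal, so the t-statistic is about sharpe_annual * sqrt(years),
    and the n_trials backtests are treated as independent tests.
    """
    t_stat = abs(sharpe_annual) * sqrt(years)
    # Two-sided p-value of the single-test t-statistic (normal approximation).
    p_single = 2.0 * _STD_NORMAL.cdf(-t_stat)
    # Bonferroni: multiply the p-value by the number of trials and cap it at 1.
    p_adjusted = min(1.0, n_trials * p_single)
    if p_adjusted <= 0.0:
        # The p-value underflowed to zero; the haircut cannot be resolved.
        return abs(sharpe_annual)
    # Map the adjusted p-value back to a t-statistic, then to a Sharpe ratio.
    t_adjusted = -_STD_NORMAL.inv_cdf(p_adjusted / 2.0)
    return max(0.0, t_adjusted) / sqrt(years)
```

Under these assumptions, a single trial leaves the Sharpe ratio essentially unchanged, while a large number of trials can shrink it to zero. The adjustment is computed entirely from the historical sample, so it carries no information about a future change in market conditions.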

In Novy-Marx (2016), a distinction is made between pure selection bias, overfitting bias, and their combination, in the case of multiple signals. Critical values of the T-statistic, intended to correct for the resulting data-mining bias, are offered for pure selection and for combinations of signals, called the best k-of-n strategy, as a function of the number of signals considered. While the results in the paper are interesting, one drawback is that they are based on generating random signals with real stock data from January 1995 to December 2014. Combining the random signals yields significant strategies with high values of the T-statistic. However, traders are actually interested in how popular trading rules perform in forecasting out-of-sample returns. It is also not entirely clear from this paper how Type II error is affected by correcting for data-mining bias using the critical values obtained from combining random signals. Discarding strategies that have a high probability of performing well carries an opportunity cost. After all, the job of a trader is to trade, not to perpetually analyze and evaluate strategies.
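The trade-off between a stricter critical value and Type II error can be illustrated with a small sketch. It does not reproduce the values in Novy-Marx (2016) and does not cover the best k-of-n combinations. It assumes n independent signals whose T-statistics are standard normal under the null, and it uses a Šidák-style family-wise correction. The function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def critical_t_best_of_n(n_signals: int, alpha: float = 0.05) -> float:
    """Two-sided critical |t| for the best of n independent null signals."""
    # Per-test level such that the chance of any false discovery equals alpha.
    per_test_alpha = 1.0 - (1.0 - alpha) ** (1.0 / n_signals)
    return -_STD_NORMAL.inv_cdf(per_test_alpha / 2.0)


def power(true_t_mean: float, critical_t: float) -> float:
    """P(|T| > critical_t) when T ~ N(true_t_mean, 1), i.e. 1 - Type II error."""
    upper_tail = 1.0 - _STD_NORMAL.cdf(critical_t - true_t_mean)
    lower_tail = _STD_NORMAL.cdf(-critical_t - true_t_mean)
    return upper_tail + lower_tail
```

For example, comparing `power(2.5, critical_t_best_of_n(1))` with `power(2.5, critical_t_best_of_n(100))` shows that, for a genuine strategy with a fixed expected T-statistic, the probability of detection falls as the number of signals considered grows. This is the opportunity cost referred to above.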

An important contribution of the aforementioned three papers is that they have raised awareness about the impact of overfitting, selection bias and data snooping, especially at a time of renewed interest in machine learning applications to trading strategy discovery. However, practitioners have long known that genuine trading strategies fail primarily when market conditions change, because they cannot maintain a positive expectation. One reason for the slow adoption of quantitative methods in strategy evaluation by practitioners is their limited effectiveness, especially when these methods become another metric that guides the strategy development process. In such a case, instead of minimizing data-mining bias, these quantitative methods become another factor that contributes to it.

In addition to the efforts by the academic community, practitioners of trading strategy evaluation have also attempted to deal with the problem of overfitting, selection bias and data-snooping with various ad-hoc quantitative methods.

In his popular book, Evidence-based Technical Analysis, David Aronson (2007) introduces the bootstrap and Monte Carlo permutation methods for generating the sampling distributions used for statistical inference. In the case of the bootstrap, the null hypothesis is that the mean return of the trading strategy is 0 and, in the case of the Monte Carlo permutation, the null hypothesis is that the strategy possesses no intelligence in forecasting market returns. Aronson acknowledges that this approach is valid for independent trials and provides a set of heuristics for minimizing overfitting and selection bias, which are two components of data-mining bias (Harris, 2015). These heuristics involve limiting the number of tested rules, increasing the sample size, considering correlated backtest results, and limiting outliers and the variation of backtest results. However, these heuristics cannot limit the adverse impact of changing market conditions on genuine trading strategies, which is also the primary reason for failure.
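The two null hypotheses described above can be sketched for a single rule as follows. This is a simplified illustration and not Aronson's exact procedure: detrending of market returns and any correction for multiple rules are omitted, and the function names, default resample counts and seeds are hypothetical.

```python
import random
from statistics import fmean


def bootstrap_p_value(returns, n_resamples=2000, seed=0):
    """One-sided p-value for H0: the mean strategy return is 0."""
    rng = random.Random(seed)
    observed = fmean(returns)
    # Impose the null by shifting the sample so that its mean is exactly zero.
    centered = [r - observed for r in returns]
    hits = 0
    for _ in range(n_resamples):
        resample = rng.choices(centered, k=len(centered))  # with replacement
        if fmean(resample) >= observed:
            hits += 1
    return hits / n_resamples


def permutation_p_value(positions, market_returns, n_permutations=2000, seed=0):
    """One-sided p-value for H0: positions carry no information about returns."""
    rng = random.Random(seed)
    observed = fmean(p * r for p, r in zip(positions, market_returns))
    shuffled = list(market_returns)
    hits = 0
    for _ in range(n_permutations):
        # Break any link between the rule's positions and market returns.
        rng.shuffle(shuffled)
        if fmean(p * r for p, r in zip(positions, shuffled)) >= observed:
            hits += 1
    return hits / n_permutations
```

Both functions resample only the returns that were observed historically, so a small p-value says nothing about how the rule would behave under market conditions that are absent from the sample.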

Another method, called System Parameter Permutation (SPP), and its recent variant, System Parameter Randomization (SPR), were suggested by Walton (2014). This method applies a stochastic modeling approach to obtain short-run and long-run performance estimates. The main advantage of SPP is that it does not rely on out-of-sample validation, which decreases data-snooping bias and increases the power of the tests because the full sample is used. However, the first problem with SPP is that it requires selecting ex-ante a range of system parameters, which are subsequently varied to generate sampling distributions. This is problematic because data-mining bias arises not only from overfitting but also from selection bias. In many cases selection bias is the main contributor to data-mining bias, for example when strategies have no parameters to vary. The second problem with SPP is that if it is used repeatedly under multiple trials, it loses its effectiveness due to data snooping. The third and more serious problem is that all tests are conditioned on historical data, and the probability of a Type I error is high under a change in market conditions. Therefore, SPP does not answer the following crucial question: How will strategy performance be affected if market conditions arise that are fundamentally different from those encountered during the analysis? As we shall see in Section 3.1 via an example, there is nothing SPP can do to anticipate a failure due to a massive change in market conditions.
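The mechanics of a parameter permutation can be sketched as follows, assuming that the procedure amounts to backtesting every combination in a pre-selected parameter range and summarizing the resulting distribution, for instance by its median. This is an illustration of the idea and not Walton's implementation. The moving-average crossover rule, the function names and the parameter ranges are all hypothetical.

```python
from statistics import fmean, median


def backtest_ma_cross(prices, fast, slow):
    """Mean per-period return of a long/flat moving-average crossover rule."""
    period_returns = []
    for t in range(slow - 1, len(prices) - 1):
        fast_ma = fmean(prices[t - fast + 1 : t + 1])
        slow_ma = fmean(prices[t - slow + 1 : t + 1])
        # The position is decided at time t and earns the return from t to t+1.
        position = 1 if fast_ma > slow_ma else 0
        period_returns.append(position * (prices[t + 1] / prices[t] - 1.0))
    return fmean(period_returns) if period_returns else 0.0


def parameter_permutation(prices, fast_values, slow_values):
    """Sorted backtest results over every valid (fast, slow) combination."""
    results = [
        backtest_ma_cross(prices, fast, slow)
        for fast in fast_values
        for slow in slow_values
        if fast < slow
    ]
    return sorted(results)
```

A call such as `median(parameter_permutation(prices, range(5, 10), range(20, 30, 5)))` then gives a single summary of performance across the chosen range. The ranges must be fixed before the results are seen, and every value in the distribution is computed from the same price history, which is why the three problems described above remain.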