THE ART OF BACKTESTING
26 AUGUST 2015 / DR MATTHEW KILLEYA AND CLAUS SIMON
Backtesting is at the heart of systematic investment. Done correctly, it can recreate reality closely enough to identify systematic patterns which are likely to persist in the future. Patterns discovered by a robust backtest can be exploited to generate returns. But there are many subtle pitfalls to be avoided, and this is where the best researchers earn their salt.
INTRODUCTION
Systematic firms combine three key pillars: data, technology and people. Historical data related to financial instruments is critical to discovering and refining hypotheses. State-of-the-art technology is essential if one is to extract meaningful information from decades of such data: no serious statistical analysis can take place without serious computing. Although the third pillar is sometimes emphasised less than the first two,(1) people are just as crucial. High-powered computing and reams of data are worth little without the skill and expertise to extract meaningful information from them. Confusing data with information is a common mistake. Data is not information: only once data has been processed, aggregated, visualised and transformed does it become information. The distinction matters because if the processing, aggregation, transformation or visualisation is flawed, the resultant information is incorrect.(2)(3)
Any successful investment firm requires exceptional people in every area of its business, from operations and finance to research and software development. The best systematic firms hire teams of PhD scientists or quantitative researchers to analyse data.(4) Computing skills aside, scientific training is essential to extract information from data, but an abundance of computing power and software makes it easy to do statistical analysis badly. The subtleties involved in analysing data correctly are so important that much of the intellectual property of a systematic firm centres on the creation of clean and bias-free research pipelines.(5)
Let’s make this concrete through an example. Consider the general problem of building a portfolio of systematic strategies. Ultimately we want to discover stable relationships which have predicted financial markets and, most importantly, stable relationships which are likely to persist into the future.(6) But there are many nuances before we even attempt to build statistical models. In fact there are many more than we could ever hope to cover here. So let us focus on three examples.
THE PRICE IS RIGHT (…OR MAYBE NOT)
Many systematic firms have their origins in trading futures and forwards markets. In recent times some, ourselves included, have broadened their horizons to single stock cash equities and beyond.(7) Even with basic instruments such as cash equities, choices that appear straightforward at first are more difficult than one might think. For example, one might assume that finding the price of Apple shares can’t be that hard. Let’s take a look at the market price of Apple since its December 1980 IPO.
One might assume that the historical share price is the correct price to analyse. Most of the time that is true. However, raw price series such as the one above are often characterised by large sudden jumps (or drops) like the one seen in mid-2014. These drops in price do not correspond to corporate scandals or even investor losses; they occur when a corporation issues new shares to existing shareholders (effectively decreasing the price of the company’s minimum investible unit). In the case of Apple’s 2014 stock split, Apple, Inc. distributed 6 additional shares for each share held by investors. Given that the total equity didn’t change, the price per share fell by a factor of 7, so each shareholder held the same amount of equity as before. The example also demonstrates why the raw share price cannot be taken at face value: doing so would falsely imply a dramatic reduction in portfolio value.
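To make the arithmetic concrete, here is a minimal Python sketch of back-adjusting a price series for a split. The function name, the day-index convention and the prices are illustrative assumptions, not real quotes or any particular data vendor's method; only the 7-for-1 ratio comes from the example above.

```python
def back_adjust_for_splits(prices, splits):
    """Divide every price before a split by that split's ratio.

    prices: raw closing prices, one per trading day (hypothetical values).
    splits: list of (day_index, ratio) pairs, where day_index is the first
            day trading at the post-split price and ratio is the number of
            new shares per old share (7.0 for a 7-for-1 split).
    """
    adjusted = list(prices)
    for day_index, ratio in splits:
        for i in range(day_index):
            adjusted[i] /= ratio
    return adjusted


# Hypothetical raw closes around a 7-for-1 split that takes effect on day 3.
raw = [630.0, 644.0, 651.0, 93.0, 94.0]
adjusted = back_adjust_for_splits(raw, [(3, 7.0)])

# Return across the split, measured on the raw and on the adjusted series.
raw_return = raw[3] / raw[2] - 1.0
adjusted_return = adjusted[3] / adjusted[2] - 1.0

print(f"raw return across split:      {raw_return:.1%}")
print(f"adjusted return across split: {adjusted_return:.1%}")
```

On the raw series the split looks like a loss of roughly 86%, while on the adjusted series the return across the same day is zero, which is what the shareholder actually experienced.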
Thus the “adjusted price series”, in which these splits are factored out as below, is the correct series from which to compute returns. Stock splits are not the only events that need to be factored out. Stocks often pay dividends, and when they do, cash exits a business and goes to investors. The business’ equity has fallen and therefore shares, which are a claim on a business’ assets, must fall in price as well.(8) Dividends can be factored out similarly to splits, so raw prices may end up having multiple layers of adjustments.(9)
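Dividend adjustment can be sketched in the same spirit. The version below uses one common proportional convention, scaling all earlier prices by (previous close minus dividend) divided by previous close. That convention, the function name and the numbers are our assumptions for illustration; other conventions exist.

```python
def back_adjust_for_dividends(prices, dividends):
    """Scale prices before each ex-dividend day by a proportional factor.

    prices:    raw closing prices, one per trading day (hypothetical values).
    dividends: dict mapping ex-dividend day index -> cash dividend per share.
    The factor for each dividend is (previous close - dividend) / previous close.
    """
    adjusted = list(prices)
    for ex_day in sorted(dividends):
        previous_close = prices[ex_day - 1]
        factor = (previous_close - dividends[ex_day]) / previous_close
        for i in range(ex_day):
            adjusted[i] *= factor
    return adjusted


# Hypothetical stock that pays a dividend of 2.0 and goes ex-dividend on day 2.
raw = [100.0, 100.0, 98.0, 98.0]
adjusted = back_adjust_for_dividends(raw, {2: 2.0})

raw_return = raw[2] / raw[1] - 1.0                 # looks like a 2% loss
adjusted_return = adjusted[2] / adjusted[1] - 1.0  # investor is flat after cash received

print(f"raw return on ex-date:      {raw_return:.2%}")
print(f"adjusted return on ex-date: {adjusted_return:.2%}")
```

Applying this after a split adjustment gives the multiple layers of adjustment mentioned above.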
Even though corporate actions are rare in futures contracts, and futures do not entitle the holder to dividends, systematic firms typically still need to adjust historical futures prices in a similar way. Since futures contracts expire shortly after they cease to trade, firms must splice together individual expiring contracts if they wish to have a long, uninterrupted series of prices. Additional adjustments can be necessary too. Although futures exchanges establish clear specifications which outline the quantity, quality and type of a future’s deliverable, certain conditions may lead any two expirations to vary substantially. One recent example of this happened in the U.S. Treasury bond futures market, due to a lack of bond auctions 15 years prior.(10) In rare cases the actual underlying market may change, for example when the US gasoline market transitioned from regular gasoline via unleaded gasoline to the current reformulated gasoline contract.
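Splicing expiring contracts can be illustrated with a short Python sketch. It uses difference back-adjustment, in which the price gap between the two contracts on the roll day is added to all earlier history. This is one of several conventions and is our assumption here, as are the function name and the prices.

```python
def splice_contracts(expiring, next_contract, roll_day):
    """Join two futures price series with difference back-adjustment.

    expiring, next_contract: daily settlement prices on the same calendar,
                             both quoted on roll_day (hypothetical values).
    roll_day: index of the first day on which the next contract is used.
    """
    gap = next_contract[roll_day] - expiring[roll_day]
    shifted_history = [price + gap for price in expiring[:roll_day]]
    return shifted_history + list(next_contract[roll_day:])


# Two hypothetical contracts, each rising by 1.0 per day, trading 4.0 apart.
expiring = [100.0, 101.0, 102.0, 103.0]
next_contract = [104.0, 105.0, 106.0, 107.0, 108.0]

spliced = splice_contracts(expiring, next_contract, roll_day=3)

# Naive concatenation leaves a spurious jump on the roll day.
naive = expiring[:3] + next_contract[3:]

print("spliced:", spliced)
print("naive:  ", naive)
```

The naive series shows a move of 5.0 on the roll day that no position could have earned, whereas the spliced series preserves the daily change of 1.0 throughout.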
WHICH UNIVERSE?
If we concentrate for the moment on stocks and shares, a second seemingly innocuous question is which stocks we should analyse. What is the “universe”? A reasonable starting point might be to pick an index such as the FTSE 100 or the Dow Jones Industrial Average and use all the stocks in that index.(11) Given that index providers regularly alter the components of their indices, if we wish to have a static set of stocks, we must compile the constituents as of a particular date. Having fixed the set of stocks, one can proceed to build statistical models for that investible universe. However, the simple act of having fixed a date for the constituents can lead to deleterious biases, as the figure below demonstrates.
To understand the bias, imagine, for example, two hypothetical portfolios during the 2008 financial crisis:
- Portfolio A contains only companies that were part of the S&P 100 in 2007
- Portfolio B contains only companies that were part of the same index in 2009.
Portfolio A contains, among others, companies that went bust during the 2008 financial crisis. However, portfolio B does not. Clearly then, since A contains companies that “went to zero” in 2008, but B does not, A’s returns will have been much worse than B’s during the crisis. This thought experiment is exactly what the chart above demonstrates: the backtest with constituents chosen as of the most recent time performs the best and the one set at the start of the backtest performs worst. Why is this? Well, for our long-only strategy, using the most recent index will exclude companies that underperformed, or even went bust, over the course of the history. The problem is that, at the start of the backtest, we wouldn’t have known which stocks would later be in the index. A nasty forward-looking statistical bias has crept into our analysis.(12) We chose a universe that we wouldn’t have known at the time, and because of the way the index is constructed, we get stocks that have gone up.
The figure also shows the same strategy with a dynamic universe, where the stock universe changes each month to reflect index changes. This is a fairer and truer selection and is essential for clean and bias-free backtests.(13) We see in the chart above that this effect is actually quite strong. Over the course of a 15-year period, the ’09 portfolio finishes over 10 percentage points above the ’07 portfolio.
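A dynamic universe amounts to asking, on each rebalance date, which stocks were index members on that date. A minimal Python sketch follows; the membership table, tickers and dates are entirely hypothetical, and the function name is our own.

```python
from datetime import date


def constituents_as_of(membership, as_of):
    """Return the tickers that were index members on the date `as_of`.

    membership: list of (ticker, joined, left) tuples, where `left` is None
                for stocks that are still members.
    """
    return {
        ticker
        for ticker, joined, left in membership
        if joined <= as_of and (left is None or as_of < left)
    }


# Hypothetical membership history: BBB drops out during the crisis and
# CCC replaces it.
membership = [
    ("AAA", date(2000, 1, 1), None),
    ("BBB", date(2000, 1, 1), date(2008, 10, 1)),
    ("CCC", date(2008, 10, 1), None),
]

universe_2007 = constituents_as_of(membership, date(2007, 6, 30))
universe_2009 = constituents_as_of(membership, date(2009, 6, 30))

print("2007 universe:", sorted(universe_2007))
print("2009 universe:", sorted(universe_2009))
```

A backtest that holds `universe_2009` throughout never owns BBB, the stock that failed. Calling `constituents_as_of` with each rebalance date instead keeps BBB in the portfolio until it actually left the index.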
There are other seemingly plausible ways to choose a universe that introduce pernicious biases. For example, stocks with short histories can be a problem for statistical models which need enough data to learn patterns. A simple solution would be to discard them from our universe prior to backtest. However, we can’t do this because, again, we wouldn’t know which stocks are going to have short histories prior to entering positions. Stocks with short histories tend, on average, to have fallen in price, whereas those with long track records tend to have risen.(14) If a stock keeps going down then it is going to exit the index because its market capitalisation will be too low. If the price keeps going down, eventually it reaches zero and the firm ceases to exist.
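The difference between filtering with hindsight and filtering with only the information available on the day can be sketched as follows. This is an illustration under our own assumptions: the listing table, the tickers, the 60-day threshold and both function names are hypothetical.

```python
def full_sample_filter(listings, min_history):
    """Biased filter: keeps stocks by total lifetime, known only in hindsight."""
    return {
        ticker
        for ticker, (first_day, last_day) in listings.items()
        if last_day - first_day + 1 >= min_history
    }


def point_in_time_filter(listings, day, min_history):
    """Keeps stocks that are listed on `day` with enough history so far."""
    return {
        ticker
        for ticker, (first_day, last_day) in listings.items()
        if first_day <= day <= last_day and day - first_day + 1 >= min_history
    }


# Hypothetical (first listed day, last listed day) for three stocks.
listings = {"OLD": (0, 299), "NEW": (250, 299), "GONE": (0, 40)}

hindsight_universe = full_sample_filter(listings, min_history=60)
universe_day_10 = point_in_time_filter(listings, day=10, min_history=60)
universe_day_100 = point_in_time_filter(listings, day=100, min_history=60)

print("hindsight:", sorted(hindsight_universe))
print("day 10:   ", sorted(universe_day_10))
print("day 100:  ", sorted(universe_day_100))
```

On day 10, OLD and GONE have identical histories, yet the hindsight filter keeps OLD and drops GONE because it already knows GONE will disappear. The point-in-time filter treats the two identically on that day.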
In practice, systematic firms use historical data to discover and harvest more exotic sources of alpha, such as momentum, value and quality, than traditional long-only investment.(15) These approaches often increase returns, but the additional complexity can make these biases subtler still to detect.
You might assume the arguments above don’t apply to macro contracts such as futures and forwards. Anyone trading currencies in the early nineties would disagree. Positions in French Francs, German Marks, Italian Lira and many others converged into positions in Euro.(16) Those currencies no longer exist, but they would have been in the portfolio over that period. During mid-2015, it seemed as if the Greek Drachma was about to make an unexpected reappearance…
THE PRICE IS RIGHT II: BUYING LOW AND SELLING HIGH
Let’s now backtest something slightly more complex than the long-only example. Let’s consider a very simple buy low/sell high strategy that rebalances daily. The idea of buying undervalued assets and selling over-valued assets is as old as investment itself.
However, as we shall see, the clichéd old aphorism is true: the devil is in the detail. For this experiment we will:
- Use a universe of 100 stocks
- Buy 1 unit of the lowest priced stocks (those below the median price)
- Sell 1 unit of the highest priced stocks (those above the median price)
- Re-balance positions every day
- Give the strategy a fixed risk allocation (in this case 10% volatility)
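The position rule in the list above can be sketched in a few lines of Python. The function name, tickers and prices are hypothetical, and the risk allocation step is deliberately left out of this sketch.

```python
from statistics import median


def buy_low_sell_high(prices):
    """Return +1 for stocks priced below the median, -1 for those above it.

    prices: dict mapping ticker -> price on the rebalance day.
    A stock priced exactly at the median gets no position.
    """
    mid = median(prices.values())
    positions = {}
    for ticker, price in prices.items():
        if price < mid:
            positions[ticker] = 1
        elif price > mid:
            positions[ticker] = -1
        else:
            positions[ticker] = 0
    return positions


# Four hypothetical stocks standing in for the 100-stock universe.
todays_prices = {"A": 10.0, "B": 20.0, "C": 30.0, "D": 40.0}
positions = buy_low_sell_high(todays_prices)

print(positions)
```

Rebalancing daily means calling this function with each day's prices. Which price series is passed in matters a great deal.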
The figure below shows the performance of the strategy.
The performance of the strategy looks rather good.(17) A key working assumption of a quantitative analyst is that if something simple (and frankly, in this case, nonsense) looks surprisingly good, there has to be a mistake. So what have we done wrong? Returns are from the adjusted prices we established as necessary in the first section. There is no pollution from splits and dividends in the evaluation of the strategy’s performance. What about positions in these stocks? Buy low/sell high, based on price. What price? The same price we used for returns. Herein lies the problem. Adjustments make sense for returns, but not for signals based on absolute price level. In fact, stocks that have gone up through history will tend to have had several splits, as directors try to keep them within the ideal investible range. Applying splits back through history in these cases results in a low adjusted price historically. Thus our strategy tends to go long stocks that subsequently go up. This would be fine, except that we can never know about future splits, and the low prices of these stocks are entirely a consequence of those splits. Thus the money we make in our backtest is a mirage.(18)
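The mechanism can be reproduced with a small Python sketch. The tickers, prices and split ratios are hypothetical, and the helper names are our own. The point is only that a back-adjusted price on a past date already contains information about splits that had not yet happened.

```python
from statistics import median


def back_adjusted_price(raw_price, later_split_ratios):
    """Price on a past date after dividing by every split that came later."""
    adjusted = raw_price
    for ratio in later_split_ratios:
        adjusted /= ratio
    return adjusted


def signal(prices):
    """+1 below the median price, -1 above it, 0 exactly at the median."""
    mid = median(prices.values())
    return {t: (1 if p < mid else -1 if p > mid else 0) for t, p in prices.items()}


# Prices actually observable on some past date (hypothetical values).
raw_then = {"WINNER": 80.0, "A": 40.0, "B": 50.0, "C": 70.0}

# Splits that happened after that date. WINNER rose and split 2-for-1 twice.
later_splits = {"WINNER": [2.0, 2.0], "A": [], "B": [], "C": []}

adjusted_then = {
    ticker: back_adjusted_price(price, later_splits[ticker])
    for ticker, price in raw_then.items()
}

raw_signal = signal(raw_then)
adjusted_signal = signal(adjusted_then)

print("signal from raw prices:     ", raw_signal)
print("signal from adjusted prices:", adjusted_signal)
```

Using the prices that were observable at the time, WINNER is the most expensive stock and the strategy sells it. Using back-adjusted prices, WINNER looks like the cheapest stock and the strategy buys it, purely because of splits that lay in the future.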
IN CONCLUSION
Cheap, unlimited computing power and vast databases of historical data have kick-started a technological revolution in investment that mirrors the one we all see in many other aspects of our lives. But computers are only as smart as the people who program them and, as we have seen, there are many pitfalls.(19) Knowledge of these pitfalls is part of the collection of intellectual property that firms like ours have: it is where our “edge” lies. Investors should be reassured that highly qualified scientists, complete with a healthy scepticism for the sensational, concern themselves with these issues and address them on their behalf.
We have just scratched the surface in terms of backtest biases. Overfitting, including the right contracts and cognitive biases are other areas in which researchers can obtain superficially plausible results that will fail to materialise when the models contact reality. The list is long,(20) and no doubt we will write about them in future posts.