The Setup (Revisited)
In Part 1 of this series we discussed the background and problem setup for applying deep learning to predict whether a stock will outperform the median performance of all stocks over a one-year period. To make this prediction, we feed the model historical company fundamental and price data. By fundamental data we mean information found in a company’s financial statements. Because we use a recurrent neural network (RNN), at each time step (month) the model can make a prediction using, if needed, all of the historical price and fundamental data up to that time. In Part 1 we used the following diagram to visualize this setup:
In the above, the model uses the data in the columns “Fundamental Data at Time t” (called training inputs) to predict the outcome in the column “Outcomes at Time t+12” (called training targets). Here, “+1” means the company outperformed the median performance for the period t to t+12, and “-1” means it did not. The actual output of the model is a probability that the outcome will be +1.
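The labeling rule above can be sketched in a few lines. This is an illustrative example only; the tickers and forward returns are invented, and the real pipeline operates on the full monthly cross-section:

```python
import numpy as np

# Hypothetical 12-month forward returns for a small cross-section of stocks
# (tickers and values are invented for this sketch).
forward_returns = {"AAA": 0.25, "BBB": -0.10, "CCC": 0.07, "DDD": 0.31}

median_return = np.median(list(forward_returns.values()))

# Label each stock +1 if it beat the cross-sectional median over t to t+12,
# and -1 otherwise; the model is trained to output P(label == +1).
labels = {ticker: (1 if r > median_return else -1)
          for ticker, r in forward_returns.items()}
```

Note that because the labels are defined relative to the median, roughly half of the stocks in any month are labeled +1 and half -1, which keeps the two classes balanced by construction.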
The Data Universe
The data universe we consider includes any stock that has been public for at least 36 months and traded on the NYSE, NASDAQ, or AMEX exchanges between January 1970 and December 2015. However, non-US-based companies, companies in the financial sector, and companies with market capitalizations that, when adjusted by the S&P500 Index Price to January 2010, are less than 100 million US dollars are excluded from the dataset.
It is common practice in quantitative studies to exclude financials. Researchers generally give the rationale that balance sheet leverage at a financial company has a very different meaning than at an operating company. Whether or not that is a good reason, we do it here for comparability to existing research.
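The universe filters described above can be sketched as a simple predicate. The record layout and field names here are assumptions made for illustration, not the study’s actual schema:

```python
def in_universe(row, sp500_adjustment):
    """Return True if a stock-month record passes the universe filters.

    `row` is a hypothetical record; its field names are assumptions
    for this sketch. `sp500_adjustment` scales market cap to January
    2010 S&P 500 Index price levels before applying the size cutoff.
    """
    adjusted_mktcap = row["market_cap"] * sp500_adjustment

    return (
        row["months_public"] >= 36
        and row["exchange"] in {"NYSE", "NASDAQ", "AMEX"}
        and row["us_based"]
        and row["sector"] != "Financials"
        and adjusted_mktcap >= 100e6  # at least $100M, index-adjusted
    )
```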
The dataset spans forty-five years (540 months) from 1970 to 2015. For each month, there are approximately 1,300 to 5,000 companies. Because many companies have come and gone over the last forty-five years, the entire dataset actually represents approximately 10,000 individual companies.
Structuring the Data for Deep and Recurrent Neural Networks
In the above table, each row in the data represents a step in the historical sequence of a company’s evolution, with both inputs and target outputs for training the model. But what specific inputs should we use and how should they be represented?
In a typical quantitative investment project, this is the stage where factor (feature) engineering begins. We might investigate ratios that are derived from fundamental data, such as price-to-earnings, book-to-market, return-on-equity, return-on-invested-capital, and debt-to-equity. We might explore factors others have shown to work, factors we believe will work (say, for economic reasons), and perhaps what successful investors and analysts think is predictive. We might even employ “feature selection algorithms” to empirically test which factors have the most predictive power.
With this project, using deep learning, we took a different approach. One of the appealing qualities of deep learning models is their ability to discover useful features directly from raw data. That is, instead of training a model on engineered ratios like price-to-book or price-to-earnings, we simply provide the model with earnings, prices, book values, and other fundamental measures, and allow the model to figure out how to combine these measures mathematically in a way that produces the best result. The advantages of this approach are that (1) we don’t bias the model toward features we already believe in, and (2) the model may discover predictive features that we would never otherwise consider.
In addition, by using recurrent neural networks, we allow the learning process to discover the time horizon for which various pieces of company fundamentals are most relevant. As an example, consider the concept of return-on-equity. With a more traditional approach, we might choose a factor that is defined by trailing twelve-month operating income (as reported in the income statement) divided by total shareholders’ equity (as reported on the balance sheet). However, what if it turned out that what is really indicative of a company’s inherent value is not just last year’s return-on-equity but the average return-on-equity over several years and/or the consistency of return-on-equity over a different period of years? In the past, we would have had to construct hundreds of different factors, viewed over different time periods, to attempt to answer this type of question. Now, with deep learning, we can approach these questions more directly.
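To make the contrast concrete, here is what the traditional, hand-engineered version of those return-on-equity questions looks like. Every horizon and statistic must be chosen up front; the function names and five-year window are assumptions for this sketch, and an RNN would instead be free to learn such aggregations from the raw yearly inputs:

```python
import statistics

def roe_history_factors(operating_income, shareholders_equity):
    """Hand-engineered factors from an annual history of return-on-equity.

    Inputs are parallel lists of yearly values (oldest first). The field
    definitions and the five-year window are illustrative assumptions.
    """
    roe = [inc / eq for inc, eq in zip(operating_income, shareholders_equity)]
    return {
        "roe_latest": roe[-1],                      # last year's ROE
        "roe_mean_5y": statistics.mean(roe[-5:]),   # multi-year average
        "roe_std_5y": statistics.pstdev(roe[-5:]),  # consistency proxy
    }
```

Multiplying a handful of such statistics by dozens of fundamental items and time windows is how a factor library balloons into hundreds of candidates.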
The Source Data, Preprocessing, and Normalization
Despite the discussion above, it should be made clear that we do not feed the deep learning models the value for income statement and balance sheet items exactly as they are found in a company’s public filings. We still do a lot of preprocessing and data normalization to make companies’ fundamentals easily digestible by the learning process and to prevent the model from memorizing the profiles of specific companies during training.
The source of our data for this project is Standard & Poor’s Compustat database. From that, we selected the following source fields.
We then pre-processed the source data fields in a series of steps to generate the model input features. There are five categories of model input features: momentum features, valuation features, normalized fundamental features, year-over-year change in fundamental features, and missing value indicator features.
Momentum Features
Because we are attempting to predict the relative performance of a stock, it seems reasonable to provide the model with the stock’s relative past performance over varying time intervals. To do this, we calculate the percentile ranking, among all companies within the same month (time step), of the trailing 1-, 3-, 6-, and 9-month split-adjusted stock price change. This type of feature is typically referred to as “momentum.”
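A minimal sketch of one such momentum feature, using pandas and invented prices (the real computation runs over the full universe and all four horizons):

```python
import pandas as pd

# Hypothetical split-adjusted month-end prices (rows: months, cols: stocks).
prices = pd.DataFrame(
    {"AAA": [10.0, 11.0, 12.0, 14.0],
     "BBB": [20.0, 19.0, 21.0, 20.0],
     "CCC": [5.0, 6.0, 6.0, 9.0]},
)

def momentum_rank(prices, months):
    """Percentile rank (0-1], within each month, of the trailing
    `months`-month price change."""
    trailing_change = prices.pct_change(periods=months)
    # Rank across companies (axis=1) within the same time step.
    return trailing_change.rank(axis=1, pct=True)

mom3 = momentum_rank(prices, 3)
```

Ranking within the month, rather than using the raw return, keeps the feature on a fixed scale across very different market regimes (1974 and 1999 look the same to the model).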
Valuation Features
Again, because we are attempting to learn to predict the relative performance of a stock, it also seems reasonable to provide the model with the relative valuation of the stock as input. For this, we use two very common valuation metrics: Book-to-Market and Earnings Yield (the reciprocal of Price-to-Earnings). The raw forms of these features are calculated as follows:
From these two derived values we then compute their respective relative percentile rankings and use these values, along with the raw values, as feature inputs to the model during training.
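Putting the raw values and their rankings together for a single month might look like the following sketch. The input values are invented, and the ratio definitions shown are the common textbook ones (book value over market capitalization, and trailing earnings over market capitalization), which we assume match the formulas above:

```python
import pandas as pd

# Hypothetical single-month cross-section (all values invented).
df = pd.DataFrame({
    "book_value": [50.0, 120.0, 36.0],    # total common equity
    "earnings_ttm": [8.0, -2.0, 5.0],     # trailing twelve-month earnings
    "market_cap": [100.0, 400.0, 60.0],
}, index=["AAA", "BBB", "CCC"])

# Raw valuation features.
df["book_to_market"] = df["book_value"] / df["market_cap"]
df["earnings_yield"] = df["earnings_ttm"] / df["market_cap"]

# Relative percentile rankings within the month, kept alongside the raw values.
df["btm_rank"] = df["book_to_market"].rank(pct=True)
df["ey_rank"] = df["earnings_yield"].rank(pct=True)
```

One advantage of earnings yield over price-to-earnings here is that it remains well defined and correctly ordered when trailing earnings are negative, as for "BBB" above.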
Normalized Fundamental Features
As mentioned above, we don’t feed the fundamental fields to the model directly as reported in financial statements. Instead we normalize them. There are two reasons for this. The first is that neural network training, for technical reasons related to how the optimization algorithms work, simply behaves better when inputs are well distributed within a consistent range. The second