If you are out to describe the truth, leave elegance to the tailor.— Albert Einstein

Machine learning is everywhere now, from self-driving cars to Siri and Google Translate, to news recommendation systems and, of course, trading.

In the investing world, machine learning is at an inflection point. What was bleeding edge is rapidly going mainstream. It’s being incorporated into mainstream tools, news recommendation engines, sentiment analysis, stock screeners. And the software frameworks are increasingly commoditized, so you don’t need to be a machine learning specialist to make your own models and predictions.

If you’re an old-school quant investor, you may have been trained in traditional statistics paradigms and want to see if machine learning can improve your models and predictions. If so, then this primer is for you!

We’ll review the following items in this piece:

- How machine learning differs from statistical methods and why it’s a big deal
- The key concepts
- A couple of nuts-and-bolts examples to give a flavor of how it solves problems
- Code to jumpstart your own mad science experiments
- A roadmap to learn more about it.

Even if you’re not planning to build your own models, AI tools are proliferating, and investors who use them will want to know the concepts behind them. And machine learning is transforming society with huge investing implications, so investors should know basically how it works.

Let’s dive in!

**Why machine learning?**

Machine learning can be described as follows:

- A highly empirical paradigm for inference and prediction
- Works pretty well with many noisy predictors
- Can generalize to a wide range of problems.

In school, when we studied modeling and forecasting, we were probably studying statistical methods. Those methods were created by geniuses like Pascal, Gauss, and Bernoulli. Why do we need something new? What is different about machine learning?

Short answer: Machine learning adapts statistical methods to get better results in an environment with much more data and processing power.

In statistics, if you’re a Bayesian, you start from a prior distribution about what the thing you’re studying probably looks like. You identify a parametric function to model the distribution, you sample data, and you estimate the parameters that yield the best model given the sample distribution and the prior.

If you’re not a Bayesian, you don’t think about a prior distribution, you just fit your model to the sample data. If you’re doing a linear regression, you specify a linear model and estimate its parameters to minimize the sum of squared errors. If you believe any of your standard errors, you take an opinionated view that the underlying data fit a linear model plus a normally distributed error. If your data violate the assumptions of ordinary least squares, those error estimates are misleading or meaningless.

I think we are all Bayesians, just to varying degrees. Even if we are frequentists, we have the power to choose the form of our regression model, but no power to escape the necessity of choice. We have to select a linear or polynomial model, and select predictors to use in our model. Bayesians take matters a step further with a prior distribution. But if you don’t assume a prior distribution, you are nevertheless taking an implied random distribution as a prior. If you assume all parameters are equally likely, your prior for the parameters is a uniform distribution. In the words of Rush, “If you choose not to decide, you still have made a choice.”

But if OLS (ordinary least squares) is BLUE (the best linear unbiased estimator), why do we need machine learning?

When we do statistics, we most often do simple linear regression against one or a small number of variables. When you start adding many variables, interactions, nonlinearities, you get a combinatorial explosion in the number of parameters you need to estimate. For instance, if you have 5 predictors and want to do a 3rd-degree polynomial regression, you have 56 terms, and in general (deg+nn)\binom{deg + n}{n}(?n?deg+n??).

In real-world applications, when you go multivariate, even if you are well supplied with data, you rapidly consume degrees of freedom, overfit the data, and get results which don’t replicate well out of sample. The curse of dimensionality multiplies outliers, and statistical significance dwindles.

The best linear unbiased estimator is just not very good when you throw a lot of noisy variables at it. And while it’s straightforward to apply linear regression to nonlinear interactions and higher-order terms by generating them and adding them to the data set, that means adding a lot of noisy variables. Using statistics to model anything with a lot of nonlinear, noisy inputs is asking for trouble. So economists look hard for ‘natural experiments’ that vary only one thing at a time, like a minimum wage hike in one section of a population.

When the discipline of statistics was created, Gauss and the rest didn’t have the computing power we have today. Mathematicians worked out proofs, closed form solutions, and computations that were tractable with slide rules and table lookups. They were incredibly successful considering what they had to work with. But we have better tools today: almost unbounded computing resources and data. And we might as well use them. So we get to ask more complicated questions. For instance, what is the nonlinear model that best approximates the data, where ‘best’ means it uses the number of degrees of freedom that makes it optimally predictive out-of-sample?

More generally, how do you best answer any question to minimize out-of-sample error? Machine learning asks, what are the strongest statements I can make about some data with the maximum of cheap tricks, in the finest sense of the word. Given a bunch of data, what is the best fitting nonlinear smoothing spline with *x* degrees of freedom or knots? And how many knots / degrees of freedom give you the best bang for the buck, the best tradeoff of overfitting v. underfitting?

In machine learning, we do numerical optimizations, whereas in old-school statistics we solved a set of equations based on an opinionated view of what ‘clean’ data look like.

Those tricks, with powerful CPUs and GPUs, new algorithms, and careful cross-validation to prevent overfitting, are why machine learning feels like street–fighting statistics. If it works in a well-designed test, use it, and don’t worry about proofs or elegance.

Pure statisticians might turn up their nose at a perceived lack of mathematical elegance or rigor in machine learning. Machine learning engineers might reply that statistical theory about an ideal random variable may be a sweet science, but in the real world, it leads you to make unfounded assumptions that the underlying data don’t violate the assumptions of OLS, while fighting with one hand tied behind your back.

In statistics, a lot of times you get not-very-good answers and know the reason: your data do not fit the assumptions of OLS. In machine learning, you often get better answers and you don’t really know why they are so good.

There is more than one path to enlightenment. One man’s ‘data mining’ is another’s reasoned empiricism, and letting the data do the talking instead of leaning on theory and a prior about what data should look like.

**A 30,000 foot view of machine learning algorithms**

In statistics, we have descriptive and inferential statistics. Machine learning deals with the same problems, uses them to attack higher-level problems like natural language, and claims for its domain any problem where the solution isn’t programmed directly, but is mostly learned by the program.

*Supervised learning* – You have *labeled* data: a sample of ground truth with *features* and