Baseball, Bias And Decision-Making
Wharton’s Etan Green discusses his research on baseball umpires and decision-making.
Bias is inherent in decision-making because human beings, despite their best efforts, will always bring their point of view into the game. That’s certainly the case with baseball umpires. Research by Etan Green, a Wharton professor of operations, information and decisions, shows how umpires adjust their calls based on what information they think they know when a play is in motion. The research, “What Does it Take to Call a Strike? Three Biases in Umpire Decision Making,” which was co-authored with Stanford’s David P. Daniels, has implications for the accuracy of decision-making in realms outside of sports. Green explained to Knowledge@Wharton why “opportunities for statistical discrimination are everywhere.”
An edited transcript of the conversation follows.
Knowledge@Wharton: Can you give us an overview of your research?
Etan Green: Generally, what I’m interested in is decision-making by experts, particularly in realms for which there are predictions available from machines, machine-based models, and algorithms. In the case of umpires in baseball, there are data from stereoscopic cameras behind home plate in every major league ballpark that we use to benchmark the calls that umpires make. This is a great setting for studying decision-making by experts because we have experts who are supposed to abide by a very specific decision rule. The pitcher throws a pitch. If the pitch is in this imaginary box, the official strike zone, defined by the width of home plate on the ground and the batter’s stance, then the umpire is supposed to call a strike. Otherwise, he is supposed to call a ball.
What we do is use these data from the stereoscopic cameras, which take a sequence of images of every pitch from its release from the pitcher’s hand until it crosses the region above home plate, to basically observe to what extent the umpire abides by this decision rule to make his calls, based solely on the location of the pitch.
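The benchmark described above can be sketched in a few lines. This is only an illustration, not the authors' code: the `Pitch` fields, the sample values, and the helper names are all hypothetical, and the zone is simplified to a flat rectangle (the 17-inch width of home plate by a per-batter top and bottom), ignoring the ball's radius.

```python
from dataclasses import dataclass

# Home plate is 17 inches wide; express half of that width in feet.
PLATE_HALF_WIDTH_FT = 17 / 12 / 2


@dataclass
class Pitch:
    x: float             # horizontal location at the plate, feet from center
    z: float             # height at the plate, feet
    zone_bottom: float   # bottom of this batter's zone, feet (depends on stance)
    zone_top: float      # top of this batter's zone, feet
    called_strike: bool  # what the umpire actually called


def rule_says_strike(p: Pitch) -> bool:
    """Apply the location-only decision rule: inside the box means strike."""
    in_width = abs(p.x) <= PLATE_HALF_WIDTH_FT
    in_height = p.zone_bottom <= p.z <= p.zone_top
    return in_width and in_height


def agreement_rate(pitches) -> float:
    """Share of called pitches where the umpire matched the location-only rule."""
    agree = sum(rule_says_strike(p) == p.called_strike for p in pitches)
    return agree / len(pitches)


# Made-up pitches for illustration only (not real camera data).
sample = [
    Pitch(0.0, 2.5, 1.5, 3.5, True),    # middle of the zone, called a strike
    Pitch(1.0, 2.5, 1.5, 3.5, False),   # well off the plate, called a ball
    Pitch(0.5, 3.6, 1.5, 3.5, True),    # just above the zone, called a strike
    Pitch(-0.6, 1.6, 1.5, 3.5, False),  # just inside the zone, called a ball
]
print(agreement_rate(sample))
```

Grouping the disagreements by count, rather than pooling them, is what would reveal whether the deviations from the rule are systematic.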
I think the most interesting thing that comes out of the data is this deviation from that benchmark, which is very systematic. There is something in baseball called the count. The count keeps track of the sequence of pitches between a pitcher and a batter over the course of an at-bat. If the count reaches four balls, that’s good for the batter. He walks. If it reaches three strikes, that’s good for the pitcher. The batter strikes out.
What you see is that instead of the umpire just using the location of the pitch to make his calls, pitches at the same location are sometimes called balls or strikes, depending on the count. In particular, the strike zone expands dramatically when the count favors the batter. When the count favors the batter, the umpire responds by favoring the pitcher, and vice versa. When the count favors the pitcher, the umpire responds by favoring the batter.
… Basically, you can think about a pitch that crosses, say, the top boundary of the official strike zone. This pitch is in what I’ll call a baseline count, the count at the beginning of the at-bat, when there are zero balls and zero strikes. An umpire calls this pitch a strike 50% of the time, and he calls it a ball 50% of the time. You can think of him as being indifferent between a ball and a strike.
But when the count has three balls and zero strikes, when it strongly favors the batter, this pitch is almost always called a strike. The reverse is true in the opposite count, with zero balls and two strikes. The same pitch at the same location is almost always called a ball.
Knowledge@Wharton: Why does this happen?
Green: …There are potentially a number of stories that can explain this result. I’ll tell you about a particularly interesting and counter-intuitive one. The argument here is that what the umpire is doing is trading off accuracy for bias. Or rather, he’s trading off bias for accuracy. He’s being purposefully biased, consciously or unconsciously so. He’s varying the strike zone that he enforces with the count. He’s not making his decisions based solely on the location of the pitch. But the argument is, this actually helps him make more accurate calls.
Why is this the case? Imagine yourself as an umpire. You’re squatting behind the catcher; you’re looking out over his head towards the pitcher. The pitcher winds up and throws a 90-plus-mile-per-hour pitch. It’s there in an instant. It has some lateral movement, some vertical movement. You have to decide whether this pitch is inside or outside some imaginary box. It’s an incredibly difficult problem, and if you relied only on your observation of the location of the pitch, you’d probably make mistakes on a regular basis, frequently when the pitch is close. It would be hard to say whether it was just inside the strike zone or just outside the strike zone.
But fortunately for you, you have other information at your disposal. You have expectations that you’ve built up over many years of being a professional umpire, expectations about where the pitcher is going to throw in a certain count and whether the batter is going to swing. For instance, you might reasonably expect that when the count is three balls and zero strikes, that is, when it favors the batter, the pitcher is going to try to throw a strike. So, if the pitch is close and you’re unsure whether it was just inside the strike zone or just outside the strike zone, you may err on the side of calling a strike, the pitch that you expect.
Now, think about what happens in an 0-2 count. In this count, you expect that the batter is going to swing at anything close, because if he doesn’t, he runs the risk of striking out, whereas he can prolong the at-bat if he fouls the pitch off, for instance. Imagine you see a pitch that appears close to you, but the batter chooses not to swing. How can you rationalize that decision? Well, you can rationalize it by saying that he observed something that you didn’t; that his vantage point was such that he believed the pitch to be a ball. And so, you might err on the side of calling a ball.
This Bayesian updating, this basically rational way of processing other information that you have, creates this trade-off between bias and accuracy. It helps the umpires become more accurate at the cost of having them systematically change the strike zone that they enforce with this variable, the count, that has nothing to do with their directive.
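The trade-off can be made concrete with a toy simulation. This is a minimal sketch under stated assumptions, not the model from the paper: location is reduced to one signed distance from the zone boundary (positive means inside), the umpire's expectation for a given count is a normal prior with a hypothetical mean, and his perception adds normal noise. All parameter values and function names are invented for illustration.

```python
import random


def strike_threshold(prior_mean, prior_sd, noise_sd):
    """Observed location above which a Bayesian umpire calls a strike.

    With a normal prior and normal perception noise, the posterior mean of
    the true location is positive exactly when the observation exceeds
    -prior_mean * noise_sd**2 / prior_sd**2.
    """
    return -prior_mean * noise_sd**2 / prior_sd**2


def simulate(prior_mean, prior_sd=0.5, noise_sd=0.3, n=100_000, seed=0):
    """Return (naive accuracy, Bayesian accuracy) on the same simulated pitches."""
    rng = random.Random(seed)
    cutoff = strike_threshold(prior_mean, prior_sd, noise_sd)
    naive_correct = bayes_correct = 0
    for _ in range(n):
        true_loc = rng.gauss(prior_mean, prior_sd)   # where the pitch really is
        seen = true_loc + rng.gauss(0.0, noise_sd)   # what the umpire perceives
        is_strike = true_loc > 0
        naive_correct += (seen > 0) == is_strike       # location-only rule
        bayes_correct += (seen > cutoff) == is_strike  # rule shifted by the prior
    return naive_correct / n, bayes_correct / n


# Hypothetical prior means: positive when a strike is expected (3-0),
# zero at the baseline count, negative when a ball is expected (0-2).
for label, mean in [("3-0", 0.3), ("0-0", 0.0), ("0-2", -0.3)]:
    naive, bayes = simulate(mean)
    cutoff = strike_threshold(mean, 0.5, 0.3)
    print(f"{label}: cutoff={cutoff:+.3f} naive={naive:.3f} bayes={bayes:.3f}")
```

Under these assumptions, the cutoff moves outward when a strike is expected and inward when a ball is expected, so the enforced zone varies with the count, yet the shifted rule gets more calls right than the location-only rule. That is the bias-for-accuracy trade in miniature.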
Knowledge@Wharton: What would you say a business practitioner should take away from your research?
Green: I think what umpires are doing is statistically discriminating. They have a directive to make their calls based solely on the location of the pitch. But that’s very difficult to do. It’s very hard to observe the exact location every time. What they do instead is say, “Well, I have this other information that’s correlated with the location of the pitch. It can help me, on average, make more accurate calls.” As I said before, they basically trade off bias for accuracy.
Opportunities for statistical discrimination are everywhere. They’re everywhere in the workplace, in particular, in the hiring process. For instance, when we hire, we have a benchmark that sounds very similar to the umpire’s directive. We want to hire the best person, the person who’s going to do the best at the job, who is going to be the best fit. But it’s often hard in the interview process, looking at a CV or even interviewing them in person, to decide who’s the best. How good is this person? How good is this person going to be in the job? So, we may rely on other factors. Factors that are either implicitly or explicitly banned, that we shouldn’t be using, perhaps. But factors that we believe, perhaps rightly, as in the case of the umpires, or even erroneously, to give us information about this person’s fit. If we’re right, we’re going to get a little more accuracy, but it’s going to come at the cost of bias. It’s going to come at the cost of being able to systematically predict who we hire, based on factors that have nothing to do, at least directly, with the dimension that we’re trying to hire along.
Knowledge@Wharton: What are you going to look at next?
Green: Baseball is an opportunity to use machine-based models, these cameras, to say how good a job umpires are doing. But I think there are a lot of interesting cases in which the decisions that individuals make, that experts make, can be informed by algorithms, by machine-based predictions. One of the things that particularly interests me now is making predictions about the election. It’s particularly timely. I think a lot of us are interested in the probability that Hillary Clinton is going to win and the probability that Donald Trump will be our next president. One place you may go to get information about this is FiveThirtyEight, Nate Silver’s website. One thing that statistician Nate Silver is doing this election season that he hasn’t done in previous election seasons is providing multiple models. In the past, he told you the probability that Hillary Clinton would win is 77%. Now he’s telling you, “If you believe this model, it’s 72%. If you believe that model, it’s 84%.” And sometimes, there’s really quite a deviation between these two models.
What are these two models? Basically, they’re making different assumptions about the world. Your decision as to which model you listen to is really a decision about what you believe the data-generating process to be. What do you believe the world to look like? In particular, there’s one model that says we should only listen to the polls. We should only listen to what people are saying right now. There’s another model that says there are lots of predictors, economic indicators for instance, that historically have been very predictive of election outcomes. So, we should listen to those as well.
Your decision about which model to listen to or how to balance these two pieces of information comes down to your belief about whether this election season is totally different from the past, in which case you should only listen to the polls, or if you believe that this is just another draw in some stable distribution that is similar to everything else that’s come before.
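One simple way to picture that balancing act is a weighted average of the two forecasts, where the weight is your belief that this election resembles past ones. The sketch below is an assumption for illustration only: neither Green nor FiveThirtyEight is described here as combining models this way, the function name is hypothetical, and which model produced the 72% and which the 84% is not stated in the interview, so the assignment below is arbitrary.

```python
def blended_forecast(p_polls_only, p_with_fundamentals, weight_on_history):
    """Linearly blend two win probabilities.

    weight_on_history is the reader's belief (0 to 1) that the present is
    another draw from the same stable process as past elections.
    """
    if not 0.0 <= weight_on_history <= 1.0:
        raise ValueError("weight_on_history must be between 0 and 1")
    return ((1 - weight_on_history) * p_polls_only
            + weight_on_history * p_with_fundamentals)


# Hypothetical assignment of the two figures quoted in the interview.
for w in (0.0, 0.5, 1.0):
    print(w, blended_forecast(0.72, 0.84, w))
```

A weight of zero corresponds to believing this election is unlike anything before it, so only the polls count; a weight of one corresponds to trusting the historical predictors fully.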
What I’m interested in generally is: How can we frame questions? What types of information can we give people to make them think that this moment, the present, is just like the past, and that the past is a good predictor of the present? How can we frame questions to get people to think, “Actually, the process is not stationary at all. This moment is unique in time”?