‘Rogue Algorithms’ And The Dark Side Of Big Data by Knowledge@Wharton
Most of us, unless we’re insurance actuaries or Wall Street quantitative analysts, have only a vague notion of algorithms and how they work. Yet they affect our daily lives considerably. An algorithm is a set of instructions that a computer follows to solve a problem. The hidden algorithms of Big Data might connect you with a great music suggestion on Pandora, a job lead on LinkedIn or the love of your life on Match.com.
These mathematical models are supposed to be neutral. But former Wall Street quant Cathy O’Neil, who had an insider’s view of algorithms for years, believes that they are quite the opposite. In her book, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, O’Neil says these WMDs are ticking time-bombs that are well-intended but ultimately reinforce harmful stereotypes, especially of the poor and minorities, and become “secret models wielding arbitrary punishments.”
Models and Hunches
Algorithms are not the exclusive focus of Weapons of Math Destruction. The focus is more broadly on mathematical models of the world — and on why some are healthy and useful while others grow toxic. Any model of the world, mathematical or otherwise, begins with a hunch, an instinct about a deeper logic beneath the surface of things. Here is where the human element, and our potential for bias and faulty assumptions, creeps in. To be sure, a hunch or working thesis is part of the scientific method. In this phase of inquiry, human intuition can be fruitful, provided there is a mechanism by which those initial hunches can be tested and, if necessary, corrected.
O’Neil cites the new generation of baseball metrics (a story told in Michael Lewis’s Moneyball) as a healthy example of this process. Moneyball began with Oakland A’s General Manager Billy Beane’s hunch that conventional performance metrics such as runs batted in (RBIs) were overrated, while other, more obscure measures (like on-base percentage) were better predictors of overall success. Statistician Bill James began crunching the numbers and putting together models that Beane could use in his decisions about which players to acquire and hold onto, and which to let go.
While sports enthusiasts love to debate the issue, this method of evaluating talent is now widely embraced across baseball, and gaining traction in other sports as well. The Moneyball model works, O’Neil says, for a few simple reasons. First, it is relatively transparent: Anyone with basic math skills can grasp the inputs and outputs. Second, its objectives (more wins) are clear, and appropriately quantifiable. Third, there is a self-correcting feedback mechanism: a constant stream of new inputs and outputs by which the model can be honed and refined.
Where models go wrong, the author argues, all three healthy attributes are often lacking. The calculations are opaque; the objectives attempt to quantify that which perhaps should not be; and feedback loops, far from being self-correcting, serve only to reinforce faulty assumptions.
WMDs on Wall Street
After earning a doctorate in mathematics at Harvard and then teaching at Barnard College, O’Neil got a job at the hedge fund D.E. Shaw. At first, she welcomed the change of pace from academia and viewed hedge funds as “morally neutral — scavengers in the financial system, at worst.” Hedge funds didn’t create markets like those for mortgage-backed securities, in which complicated derivatives played a key part in the financial crisis — they just “played in them.”
But as the subprime mortgage crisis spread, and eventually engulfed Lehman Bros., which owned a 20% stake in D.E. Shaw, the internal mood at the hedge fund “turned fretful.” Concern grew that the scope of the looming crisis might be unprecedented — and something that couldn’t be accounted for by their mathematical models. She eventually realized, as did others, that math was at the center of the problem.
The cutting-edge algorithms used to assess the risk of mortgage-backed securities became a smoke screen. Their “mathematically intimidating” design camouflaged the true level of risk. Not only were these models opaque; they lacked a healthy feedback mechanism. Importantly, the risk assessments were verified by credit-rating agencies that collected fees from the same companies that were peddling those financial products. This was a mathematical model that checked all the boxes of a toxic WMD.
Disenchanted, O’Neil left Shaw in 2009 for RiskMetrics Group, which provides risk analysis for banks and other financial services firms. But she felt that people like her who warned about risk were viewed as a threat to the bottom line. A few years later, she became a data scientist for a startup called Intent Media, analyzing web traffic and designing algorithms to help online companies maximize e-commerce. O’Neil saw disturbing similarities in the use of algorithms in finance and Big Data.
In both worlds, sophisticated mathematical models lacked truly self-correcting feedback. They were driven primarily by the market. So if a model led to maximum profits, it was on the right track. “Otherwise, why would the market reward it?” Yet that reliance on the market had produced disastrous results on Wall Street in 2008. Without countervailing analysis to ensure that efficiency was balanced with concern for fairness and truth, the “misuse of mathematics” would only accelerate in hidden but devastating ways. O’Neil left the company to devote herself to providing that analysis.
Misadventures in Education
Ever since the passage of the No Child Left Behind Act in 2002 mandating expanded use of standardized tests, there has been a market for analytical systems to crunch all the data generated by those tests. More often than not, that data has been used to try to identify “underperforming” teachers. However well-intentioned, O’Neil finds these models promise a scientific precision they can’t deliver, victimizing good teachers and creating incentives for behavior that does nothing to advance the cause of education.
In 2009, the Washington, D.C. school system implemented a teacher assessment tool called IMPACT. Using a complicated algorithm, IMPACT measured the progress of students and attempted to isolate the extent to which their advance (or decline) could be attributed to individual teachers. The lowest-scoring teachers each year were fired, even when they had received excellent evaluations from parents and the principal.
O’Neil examines a similar effort to evaluate teacher performance in New York City. She profiles a veteran teacher who scored a dismal 6 out of 100 on the new test one year, only to rebound the next year to 96. One critic of the evaluations found that, of teachers who had taught the same subject in consecutive years, 1 in 4 registered a 40-point difference from year to year.
There is little transparency in these evaluation models, O’Neil writes, making them “arbitrary, unfair, and deaf to appeals.” Whereas a company like Google has the benefit of large sample sizes and constant statistical feedback that allow it to identify and correct errors immediately, teacher evaluation systems attempt to render judgments based on annual tests of just a few dozen students. Moreover, there is no way to assess mistakes. If a good teacher is wrongly fired and goes on to be a great teacher at another school, that “data” is never accounted for.
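The sample-size point can be illustrated with a toy simulation. The sketch below is not the IMPACT or New York City model. The function names, the noise level of 10 points, and the sample sizes of 25 and 900 students are all assumptions chosen for illustration. It holds a teacher's true effect fixed and measures how much the class average wanders from year to year purely because of noise.

```python
import random
import statistics


def class_average(true_effect, n_students, noise_sd, rng):
    """Average measured gain for one class: a fixed teacher effect plus per-student noise."""
    gains = [rng.gauss(true_effect, noise_sd) for _ in range(n_students)]
    return statistics.fmean(gains)


def spread_of_averages(n_students, trials=1000, noise_sd=10.0, seed=0):
    """Standard deviation of the class average across many simulated years.

    The teacher's true effect is held at zero throughout, so any spread
    is measurement noise, not a change in teaching quality.
    """
    rng = random.Random(seed)
    averages = [class_average(0.0, n_students, noise_sd, rng) for _ in range(trials)]
    return statistics.stdev(averages)


# Hypothetical sample sizes: one classroom versus a much larger pool.
small_sample = spread_of_averages(n_students=25)
large_sample = spread_of_averages(n_students=900)

print(f"year-to-year spread with 25 students:  {small_sample:.2f}")
print(f"year-to-year spread with 900 students: {large_sample:.2f}")
```

Under these assumptions the spread should come out near 10/√25 = 2 points for a single classroom and near 10/√900 ≈ 0.33 for the larger sample. The teacher is identical in every simulated year; only the number of students behind the average changes.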
In the Workplace
Teachers are hardly alone. In the face of slow growth, companies are looking everywhere for an edge. Because personnel decisions are among the most significant for a firm, “workforce management” has become big business – in particular, programs that screen potential employees and promise to take “the guesswork” out of hiring. Increasingly, these programs utilize personality tests in an effort to automate the hiring process. Consulting firm Deloitte estimates that such tests are used on 60% to 70% of prospective employees in the U.S., nearly double the figure from five years ago.
The prevalence of personality tests runs counter to research that consistently ranks them as poor predictors of future job performance. Yet they generate raw data that can be plugged into algorithms that provide an illusion of scientific precision, all in the service of an efficient hiring process. But as O’Neil writes, these programs lack transparency, and rejected applicants rarely know why they’ve been flagged, or even that they’ve been flagged at all. They also lack a healthy feedback mechanism: a means of identifying errors and using those mistakes to refine the system.
Once on the job, a growing number of workers are subject to another iteration of Big Data, in the form of scheduling software. Constant streams of data — everything from the weather to pedestrian patterns — can be used, for example, to optimize staffing at a Starbucks café. A New York Times profile of a single mother working her way through college as a barista explored how the new technology can create chaos, especially in the lives of low-income workers. According to U.S. government data, two-thirds of food service workers consistently get short-term notice of scheduling changes.
This instability can have far-reaching and insidious effects, O’Neil says. Haphazard scheduling can make it difficult to stay in school, keeping vulnerable workers in the oversupplied low-wage labor pool. “It’s almost as if the software were designed expressly to punish low-wage workers and keep them down,” she writes. And chaotic schedules have ripple effects on the next generation as well. “Young children and adolescents of parents working unpredictable schedules,” the Economic Policy Institute finds, “are more likely to have inferior cognition and behavioral outcomes.”
Following the exposé in the Times, legislation was introduced in Congress to rein in scheduling software, but didn’t go anywhere.
Crime and Punishment
Often, as with both educational reform and new hiring practices, the use of Big Data initially comes with the best of intentions. Recognizing the role of unconscious bias in the criminal justice system, courts in 24 states are using computerized models to help judges assess the risk of recidivism during the sentencing process. By some measures, according to O’Neil, these models represent an improvement. But by attempting to quantify and nail down with precision what are at root messy human realities, she argues, they create new problems.
One popular model includes a lengthy questionnaire designed to pinpoint factors related to the risk of recidivism. Questions might ask about previous police incidents. Given how much more frequently young black males are stopped by police, such a question can become a proxy for race, even when the intention is to reduce prejudice. Additional questions, such as whether the respondent’s friends or relatives have criminal records, would elicit an objection from a defense attorney if raised during a trial, O’Neil points out. But the opacity of these complicated risk models shields them from proper scrutiny.
Another trend is the use of crime prediction software to anticipate crime patterns, and adjust police deployment accordingly. But one underlying problem with WMDs, the author argues, is that they essentially become data hungry, confusing more data with better data. And in the case of crime prediction software, even though the stated priority is to prevent violent and serious crime, the data generated by petty “nuisance” crimes can overwhelm and essentially prejudice the system. “Once the nuisance data flows into a predictive model, more police are drawn into those neighborhoods, where they’re more likely to arrest more people.” These increased arrests seem to justify the policy in the first place, and in turn feed back into the recidivism models used in sentencing: a destructive and “pernicious feedback loop,” as O’Neil characterizes it.
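A toy simulation can show how such a loop confirms itself. The sketch below is not any real crime prediction product. The function name, the starting arrest counts, the patrol budget, and the arrest rate are all hypothetical. Two neighborhoods are given the same underlying arrest rate per patrol, but patrols are allocated in proportion to past recorded arrests.

```python
def simulate_patrol_loop(recorded_arrests, rounds=10, total_patrols=100, arrests_per_patrol=0.1):
    """Allocate patrols in proportion to past recorded arrests, then record new arrests.

    Both neighborhoods share the SAME arrests_per_patrol rate, so any
    difference in the records comes only from where police were sent.
    """
    records = list(recorded_arrests)
    for _ in range(rounds):
        total = sum(records)
        # More recorded arrests in the past means more patrols this round.
        patrols = [total_patrols * r / total for r in records]
        # More patrols means more recorded arrests, at an identical rate.
        new_arrests = [p * arrests_per_patrol for p in patrols]
        records = [r + a for r, a in zip(records, new_arrests)]
    return records


# Hypothetical history: neighborhood A starts with two extra nuisance arrests.
start = [12.0, 10.0]
end = simulate_patrol_loop(start)

share_start = start[0] / sum(start)
share_end = end[0] / sum(end)
gap_start = start[0] - start[1]
gap_end = end[0] - end[1]

print(f"A's share of recorded arrests: {share_start:.3f} -> {share_end:.3f}")
print(f"gap in recorded arrests:       {gap_start:.1f} -> {gap_end:.1f}")
```

In this toy setup, neighborhood A's share of recorded arrests never moves back toward one half, and the absolute gap between the two records keeps widening, even though the two neighborhoods behave identically. The system only ever collects data where it already chose to look, so nothing in its own records can contradict the original imbalance.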
The Cancer of Credit Scores
In the wake of a financial crisis that was at the very least exacerbated by loose credit, banks are understandably trying to be more rigorous in their assessment of risk. An early risk assessment algorithm, the well-known FICO score, is not without its problems; but for the most part, O’Neil writes, it is an example of a healthy mathematical model. It is relatively transparent; it is regulated; and it has a clear feedback loop. If default rates don’t jibe with what the model predicts, credit agencies can tweak the model.
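The kind of feedback check described here, comparing predicted default rates with what actually happened, can be sketched in a few lines. This is not FICO's actual procedure. The score bands, loan counts, and the 3-point tolerance are invented for illustration.

```python
def calibration_gaps(bands):
    """Return observed minus predicted default rate for each score band.

    bands maps a band name to (predicted_rate, defaults, loans).
    """
    return {
        name: defaults / loans - predicted
        for name, (predicted, defaults, loans) in bands.items()
    }


# Hypothetical portfolio: predicted default rate, observed defaults, loans issued.
bands = {
    "low score": (0.20, 260, 1000),
    "mid score": (0.08, 82, 1000),
    "high score": (0.02, 19, 1000),
}

gaps = calibration_gaps(bands)

# Flag any band where reality and prediction differ by more than 3 percentage points.
flagged = [name for name, gap in gaps.items() if abs(gap) > 0.03]

for name, gap in gaps.items():
    print(f"{name}: observed minus predicted = {gap:+.3f}")
print("bands needing a second look:", flagged)
```

With these made-up numbers only the low-score band is flagged, because its observed default rate (26%) runs well above the predicted 20%. The point is that the outcome the model predicts is eventually observed, so errors can surface and be corrected.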
In recent years, however, a new, pseudoscientific generation of scoring has proliferated wildly. “Today we’re added up in every conceivable way as statisticians and mathematicians patch together a mishmash of data, from our zip codes and internet surfing patterns to our recent purchases.” Crunching this data generates so-called “e-scores” used by countless companies to determine our creditworthiness, among other qualities. Yet unlike FICO scores, they are “arbitrary, unaccountable, unregulated, and often unfair.”
A huge “data marketplace” has emerged in which credit scores and e-scores are used in a variety of applications, from predatory advertising to hiring screening. In this sea of endless data, the author contends, the line between legitimate and specious measures has become hopelessly blurred. As one startup proclaims on its website, “All data is credit data.”
It’s all part of a larger process in which “we’re batched and bucketed according to secret formulas, some of them fed by portfolios loaded with errors.” According to the Consumer Federation of America, e-scores and other data are used to slice and dice consumers into “microsegments” and to target vulnerable groups with predatory pricing for insurance and other financial products.
And as companies gain access to GPS and other mobile data, the possibilities for this kind of micro-targeting will only grow. As insurance companies and others “scrutinize the patterns of our lives and our bodies, they will sort us into new types of tribes. But these won’t be based on traditional metrics, such as age, gender, net worth, or zip code. Instead, they’ll be behavioral tribes, generated almost entirely by machines.”
Reforming Big Data
In her conclusion, O’Neil argues we need to “disarm” the Weapons of Math Destruction, and that the first step for doing so is to conduct “algorithmic audits” to unpack the black boxes of these mathematical models. They are, again, opaque and impenetrable by design, and often protected as proprietary intellectual property.
Toward this end, Princeton University has launched WebTAP, the Web Transparency and Accountability Project. Carnegie Mellon and MIT are home to similar initiatives. In the end, O’Neil writes, we must realize that the mathematical models which have penetrated almost every aspect of our lives “are constructed not just from data but from the choices we make about which data to pay attention to… These choices are not just about logistics, profits, and efficiency. They are fundamentally moral.”