Why you should care
Contrary to current fads, data mining may not be knowledge discovery, but little more than noise discovery.
Ten years ago, British-American author and entrepreneur Chris Anderson made a provocative argument. “The new availability of huge amounts of data, along with the statistical tools to crunch these numbers,” means that “science can advance even without coherent models, unified theories or really any mechanistic explanation at all,” he argued. Simply put, “correlation supersedes causation,” he wrote in an article for Wired, where Anderson — now CEO of drone software firm 3D Robotics — was then editor-in-chief.
The revolution Anderson envisioned is playing out at full speed, driven by artificial intelligence. Data miners are ransacking Big Data looking for patterns and making decisions based on these discovered patterns. But just how useful are these strategies? And what are their limitations?
Let’s look at an example where the model’s success or failure could in some cases be the difference between life and death. Recently, I helped a data-savvy police department in Minnesota data-mine the Facebook accounts of local residents to see if surges and slumps in the use of certain words might be helpful in predicting criminal activity. They started their investigation by identifying the 100 most popular nouns, 50 most popular adjectives and 50 most popular adverbs in the English language.
Then they collected daily data for 10 weeks on the frequency with which each of these 200 words was used in status updates on Facebook in their jurisdiction and the number of burglaries committed the next day. All the data were scaled to equal 100 at the start of the study, so a value of 101 means 1 percent more than initially, and 99 means 1 percent less. They found that the two most helpful words for predicting burglaries were “day” and “most.” Figure 1 shows the close correspondence between actual burglaries and those predicted by their computer model using these two words. The correlation between predicted and actual burglaries was 0.96.
The number of burglaries more than doubled during this 10-week period, and the Facebook-word model accurately predicted this increase and most of the smaller ups and downs. We might leave it at that since, according to Anderson, we supposedly don’t need coherent models, unified theories or any mechanistic explanation for why these two specific words are so useful in predicting burglaries.
Or we might invent explanations — or what’s called knowledge discovery. Instead of testing theories with data, we let the data discover theories. Perhaps the word “day” shows up because people are communicating about when to commit the burglary. The word “most” might be used to communicate information about the target, tools or something else. Or perhaps “day” and “most” are code words used by local burglars, and the police have broken the code.
OK, time for a confession: I didn’t work with any police department on this. And while I did use 200 popular nouns, adjectives and adverbs, I didn’t use real burglary data or real Facebook data. I created all the data using a computer’s random number generator. The burglary data and the data for each word started at 100 and then followed a random walk by going up or down each imaginary day, using random, independent draws from a normal distribution. There is no real relationship between the data for each word and for the “burglaries” because they were all generated independently.
But there were several coincidental correlations between the made-up word data and the made-up burglary data. The two words with the strongest correlation just happened to have the randomly assigned labels “day” and “most.” So what’s the lesson?
When a data miner sifts through enough useless data, some strong, but totally coincidental, statistical patterns are bound to be discovered. If useless data can make accurate predictions, then accurate predictions are not enough to demonstrate that the data are useful.
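The experiment can be sketched in a few lines of Python. This is a rough reconstruction, not the code behind the figures: the step size (step_sd), the seed, and the rule of picking the two words with the largest individual correlations before fitting a least-squares line are all assumptions made for illustration.

```python
import numpy as np

def random_walks(rng, n_series, n_days, start=100.0, step_sd=3.0):
    """Return an (n_days, n_series) array of independent Gaussian random walks.

    step_sd is an assumed value; the article does not state the step size.
    """
    steps = rng.normal(0.0, step_sd, size=(n_days, n_series))
    steps[0] = 0.0  # every series begins exactly at `start`
    return start + np.cumsum(steps, axis=0)

def correlations(words, target):
    """Pearson correlation of each column of `words` with `target`."""
    w = words - words.mean(axis=0)
    t = target - target.mean()
    return (w * t[:, None]).sum(axis=0) / np.sqrt((w ** 2).sum(axis=0) * (t ** 2).sum())

rng = np.random.default_rng(0)           # arbitrary seed
n_days, n_words = 70, 200                # 10 weeks of daily data, 200 "words"

words = random_walks(rng, n_words, n_days)
burglaries = random_walks(rng, 1, n_days)[:, 0]   # generated independently of the words

# Data-mine: rank the words by how strongly they happen to track burglaries.
r = correlations(words, burglaries)
best_two = np.argsort(-np.abs(r))[:2]

# Fit burglaries on the two selected words by ordinary least squares.
X = np.column_stack([np.ones(n_days), words[:, best_two]])
coef, *_ = np.linalg.lstsq(X, burglaries, rcond=None)
in_sample_r = np.corrcoef(X @ coef, burglaries)[0, 1]

print("strongest single-word |correlation|:", round(float(np.abs(r).max()), 2))
print("in-sample correlation of two-word model:", round(float(in_sample_r), 2))
```

Changing the seed changes which two columns win, but with 200 candidates to choose from, some pair nearly always fits the made-up burglary series closely.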
Figure 2 shows how this Facebook word model worked over the next 10 weeks. The predicted number of burglaries trended down, while the actual number of burglaries trended up. The day-to-day fluctuations in predicted and actual burglaries were sometimes in the same direction, other times the opposite. Overall, the correlation between the model’s predictions and actual burglaries was 0.04. The model was essentially useless, which is not a surprise because all of the fluctuations in burglaries and words were independently generated random numbers.
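The same check can be repeated many times to see that the collapse is typical rather than a fluke of one run. The sketch below is illustrative only: the function name mine_and_test, the step size, the seed and the 50 repetitions are assumptions, and the word-selection rule (the two largest individual correlations) is one plausible choice rather than a documented method.

```python
import numpy as np

def mine_and_test(rng, n_words=200, n_days=70, step_sd=3.0):
    """Mine one batch of random walks, then score the model on the next n_days.

    Returns (in-sample correlation, out-of-sample correlation).
    """
    # One extra column holds the made-up burglary series.
    steps = rng.normal(0.0, step_sd, size=(2 * n_days, n_words + 1))
    steps[0] = 0.0  # every series starts at 100
    walks = 100.0 + np.cumsum(steps, axis=0)
    words, burglaries = walks[:, :-1], walks[:, -1]
    first, later = slice(0, n_days), slice(n_days, 2 * n_days)

    # Select the two words that best track burglaries in the first 10 weeks.
    r = np.array([np.corrcoef(words[first, j], burglaries[first])[0, 1]
                  for j in range(n_words)])
    best = np.argsort(-np.abs(r))[:2]

    # Fit on the first 10 weeks only.
    X_first = np.column_stack([np.ones(n_days), words[first][:, best]])
    coef, *_ = np.linalg.lstsq(X_first, burglaries[first], rcond=None)

    # Apply the frozen coefficients to the following 10 weeks.
    X_later = np.column_stack([np.ones(n_days), words[later][:, best]])
    r_in = np.corrcoef(X_first @ coef, burglaries[first])[0, 1]
    r_out = np.corrcoef(X_later @ coef, burglaries[later])[0, 1]
    return r_in, r_out

rng = np.random.default_rng(1)  # arbitrary seed
results = np.array([mine_and_test(rng) for _ in range(50)])
mean_in, mean_out = results.mean(axis=0)

print("average in-sample correlation:    ", round(float(mean_in), 2))
print("average out-of-sample correlation:", round(float(mean_out), 2))
```

Individual out-of-sample correlations can land anywhere between -1 and 1, because two unrelated random walks often drift together or apart for a while. Only the average across many repetitions shows that the mined model carries no predictive information.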
I am not saying that Facebook data are useless or, more generally, that statistical patterns are worthless. What I am saying is that data mining may not be knowledge discovery, but noise discovery.
This phony Facebook/burglary model is distressingly similar to data-mining models being discovered and relied on all over the world. Facebook words have been used to price auto insurance. Google search queries have been used to predict flu outbreaks. Twitter words have been used to predict stock prices. Sadly, most data-mined social media models are no more useful than using random numbers to predict burglaries.
Gary Smith is the Fletcher Jones professor of economics at Pomona College in Claremont, California. His most recent book is The AI Delusion, Oxford University Press, 2018.