Would skipping breakfast kill you? Not necessarily, but confusing correlation and causation might. The data show that skipping breakfast is indeed associated with heart disease, but that doesn’t mean breakfast deserves its reputation as the most important meal of the day.
Harvard University medical researchers found that American men between the ages of 45 and 82 who skipped breakfast had a 27 percent higher risk of coronary heart disease over a 16-year period. However, rather than being directly responsible for adverse health effects, skipping breakfast may simply be a proxy for an unhealthy lifestyle.
People who skip breakfast tend to lead more stressful lives. Participants in the Harvard study who skipped breakfast “were more likely to be smokers, to work full time, to be unmarried, to be less physically active, and to drink more alcohol,” researchers wrote in the report. In other words, the relationship between breakfast and health may not be one of cause and effect.
This is a perfect example of why a certain scientific mantra is often repeated: Correlation does not imply causation. Yet data scientists often confuse correlation and causation, succumbing to the temptation to over-interpret. And that can lead us to make some really bad decisions – which could place significant limits on the enormous value that lies in deriving predictions from data.
Predictive analytics draws on the growing availability of data to determine which factors indicate the most likely outcomes for people ranging from medical patients to criminals to employees. This practice has major implications for improving operations in healthcare, financial services, law enforcement, government, and manufacturing, among many other fields.
Yet there’s a real risk that advances in predictive analytics will be hampered by our overly interpretive minds. Stein Kretsinger, founding executive of Advertising.com, offers a classic example. As a graduate student in the early 1990s, Stein was leading a medical research meeting that was focused on assessing the factors that determine how long it takes to wean a person off a respirator. This was before the advent of PowerPoint, so Stein displayed the factors, one at a time, on overhead transparencies. The team of healthcare experts nodded their heads, offering one explanation after another for the relationships shown in the data.
But after going through several transparencies, Stein realized that he had been placing them with the wrong side up – thus displaying mirror images of his graphs that depicted the opposite of the true relationships between data points. After he flipped the transparencies to the correct side, the experts seemed just as comfortable as before, offering new explanations for what was now the exact opposite effect of each factor.
In other words, our thinking is malleable. People can readily find underlying theories to explain just about anything.
Consider the published medical study that found that women who happened to receive hormone replacement therapy showed a lower incidence of coronary heart disease. Could it be that a new treatment for this disease had been discovered?
Later, a proper controlled experiment showed that this conclusion was false. Instead, the current thinking is that the women who had access to hormone replacement therapy were more affluent and had better health habits overall. This sort of follow-up analysis is critical so that, in this case, women are not needlessly prescribed hormone replacement therapy for a condition that it does not treat.
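The confounding at work here can be sketched with a toy simulation. All the numbers below are invented purely for illustration, not taken from the study: affluence drives both who receives the therapy and who stays healthy, while the therapy itself has zero causal effect. The raw comparison still makes the therapy look protective; stratifying by the confounder makes the "effect" vanish.

```python
import random

random.seed(0)

# Hypothetical model (illustrative only): affluence is a confounder.
# Affluent women are both more likely to receive hormone therapy AND
# less likely to develop heart disease (via better health habits).
# Therapy has NO causal effect on disease in this model.
n = 100_000
rows = []
for _ in range(n):
    affluent = random.random() < 0.5
    therapy = random.random() < (0.7 if affluent else 0.2)
    disease = random.random() < (0.05 if affluent else 0.15)  # independent of therapy
    rows.append((affluent, therapy, disease))

def rate(rows, therapy_flag):
    """Disease rate among women with the given therapy status."""
    sub = [d for a, t, d in rows if t == therapy_flag]
    return sum(sub) / len(sub)

# Naive comparison: therapy *looks* protective...
print(f"disease rate with therapy:    {rate(rows, True):.3f}")
print(f"disease rate without therapy: {rate(rows, False):.3f}")

# ...but stratifying by the confounder removes the apparent effect.
for affluent in (True, False):
    strat = [(a, t, d) for a, t, d in rows if a == affluent]
    print(f"affluent={affluent}: with therapy {rate(strat, True):.3f}, "
          f"without {rate(strat, False):.3f}")
```

Within each affluence stratum the two disease rates come out essentially equal, even though the pooled comparison shows a large gap: exactly the pattern a controlled experiment would expose.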
Businesses also can mistake correlation for causation. For example, imagine an online car dealership that discovers that website visitors who use a price calculator are more likely to end up purchasing a vehicle. This insight helps inform their predictions: It might be wise to promote use of the price calculator, or to offer a discount to customers who didn’t use the price calculator, to increase the likelihood that they’ll buy a car. But it does not necessarily explain what factors influence customers’ decisions. It may be that eager, engaged consumers are naturally more inclined to explore the website’s features in general. So working to actively promote the price calculator wouldn’t necessarily help increase sales and could be a wasted effort.
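The dealership scenario can be sketched the same way. In this hypothetical model (all parameters invented for illustration), an unobserved trait, customer eagerness, drives both calculator use and purchases, while the calculator itself does nothing. The observational data show a large lift for calculator users, yet forcing the calculator on everyone leaves the overall purchase rate unchanged.

```python
import random

random.seed(1)

# Hypothetical model (illustrative only): "eagerness" drives both
# price-calculator use and purchase; the calculator has no causal effect.
def simulate(n, force_calculator=False):
    results = []
    for _ in range(n):
        eager = random.random() < 0.3
        used = force_calculator or (random.random() < (0.8 if eager else 0.1))
        bought = random.random() < (0.25 if eager else 0.02)  # eagerness only
        results.append((used, bought))
    return results

obs = simulate(100_000)
buy_if_used = sum(b for u, b in obs if u) / sum(1 for u, b in obs if u)
buy_if_not = sum(b for u, b in obs if not u) / sum(1 for u, b in obs if not u)
baseline = sum(b for _, b in obs) / len(obs)

# "Intervention": push every visitor into the calculator.
forced = simulate(100_000, force_calculator=True)
forced_rate = sum(b for _, b in forced) / len(forced)

print(f"observed buy rate, calculator users:    {buy_if_used:.3f}")
print(f"observed buy rate, non-users:           {buy_if_not:.3f}")
print(f"overall buy rate, no intervention:      {baseline:.3f}")
print(f"overall buy rate, calculator forced on: {forced_rate:.3f}")
```

The gap between the first two numbers is real and still useful for prediction (non-users really are less likely to buy), but the last two numbers match: intervening on the proxy does not move sales.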
Uber offers another useful example. The company discovered that, in San Francisco, more passengers request rides from areas with higher rates of prostitution, alcohol use, theft, and burglary. However, the company knows that crime itself is not necessarily causing this higher demand, even indirectly. Rather, their original hypothesis, even before the analysis, was that “crime should be a proxy for nonresidential population.” Higher-crime areas tend to have more people who don’t live in the immediate vicinity, so these people will need rides.
Prematurely jumping to conclusions about causality is bad science that leads to misinformed decisions, and the consequences could be a lot more worrisome than an unnecessary bowl of cereal in the morning. Luckily, avoiding this mistake is simple. Companies, researchers, and governments can use predictive analytics to drive some decisions – such as flagging patients who skip breakfast – so that healthcare providers can consider additional diagnostic or preventative measures. But we must avoid giving our gut instincts too much credit and understand that our conjectures about the root cause of a predictive discovery require further analysis.
Eric Siegel, Ph.D., is the founder of the Predictive Analytics World conference series, which covers both business and government deployment, executive editor of The Predictive Analytics Times, and a former computer science professor at Columbia University.
Subscribe to Data Informed for the latest information and news on big data and analytics for the enterprise.