In an age of ever-increasing data volumes, analysts and business people alike must become much more careful about how they approach questions and use statistics.
That was the message in Kaiser Fung’s riveting keynote address at Predictive Analytics World in Chicago. The VP of Business Intelligence at Internet video-sharing company Vimeo, Fung is also the author of “Numbers Rule Your World: The Hidden Influence of Probabilities and Statistics on Everything You Do.” In addition, he teaches business analytics and data visualization at New York University.
Quoting from his book, Fung said, “When more people are performing more analyses more quickly, there are more theories, more points of view, more complexity, more conflicts, and more confusion.”
In the coming five to 10 years, there will be “less clarity, less consensus, and less confidence,” he predicted.
For Fung, the solution to this worrisome situation is the development of what he calls “numbersense,” a problem-solving instinct that, he contends, is necessary before analysis begins.
Analysts with numbersense tend to have a plan, avoid distractions, and “recognize wrong turns early,” Fung said, and they are able to adapt their analytical strategy when new information arrives.
Fung said numbersense can help avoid common analytical problems. For instance, many such problems involve observational data, such as a Web server’s logs, that were collected for other purposes and are only used incidentally and after the fact by others, such as marketers, to try to uncover correlations and trends. But these observations are generally taken at face value and aren’t subjected to rigorous experimental design.
“When you are running an experiment, the type of questions you can actually address are, ‘What if we color the buttons red or green, or send the email an hour later, or insert a video?’ ” Fung said. “Instead, what generally happens is someone asks the data analyst to account for some unexpected observation, such as a decrease in traffic to the Web site.
“You know what happened, and now need to figure out why, which is the exact opposite of an experiment,” he added.
A second issue is the lack of legitimately random control groups, which makes attribution (“Our consumers bought this product because they saw the ESPN ad”) suspect. Marketers, Fung said, tend to attribute things like conversions to the “channels they can actually observe,” but they don’t consider the influence of unobserved channels.
In a world of observational data sets without control groups, “it is really important to be thinking about what would be the right control,” he said.
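To make the control-group point concrete, here is a minimal simulation sketch. It is not from Fung's talk: the population, the "fan" trait standing in for an unobserved channel, and every probability below are invented for illustration. It compares a naive exposed-versus-unexposed comparison in observational data with the same comparison when ad exposure is assigned at random.

```python
import random

def simulate(n=200_000, true_lift=0.02, seed=0):
    """Return (naive_lift, randomized_lift) for a hypothetical ad.

    All rates are made-up assumptions: 30% of users are "fans" who buy
    more often regardless of the ad and are also more likely to see it.
    """
    rng = random.Random(seed)
    # Each table maps exposed? -> [buyers, users]
    observed = {True: [0, 0], False: [0, 0]}
    randomized = {True: [0, 0], False: [0, 0]}

    for _ in range(n):
        fan = rng.random() < 0.3
        base = 0.10 if fan else 0.02  # purchase rate without the ad

        # Observational world: exposure depends on the hidden trait.
        saw_ad = rng.random() < (0.8 if fan else 0.2)
        bought = rng.random() < base + (true_lift if saw_ad else 0.0)
        observed[saw_ad][0] += bought
        observed[saw_ad][1] += 1

        # Experimental world: exposure is decided by a coin flip.
        assigned = rng.random() < 0.5
        bought = rng.random() < base + (true_lift if assigned else 0.0)
        randomized[assigned][0] += bought
        randomized[assigned][1] += 1

    def lift(table):
        exposed, control = table[True], table[False]
        return exposed[0] / exposed[1] - control[0] / control[1]

    return lift(observed), lift(randomized)

naive_lift, randomized_lift = simulate()
print(f"naive lift:      {naive_lift:.3f}")
print(f"randomized lift: {randomized_lift:.3f}")
```

Under these assumptions the naive comparison credits the ad with far more than the 2-point lift built into the simulation, because fans are overrepresented among people who saw it. The randomized comparison lands close to the built-in value.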
Likewise, problem solvers need to consider what data may be missing. Fung said analytical errors in big data stem from traits he sums up with the acronym OCCAM. The acronym stands for:
- Observational (recorded passively rather than through a designed experiment)
- Controls (lack of)
- Complete (seemingly)
- Adapted (data collected for one purpose and repurposed for another)
- Merged (joined datasets that make attribution even more difficult and introduce anomalies).
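One kind of anomaly that merging can introduce is easy to show with a toy example. The customers, amounts, and channel names below are hypothetical and were not part of the talk. When an orders table is joined to a log of marketing touches, any customer with more than one touch has their order counted more than once.

```python
# Hypothetical data: one order per customer, one or more marketing touches.
orders = [("alice", 100), ("bob", 50)]
touches = [("alice", "email"), ("alice", "video"), ("bob", "email")]

# A plain join on customer name, as a SQL inner join would produce.
joined = [
    (customer, amount, channel)
    for customer, amount in orders
    for touched_customer, channel in touches
    if customer == touched_customer
]

true_revenue = sum(amount for _, amount in orders)
joined_revenue = sum(amount for _, amount, _ in joined)

# Revenue "attributed" to each channel after the join.
revenue_by_channel = {}
for _, amount, channel in joined:
    revenue_by_channel[channel] = revenue_by_channel.get(channel, 0) + amount

print(true_revenue)        # 150
print(joined_revenue)      # 250
print(revenue_by_channel)  # {'email': 150, 'video': 100}
```

The channel totals add up to more than was actually sold, because Alice's single order appears once for each of her two touches. An analyst who takes the merged table at face value would overstate revenue and credit both channels in full.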
“Most of the big challenges today are not really about the amount of data,” he said. “The amount of data can always be solved by more storage and faster processors, and so on. But a lot of these problems I am pointing out are endemic across all big data analysis.”
College and university programs that are now pumping out data professionals in record numbers may not be helping either, Fung said, because these programs focus on teaching statistical techniques, such as computing standard deviations, rather than training students to spot and address the right problems.
The right problems, Fung said, often don’t have a single correct answer. They also involve lots of uncertainty, require assumptions (or figuring out what the assumptions were), lack complete information, or have problems with the data itself.
For instance, Fung said some of his own New York University students give accurate answers but fail to see fundamental problems with the dataset itself. Other students notice the problem but don’t feel “uneasy” about it. A third group spots the abnormality and tries to explain it. A fourth group, he said, notices the problem, feels uneasy about it, and tries to explain it. Crucially, that group then goes on to develop ways of working with the data that resolve the problem.
This last group, he said, has numbersense.
“Numbersense differentiates a good data analyst from a bad one,” Fung said. “Additionally, in the age of big data, this is also going to be an important skill for any citizen to develop.”