SAN FRANCISCO – Well-known stats geek Nate Silver surely surprised some in the audience at the Rich Data Summit when he reeled off a litany of big data misconceptions and missteps during his keynote last week. Silver himself is something of a poster child for big data insights, having correctly called the outcomes in 49 of 50 states in the 2008 presidential election. The following year, Time magazine named him one of The World’s 100 Most Influential People.
Silver, who runs the polling aggregation site FiveThirtyEight, says big data can be enormously useful, but it’s often misunderstood.
His first of several examples was a 2008 article in Wired that described big data as a computational problem. “The article gives the impression that if you have millions, or billions or trillions of observations and powerful enough computers, then eventually you can make correlations and discover truth by brute force alone,” said Silver.
The article and other reporting he’s seen convey the idea that big data is magic. “You get your data, you press a button and all of a sudden you have extremely valuable output. This idea is very wrong and dangerous.”
He drew some laughs with the observation that if big data insights were in fact so easy to attain, the data scientists at the conference would be unemployed.
Silver noted that, as with earlier technologies, interest in big data is starting to peak, with more references lately to “data science” than to “big data.”
“It’s to the point that big data is being dismissed as over-hyped, and that’s a good thing,” he said.
Silver said that is because technologies we now take for granted as valuable were often ridiculed on their way to legitimacy. He referred to a 1979 article in the New York Times that said a computer in the home isn’t that useful and that claims to the contrary are just hype. In the mid-1980s, articles and reports spoke of a “productivity paradox” in which some researchers asserted they could find no evidence that information technology helped worker productivity or contributed to economic growth.
“Sometimes it takes 10 to 15 years (for new technology) to really pay dividends,” said Silver.
While he’s a believer in big data, Silver said it needs to be leveraged in the right way. “Just collecting more data can get you more ways to fool yourself. Data scientists aren’t interested in data for data’s sake. We are interested in relationships,” he said.
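Silver's warning that more data can mean more ways to fool yourself can be illustrated with a small simulation. The sketch below is not from his talk; it is a minimal illustration under stated assumptions: every variable is pure random noise, the function name `strongest_noise_correlation` is hypothetical, and the sample size and seed are arbitrary. It tracks the strongest correlation found between a noise "target" and a growing pool of noise variables.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def strongest_noise_correlation(n_obs, checkpoints, seed=0):
    """Hypothetical helper: scan random variables against a random target.

    Returns {k: strongest |r| seen among the first k noise variables}.
    Nothing here is real data; any "relationship" found is coincidence.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    results = {}
    for k in range(1, max(checkpoints) + 1):
        feature = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(target, feature)))
        if k in checkpoints:
            results[k] = best
    return results


results = strongest_noise_correlation(n_obs=30, checkpoints=(10, 100, 1000))
for k, r in results.items():
    print(f"{k:>5} noise variables -> strongest |r| = {r:.2f}")
```

Because each larger pool contains the smaller one, the strongest correlation can only stay the same or rise as more variables are scanned, even though none of them is related to the target. That is the brute-force trap in miniature: the more columns you search, the more convincing the best coincidence looks.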
There’s also a human element. Politicians, for example, often “lead with their gut” when championing an issue and then “dress it up with data based on the assumption they started with,” said Silver.
“I think the reverse approach is what’s needed. You should be 80 percent as analytical as possible at the start and then work out the rest.”
Data is Messy and Noisy
And then there’s the issue of when to report data at all. Silver said it’s not uncommon for data analyses to conflict, and he feels you are sometimes better off reporting nothing until clearer results emerge. For example, he said one set of data showed Uber was doing well at attracting African American passengers in New York.
“But if you did the regression analysis another way, Uber was doing terrible at attracting those passengers. When you see something like that, you probably shouldn’t publish anything,” he said. “Data is a lot messier and noisier than people want to acknowledge.”
Also, Silver said, as we analyze and interpret data, it’s important to recognize that we all have biases. He cited an example from Sheryl Sandberg’s book Lean In, which showed that, in industries like tech, male job candidates usually were picked over female candidates, even though they had identical resumes.
“When you ask people, the ones who say they are gender blind tend to be the ones who have more bias than others,” said Silver. “Be aware that having diversity of thought in your organization is going to be really helpful. Be willing to change your mind when you are wrong. If you don’t have an organization that rewards honest feedback and discussion, then none of the technology you have is going to matter.”
Veteran technology reporter David Needle is based in Silicon Valley, where he covers mobile, enterprise, and consumer topics.