In a recent meeting with roughly 100 CIOs and other IT executives of Global 2000 companies, the topic of data scientists came up. Everyone seemed to be looking for data scientists, and everyone agreed that finding talent was tough, a conclusion supported by recent NewVantage Partners survey data. Intriguingly, most also felt that a newly hired data scientist would lack the business context for asking the right questions of enterprise data.
In other words, enterprises may be looking in the wrong place for data science talent, setting up their Big Data projects to fail.
Part of the problem lies in the very name “Big Data.” Enterprises become so intent on the sheer volume of data being collected that they lose sight of the much more essential act of intelligently querying the data for insights. The goal of the data scientist isn’t to ask bigger questions, but rather to ask better questions.
To do this, context is key.
Gartner analyst Svetlana Sicular highlighted this in her analysis of rising demand for data scientists, arguing that “Organizations already have people who know their own data better than mystical data scientists.” As such, enterprises should look within for expertise because “Learning Hadoop is easier than learning the company’s business.”
And yet so many don’t, despite the crushing need for data talent. According to a 2011 McKinsey Global Institute report, the United States must boost its output of data-savvy graduates by 60 percent: roughly 500,000 data science jobs await, and the U.S. will be 190,000 qualified data scientists short by 2018.
Many new data professionals are expected to come from graduating students, as EMC found in its survey, depicted in the pie chart above. While this is a good long-term source for talent, the better source today is behind the firewall of one’s enterprise.
This is the best way to ensure an enterprise’s data science team is deeply integrated into the business, rather than a foreign body that “gets data but not our business.” As renowned statistician Nate Silver argues in his book The Signal and the Noise:
“[N]umbers have no way of speaking for themselves. We speak for them. … If the quantity of information is increasing by 2.5 quintillion bytes per day, the amount of useful information almost certainly isn’t. Most of it is just noise, and the noise is increasing faster than the signal. There are so many hypotheses to test, so many data sets to mine–but a relatively constant amount of objective truth.”
In other words, more data equals more noise, not necessarily more signal. To get at the “truth” in our enterprise data, we need to be equipped to ask the right sort of questions, which generally means being familiar with the business itself, and not merely with abstract data.
The best data scientists, then, and the best data science teams, will be those that function as a welcome extension of existing teams, rather than an outside body holding court on enterprise data.
Is there a data scientist shortage? Perhaps. But that may be because we keep looking in the wrong place: outside our own organizations.
Matt Asay is vice president of corporate strategy at 10gen, the company behind the MongoDB NoSQL database. With more than a decade spent in open source, Matt is a recognized open source advocate and board member emeritus of the Open Source Initiative.