Dan Dunn does not call himself a data scientist. Dunn, a product and operations manager at HubSpot, a maker of Internet marketing software based in Cambridge, Mass., refers to himself as someone who “pushes data out to as many people as possible” in his company of IT experts, business intelligence analysts and business leaders, to run experiments, refine hypotheses and—after much iteration and analysis—discover actionable business insights.
Discussion about the nature of the data scientist’s role (whether they should be trained as scientists, be domain experts in their organization’s work or be machine learning programmers) is likely to continue for some time as demand rises for people who can analyze unstructured datasets that grow in size and complexity.
But for the four participants at a recent panel discussion on “A Day in the Life of a Data Scientist,” there was consensus on the nature of their work: bring strategic leaders closer to the insights they seek—especially if they have not expressed their goals in terms that make sense from a data systems perspective. Collaborate with domain experts to assess their work and derive meaningful intelligence from the data. Educate them about how data analysis works, and how it can help them, when their theories and requests are not statistically or mathematically meaningful.
At the April 10 event, organized by the Massachusetts Technology Leadership Council at the IBM Innovation Center in Waltham, Mass., Dunn was joined by three other experts: Michael Kane, an associate research scientist at the Yale Center for Analytical Sciences; Mike Keohane, vice president of software engineering at OwnerIQ; and Ian Stokes-Rees, a research engineer at Harvard University. They discussed how they navigate a workload that has systems maintenance and programming performance on one side and, on the other, pent-up demand (and weighty expectations) from colleagues eager to see their data crunched.
In the Center Ring
Those colleagues know where to find their favorite data source, the panelists said. “A typical day for me is very interruption-driven, because as the person who knows where all the data is, you get a lot of questions about, like, ‘Dan, what is the relationship between this and that?’ and ‘Dan, do you think this is going to be true?’ and ‘Dan, do you think that’s not going to be true?’” Dunn said.
The key is to get to where data insight meets business need. “For our customers, it’s a question of whether we can help them convert them from a [website] visitor to a lead, and from a lead to an opportunity, and from an opportunity to a customer. And so, you’ve got this pile of data, and you say, ok, how can I go fishing through this data?”
At HubSpot, Dunn said, he will write some Hadoop jobs to get a good dataset, and then analyze it using tools such as Tableau’s visualization engine. “A lot of times I look at that and say, ‘Wow, there is absolutely nothing here.’ And I throw it away. And other times I look at it and I say, ‘Wow this is close,’ but I look at it but the actual piece of data that I need is still back in those back end files. And then I am running another Hadoop job to generate another dataset that is very similar to the last one but isn’t quite what I was looking for.”
At the end of the process, Dunn said he shares reports iteratively with “the business owners, the people who can steer me in the direction to know what the value to the business is. I also know it myself, but you always need to talk to those people. And in the end, I’ve got a refined report that I can either produce repeatedly, or one off, or hand to some engineers to be truly automated. That is my general work flow over a couple of months to generate a set of reports on a specific set of data.”
Keohane, whose Boston-based company makes targeting software for online advertisers, said he has two main jobs: one is to make sure “the big data machine,” including its business intelligence platform and data warehouse team, is running and producing jobs. The other is to connect those jobs to strategic goals. And that is a two-way street. “It’s really [about] understanding what the business stakeholder is looking to achieve,” Keohane said. “Not so much can we produce the dataset that is based on the questions they are asking but are they asking the right questions?”
The data scientist in business needs to cover both functions. “It’s a constant vigilance to get out of your seat, sit with the stakeholder, and understand the actionable next steps that they are taking as a result of consuming the data, so I think there is no typical day, but it’s being very concerned with mechanics but also the bigger picture,” he said.
The User Whisperer and the Tools and Systems Builder
Kane, the Yale research scientist, and Stokes-Rees, the Harvard research engineer, had similar experiences to share but with a different emphasis. Kane discussed how essential it was to collaborate with his domain experts: biologists and other clinical researchers who are eager to publish new data-rich studies. He said he works to engage the researchers in discussions about their work, about their insight into the spread of H5N1 avian flu in Egypt, for example, “to understand their narrative, and then think about that narrative as I am doing data exploration.”
“Their ideas, they are usually not math savvy but they have tremendous intuition,” Kane said. “My job is usually to understand to try to get an idea of where their intuition comes from, and why they have this intuition about these biological processes, and then think about, how can I encode this as an analysis to further this research?”
Stokes-Rees described his role at Harvard’s engineering department as that of an engineer who is a service provider. “I am trying to design useful systems, to take techniques that I am developing as a computational scientist and I am trying to make those techniques generalizable and accessible to other people,” he said.
His goal extends beyond academic papers, he said, “to produce some systems and tools and web interfaces, web-based environments, or software packages that may be used by people in our community to be able to work with terabytes of data with people who are around the world. And users can do [so] in a user-driven way,” one that does not draw on his staff’s resources, he added.
Stokes-Rees, who recently joined Harvard’s engineering department after working as a data scientist at Harvard Medical School, said domain experts such as post-doctoral medical researchers are hungry for data-driven insights, so hungry that his introductory programming course would see 75 registrants 12 hours after it was offered. “It’s not an illusion that there’s an increased demand. People are desperate to figure out how to do good stuff with all the digital information that is available to them,” he said. “For all of these PhDs with lab-based skills, data is everywhere in their research. And the tools that are available are hard tools to use.”
The panelists each referred to their ongoing efforts to educate business stakeholders about the nature of data analysis. It is common for these internal customers to misunderstand what the analysis can do, and it can be necessary to correct them.
“If a stakeholder is asking the wrong question [a data scientist’s job is about] educating them on why you think it’s the wrong question and what is the right question to ask,” Keohane said during a discussion of this point.
“When I’m with a business user who for example might be interested in understanding click-through rates on a segment of our population, measured in absolute terms, I would say, ‘You don’t really want that because that’s not going to tell you much. How about we index that against the universe of populations and you will get an index value instead of a click-through value.’ There’s some education in bringing them along on that mindset. I think as long as you do so they start to think differently about the next question they ask.”
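The indexing idea Keohane describes can be illustrated with a small calculation. What follows is a minimal sketch, not OwnerIQ’s method: it assumes the index is the segment’s click-through rate divided by the rate for the whole population, scaled so that 100 means “same as everyone.” The function names, the scale of 100 and every number are hypothetical, invented for illustration.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Absolute click-through rate: clicks per impression."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions


def ctr_index(segment_ctr: float, baseline_ctr: float) -> float:
    """Segment rate relative to a baseline rate; 100 means 'same as baseline'.

    The scale of 100 is an assumed convention, not something stated by the panel.
    """
    if baseline_ctr <= 0:
        raise ValueError("baseline_ctr must be positive")
    return 100.0 * segment_ctr / baseline_ctr


# Hypothetical figures, made up for this example.
segment = click_through_rate(clicks=30, impressions=10_000)     # 0.3 percent
everyone = click_through_rate(clicks=200, impressions=100_000)  # 0.2 percent

print(round(ctr_index(segment, everyone)))  # 150
```

On its own, a rate of 0.3 percent says little. An index of 150 says the segment clicks about one and a half times as often as the population at large, which is the kind of comparison a business user can act on.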