To Scott Nicholson, the field of data science is like a playground with a purpose: to use data to understand how people make decisions, and then use that insight to help them make better data-driven decisions. The equipment is all there: the datasets, the technology tools, the statistical algorithms, and an unyielding sense of curiosity. When you add the drive to achieve results—and the ability to influence the means to reaching those results, in an iterative fashion—you have the whole package.
Nicholson, the chief data scientist at Accretive Health, a company that works with hospitals on payment systems and health care quality, set forth his vision for the data scientist role at the recent Predictive Analytics World conference in Boston. He also explained what makes a successful data science team member.
The term “data scientist” is likely to be refined over time, Nicholson said, noting that “software engineer” went from a general job category to a set of more specialized functions as the software industry matured.
“Data science is a mixture of engineering, math, computer science,” said Nicholson, who has a Ph.D. in behavioral economics from Stanford and worked as a data scientist at LinkedIn before joining Accretive. “The idea is that being a data scientist, what you are doing is that you are bringing all these different skills together, and the sum of those things is greater than the parts themselves.”
Nicholson emphasized that a data scientist gets involved in every aspect of the process, from collecting data to implementing an analytical model.
“You want to start with asking the right questions [and work] all the way down to making those insights actionable,” he said. “That is the entire world of trying to have an impact with insights that you are culling from data.”
He said the process involved five elements:
1. Ask the right questions.
A data scientist approaches a given field recognizing the possibilities in a particular set of circumstances (an industry’s challenges, for example) or in a particular dataset, Nicholson said. Given a situation and the datasets related to it, data scientists are eager to answer questions and address problems.
Take LinkedIn, for example, and its network of 175 million professionals. “LinkedIn, from my perspective, is one of the richest datasets on the planet. So if you are excited about data, you should be able to go look at that dataset and think of different things to do” to develop new insights, he said.
The health care industry opens another frontier. “Think about that industry, there are so many problems you can answer with that data. So you should have a long list of things that you want to tackle,” he added.
From these possibilities, the questions flow and the chase for answers is on. A data scientist can talk about the issues to explore, the datasets required, the sources for that data, and the analytical models to be applied to the datasets. This quality “is the most important thing in a good data scientist,” he said. It is also a difficult quality to find, he added.
2. Don’t rely on the latest statistical model—use simpler ideas, too.
The quest for answers should not turn into a sprint to the state-of-the-art predictive model, which Nicholson called a common mistake. Not every problem is a nail waiting for a sledgehammer, and the newest model is not the fastest way to results. Many smart people have solved many problems in the past, so it is worth looking first to common, proven approaches and straightforward ideas that deliver fast results.
“The goal is to get to value,” he said. “Good data scientists are always thinking about getting to value. You’re not thinking about techniques. You’re not thinking about infrastructure. Any time someone on my team says ‘I just read a paper,’ I immediately know that OK, we probably need to refocus our energy here.”
Nicholson said the photos now prevalent on LinkedIn’s recommendation engine came about after Sam Shah, a software engineer at the company, proposed a test of the idea: show users a photo of a person they might join on the network, and see whether it enticed more engagement.
The idea worked. “Sure, you need algorithms along with that, but if your goal is to get value quickly and you have this idea, ‘Wow we can make this better if you put a photo on it,’ good data scientists think like that,” Nicholson said.
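A test like the photo experiment is usually judged by comparing engagement rates between users who saw the change and users who did not. The sketch below shows one common way to do that, a two-proportion z-test. The article does not say how LinkedIn evaluated its test, so this is an illustration only; the function name and the counts in the usage line are hypothetical.

```python
import math


def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """Compare engagement rate B (e.g. with photo) against rate A (without).

    Returns (lift, z, p_value), where lift is rate_b - rate_a and
    p_value is the two-sided p-value under the normal approximation.
    """
    rate_a = clicks_a / n_a
    rate_b = clicks_b / n_b
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_b - rate_a, z, p_value


# Hypothetical counts, made up for illustration; not LinkedIn data.
lift, z, p_value = two_proportion_z(100, 1000, 150, 1000)
print(f"lift={lift:.3f} z={z:.2f} p={p_value:.4f}")
```

A simple check like this fits the point of the section: the idea is cheap to test, and the result says quickly whether it is worth building on.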
3. Extract and clean your data.
Nicholson estimated that data science teams spend 80 percent of their time cleaning data. It’s not glamorous—he likened it to running a race through a thick sloppy mud pit, calling the process “dirty, disgusting work.”
The reason data scientists need to be involved in this process is to keep the project on track, so that everyone on the team knows which data to include in the work. Otherwise, he said, engineers on a big data project “are dumping whatever they want into a Hadoop cluster and downstream as a data scientist you are trying to consume this stuff, and you have no idea what it means. And you realize you have to go find the engineer and clean it up. It’s just a mess. So the lesson is to have your data scientist sit next to the people who are logging the data.”
In health care, he said, “most of the work and a lot of the barriers to creating value” involve extracting data from electronic health records. The records “are great in a lot of ways, because now rather than doctors having those shelves and shelves and shelves with paper patient medical records, now it’s all in an electronic form which on one hand is amazing and that’s powering a lot of growth in health care analytics which I’m really excited about. At the same time it’s hard to get the data out of the electronic health records and it’s hard to get the data into the electronic health records.”
“So from my perspective a lot of the work that I’m going to have to do is really in getting around those areas. But it just fits into the broader story. It’s not like you can just show up and go like I’m a brilliant data scientist and you give me data and I’ll put in a model and we’ll go and get some beers, this is dirty work,” he said. “And [data scientists] have to be part of all of it to make sure that you’re getting what we need to actually put into the model and put into the visualization.”
4. Build a model—but don’t reinvent the wheel.
Building an analytical or predictive model does not mean building a model that’s new and unique, Nicholson said.
“In my experience, the problem you are solving, it’s already been solved. Someone has already done this,” he said. “There is no need to go out and invent the wheel. You just need to go out and find the wheel. And apply that information. Apply that technique.”
Nicholson cited Yahoo’s data science team as an example of this approach: the company deployed a model using logistic regression to target online advertising. He said he expected the company had added refinements to its model over time, but he hailed the approach of using what he called a “workhorse model.”
“For the goal of getting to value quickly, for learning about your data and getting any feedback, then you want to make sure you move to value as fast as you possibly can,” he said. The approach of testing out established models saves time and reduces the chance a team will go down a wrong path.
“It kind of allows you to avoid these bogeys where you realize that my data, it wasn’t really what I thought it was. Or this variable which I thought was crucial to my model is just not important at all. I can learn those types of things very, very early on in a spectrum of approaches, one being not so technical and one being super technical. Make sure you’re constantly iterating with workhorse basic standard approaches before you get into the really difficult stuff.”
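To make the “workhorse” idea concrete, here is a minimal sketch of a plain logistic regression fitted by gradient descent on synthetic data. It is not Yahoo’s model or Nicholson’s code. The data, the feature names (`signal`, `noise`) and the function names are all assumptions made for illustration. The point is the early lesson he describes: a basic model can show quickly that a variable expected to matter carries almost no weight.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, lr=0.5, epochs=500):
    """Fit bias and weights by batch gradient descent on the mean log loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return bias, weights


def make_synthetic(n=500, seed=0):
    """Synthetic data (an assumption): the label depends on `signal` only."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        signal, noise = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append([signal, noise])
        labels.append(1 if rng.random() < sigmoid(2.0 * signal) else 0)
    return rows, labels


rows, labels = make_synthetic()
bias, (w_signal, w_noise) = fit_logistic(rows, labels)
print(f"signal weight: {w_signal:.2f}, noise weight: {w_noise:.2f}")
```

Because the features are on the same scale, the fitted weights can be compared directly: the weight on `noise` should land near zero, which is the kind of finding worth having before investing in anything more elaborate.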
5. Deploy the model—and iterate.
It is essential for data scientists to be involved in every aspect of a project, Nicholson said, from start to finish, including deployment.
While this end-to-end engagement is a straightforward concept, that doesn’t make it easy to do. Nicholson said that staying with a project through deployment is another challenge for many would-be data scientists—just like asking the right questions. For some, the tendency is to split up the work too much.
“A lot of times I have seen teams where a question comes down to them and they send off a data request to another team that extracts the data, they get the data, they build a model and they ship off the model to someone else. That is not data science. That is doing statistics or predictive modeling,” he said. “It’s harder to be good at that, when you can’t own the end-to-end. You need to be able to iterate.”
This iterating on a project is critical to achieving results, to getting to value, Nicholson said. It means working with other stakeholders to understand important issues, key datasets—and being able to defend the theories and thinking behind analytical models.
In his work in health care, Nicholson said he sees the potential for data science to make a big impact.
“Population health management is essentially about finding a population of patients and being proactive about their care. Our health care system is purely reactive, and so if we can find the one person who is most likely to have an unplanned episode of care and we can reach out to them, and touch them, and say, ‘Hey, you forgot your meds,’ or ‘Make sure you do this,’ it changes what goes on every day. Whatever it is, we can then dramatically increase the quality of life for patients,” he said.
“I can build a model and I can prioritize patients [for recommended actions], but if the physicians that I’m working with, if they don’t understand what this model means, or how you interpret it, or why the person’s first, you have no impact. You need to be part of that end-to-end process. So that you’re sitting there with the physicians and saying, ‘Hey, I did all this great work and this is why this person is No. 1 on the list to call. And here is what the model is telling us. This is what the [data] means.’ You need to be part of that deployment,” Nicholson said.
The problem with leaving before a model gets deployed is that subject matter experts may abandon the data-driven model and fall back on past practice rather than try what the data suggests would be beneficial, he said. Staying involved in the deployment also enables the data scientist to check the results of an analytical model, and then work with business partners to improve it.
“For every person for every observation in my dataset, and every prediction I am making, I should be able to say, this is my prediction and this is why, and this is the data that went into the model,” Nicholson said. “Because that way, just as a sanity check for me, before a model gets shipped out, I can see exactly if things make sense or not. So if I’m saying this person is a high priority [for action] and it turns out that for some reason this person has never been to a hospital at all and I have no data on them, that [result] is wrong. Something is wrong with my model. And I need to iterate on this.”
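The sanity check Nicholson describes can be sketched in a few lines: keep the inputs alongside each prediction, print them next to the score, and flag any high-priority prediction that has no data behind it. This is an illustrative sketch, not Accretive Health’s process. The record layout, field names, threshold and sample records below are all hypothetical.

```python
def has_data(prediction):
    """True if at least one model input has a recorded (non-None) value."""
    return any(v is not None for v in prediction["inputs"].values())


def explain(prediction):
    """One line per prediction: the score and the data that went into it."""
    head = f"{prediction['patient_id']}: score {prediction['score']:.2f}"
    if not has_data(prediction):
        return f"{head}, no input data on record"
    details = ", ".join(
        f"{name}={value}"
        for name, value in sorted(prediction["inputs"].items())
        if value is not None
    )
    return f"{head} from {details}"


def find_suspicious(predictions, high_priority=0.8):
    """High-priority predictions with nothing behind them: likely model bugs."""
    return [
        p for p in predictions
        if p["score"] >= high_priority and not has_data(p)
    ]


# Hypothetical records, made up for illustration.
predictions = [
    {"patient_id": "p1", "score": 0.90, "inputs": {"visits": 3, "missed_refills": 2}},
    {"patient_id": "p2", "score": 0.95, "inputs": {}},
    {"patient_id": "p3", "score": 0.20, "inputs": {}},
    {"patient_id": "p4", "score": 0.85, "inputs": {"visits": None}},
]

for p in predictions:
    print(explain(p))
for p in find_suspicious(predictions):
    print("check before shipping:", p["patient_id"])
```

Running a check like this before a model ships turns “something is wrong with my model” from a surprise in deployment into an item on the next iteration.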