Statistics? Yes. Computer science? Obviously. But for undergraduates exploring data science, the most important concepts to learn are problem-solving, synthesis—correlating one dataset with another—and storytelling.
Those are the skills that will best help the next generation of data scientists take seemingly unrelated data from multiple sources and discover correlations to better understand how people, businesses and machines behave so they can solve important problems.
That was the principle behind two courses taught to undergraduates for the first time in the fall of 2012: Columbia’s “Introduction to Data Science” and San Jose State’s “Introduction to Big Data.”
At Columbia, “Introduction to Data Science” was taught by Rachel Schutt, a senior statistician at Google’s research division in New York. Schutt has a Ph.D. in statistics from Columbia, and first proposed the course as a seminar, featuring several guest speakers talking about their jobs in data science.
That idea grew into a course in the statistics department that Schutt would teach, which left her searching for textbooks and curriculum guidance. She bought 10 to 15 books on topics as varied as machine learning and experimental design, but none of them covered all the topics crucial for a class on data science.
“It took me a while to realize that the fact that there wasn’t a textbook meant that this was an innovative, new thing,” Schutt said. “It actually caused me a great deal of confusion and frustration before the class started.”
She ultimately scrapped the single-textbook idea, relying instead on her own experience and on conversations with colleagues at Google, professors at Columbia and the New York City data science community to guide the curriculum.
The topics she settled on, which she described as “threads to run through the entire course,” were machine learning algorithms and modeling, data visualization, computer coding, ethics and data science “habits of mind.”
Teaching the Data Scientist Mindset at Columbia
Schutt said the habits-of-mind section was hard to define and difficult to teach; she viewed these habits not only as tips and tricks from professionals but also as a way to promote a deeper understanding of the mindset a successful data scientist requires.
“These are things like creativity, knowing what to do when you don’t know what to do, and how to ask good questions,” Schutt said. “It’s not really a skill, and it’s hard to teach, but that’s actually something I found most interesting about trying to create a class like this, which is how do you teach those things that people say you can only learn on the job. It’s really about being a thinker, and being a curious person.”
These topics were reinforced in the coursework and by guest speakers from Google and other technology companies, as well as by other faculty members. As part of the course, Schutt wrote a blog, both as a resource for students and as a way to document and reflect on the course as she taught it. Like the class, the blog features guest contributors, including other data science professionals, professors and students from the class.
The class covered several technical and practical topics. Schutt said she taught students how to run statistical algorithms on massive datasets across multiple machines and discussed the problems that arise in that process, which the industry is still working out. She also covered how to create compelling data visualizations, a skill she said she struggles with herself and one for which there aren’t nearly enough classes.
She spent a lot of time on ethics, discussing the pitfalls of building consumer-facing products in which decisions are made by machines using metrics that might not capture the whole picture. Products in health care, mortgage banking or credit scoring can affect a person’s whole life, so a lot of thought needs to go into automated processes.
Many of the topics covered in the course aren’t new, Schutt said, pointing out that statisticians are rightly irritated when using historical data to better understand business or human behavior is passed off as something entirely new. It isn’t; it has been happening for a long time. But what is new, Schutt said, is the convergence of several technologies that create massive, unstructured datasets.
What data scientists have to learn are the skills needed to analyze those unstructured datasets.
“The type of data we have now, because of the Internet, it is different than it was even 10 years ago,” Schutt said. “It’s location-based data and time-stamped data, and all the data that humans leave behind as traces of themselves on the Internet and the Web.
“We’ve let technology into our lives a lot more over the last 10 or 15 years. That means that data disseminated from that technology is a bigger part of our lives and that means we’re more accepting of it. People always want to learn about people, and so now there is this relationship between machines and people that is more pronounced than it used to be.”
The Quest for Correlations in Unstructured Data at San Jose State
The San Jose State course, “Introduction to Big Data,” was taught in that university’s computer science department by Professor Peter Zadronzy, who is also a performance consultant for Splunk.
Zadronzy consulted with Rob Reed, Splunk’s worldwide education evangelist, as well as several other Splunk employees to identify which big data skills were being overlooked in the university’s computer science courses.
The conversations were “wonderful and sometimes exhausting,” Reed said, but the final result was distilled down to this: “How do we handle huge volumes of information whose structure, size, velocity, [and other characteristics] we do not know beforehand. We know there is value in it, but how do we extract business value out of it? That was at the heart of the San Jose State course.”
Zadronzy drew on technology partnerships with Splunk, Cloudera and GoGrid to illustrate techniques for teasing out insights from large datasets that seem to be completely unrelated. Students worked with Splunk in teams as part of a final project and presented their results to the company’s engineers and executives at Splunk’s San Francisco offices.
Reed said the crucial concept in today’s big data education is the key-value pair, where a single key identifier can track a person’s behavior, whether through clickstream data on the web or geolocation data from a mobile device.
But according to Reed, the differentiator in business right now is not the technological skill needed to store key-value pairs. It is teaching students how to think about different and interesting relationships in the data, relationships that can be identified by pairing keys and values from different datasets.
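To make the idea of pairing records across datasets concrete, here is a minimal Python sketch. It is an illustration only, not the approach taught in the course or anything specific to Splunk: the records, the field names (user_id, page, city) and the helper functions index_by_key and correlate are all invented for this example. It groups two small datasets by a shared key and keeps only the keys that appear in both.

```python
from collections import defaultdict

# Hypothetical sample data: every field name and value here is invented
# for illustration and does not come from the courses described above.
clickstream = [
    {"user_id": "u1", "page": "/pricing"},
    {"user_id": "u2", "page": "/docs"},
    {"user_id": "u1", "page": "/signup"},
]
geolocation = [
    {"user_id": "u1", "city": "San Jose"},
    {"user_id": "u3", "city": "New York"},
]


def index_by_key(records, key):
    """Group a list of records by the value stored under `key`."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key]].append(record)
    return grouped


def correlate(left, right, key):
    """Pair up records from two datasets that share the same key value."""
    left_index = index_by_key(left, key)
    right_index = index_by_key(right, key)
    # Keep only the key values that occur in both datasets.
    shared = sorted(left_index.keys() & right_index.keys())
    return {k: {"left": left_index[k], "right": right_index[k]} for k in shared}


joined = correlate(clickstream, geolocation, "user_id")
for user, sides in joined.items():
    pages = [r["page"] for r in sides["left"]]
    cities = [r["city"] for r in sides["right"]]
    print(user, pages, cities)
```

In this toy data only u1 appears in both sources, so that is the only key for which web activity and location can be examined together. The mechanics of the join are simple; the harder question Reed describes is deciding which datasets and which keys are worth pairing in the first place.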
“Let’s turn out students who are not bound by [a single] way of thinking, and who … will think of correlations across datasets that nobody has thought of before,” Reed said. “When students get that ability to correlate, it’s amazing to see the light bulb go on.”
To Reed, the ability to correlate and synthesize is the difference between someone who works with data and a data scientist.
“If you can’t synthesize and tell a good story, you’re not going to be a good data scientist,” Reed said. “You might be a good data functionary, or a data analyst, but nobody is going to look to you to solve a problem.”