Edwin Chen is a data scientist on the revenue side at Twitter, exploring how advertising runs on the service’s live stream and analyzing how different kinds of ads impact different populations.
He previously worked for Clarion Capital as a quantitative trader and studied linguistics and theoretical computer science at MIT. His data visualizations have been featured on FlowingData.com, and Chen has no problem penning a several-thousand-word treatise on his blog about any topic that interests him, such as the dos and don'ts of using Amazon's crowdsourcing service, Mechanical Turk.
Chen said Twitter offers a unique opportunity for companies to learn how their marketing is reaching people, how their products are viewed, and what conversations are happening about different aspects of their business.
He's only been at Twitter for about a year, but because the company allows its data scientists to focus on areas that interest them, he has concentrated on visualizations, machine learning, and crowdsourcing. He also hopes to adopt Spark, an open-source cluster-computing system for data analytics, in the near future for interactive analysis.
Chen spoke to Data Informed about his work at Twitter, how he hopes Spark can speed up his computing jobs, and his interest in data visualizations. Below is an edited version of the conversation.
Data Informed: What does being a data scientist at Twitter entail?
Edwin Chen: We have a bunch of data scientists at Twitter, and we all do different things depending on our interests. In general, it means that we do a little more quantitative work than we would normally, and how much more quantitative work depends on our interests. Some of us basically just do analysis all the time, some of us work on more machine learning problems, some of us focus more on visualizations.
What is your role? What are the things you’re interested in?
Chen: I work on the revenue team at Twitter, so what I do is a lot of analysis on how our ads are performing across different markets. Things like what kinds of ads perform better in the U.S. compared to other countries. It's interesting just to see how performance differs across countries.
I also work on our machine learning models, so things like how we can improve search targeting and how we can show more search ads. And I do spot data visualizations.
How has your role evolved, and do you see it changing more?
Chen: When I first started at Twitter, we didn’t have a lot of visualizations, so I started by doing a lot of visualizing because that was something that I was really interested in. We have a pretty healthy visualization system now, and that really helps us better understand all our ads.
Another thing I started was using crowdsourcing to improve our ads. A lot of our metrics are click-based, so whenever we launch a new experiment, we want to make sure it's increasing CTR (click-through rate), for example. But one thing CTR doesn't capture, and that we want to make sure we're measuring, is relevance.
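The CTR comparison Chen describes can be sketched in a few lines. This is a minimal illustration with invented numbers, not Twitter's actual experiment pipeline:

```python
# Minimal sketch of an A/B comparison on click-through rate (CTR).
# All counts below are made up for illustration.

def ctr(clicks, impressions):
    """Click-through rate: fraction of ad impressions that received a click."""
    return clicks / impressions if impressions else 0.0

# Hypothetical experiment vs. control counts.
control = {"clicks": 420, "impressions": 100_000}
experiment = {"clicks": 510, "impressions": 100_000}

control_ctr = ctr(control["clicks"], control["impressions"])
experiment_ctr = ctr(experiment["clicks"], experiment["impressions"])
lift = (experiment_ctr - control_ctr) / control_ctr

print(f"control CTR:    {control_ctr:.4f}")
print(f"experiment CTR: {experiment_ctr:.4f}")
print(f"relative lift:  {lift:+.1%}")
```

In practice the decision would also involve statistical significance testing, not just the raw lift.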
One thing I've been doing is introducing the use of human computation: when we launch a new experiment, we'll show judges around the world a bunch of ads and ask them whether the ads are still relevant to users.
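A common way to turn several human judgments into one label is a simple majority vote. This is a hedged sketch of that aggregation step; the labels and votes here are invented, and real crowdsourcing pipelines often use more sophisticated judge-quality weighting:

```python
# Aggregate crowdsourced relevance judgments by majority vote.
from collections import Counter

def majority_label(judgments):
    """Return the most common label among a list of judge labels."""
    return Counter(judgments).most_common(1)[0][0]

# Hypothetical: three judges rate one (user, ad) pair.
votes = ["relevant", "relevant", "not relevant"]
print(majority_label(votes))  # -> relevant
```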
If you were running a business, how would you use Twitter’s data?
Chen: Assuming that my business was large enough that people were talking about it online, then I’d mine it to see what particular parts of my business they are talking about. I’d look to see what certain features of my business are being talked about most. Say if someone posts a link about my business, seeing which ones get retweeted more often, so seeing what people are generally more interested in. It’s kind of a way of doing user research through Twitter.
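The link analysis Chen sketches, seeing which shared links get retweeted most, amounts to a simple group-and-sum. The tweet records below are invented placeholders; a real analysis would pull data from the Twitter API:

```python
# Tally retweet counts per shared link to see which topics draw the most
# interest. Tweet data here is fabricated for illustration.
from collections import defaultdict

tweets = [
    {"link": "example.com/pricing", "retweets": 12},
    {"link": "example.com/new-feature", "retweets": 48},
    {"link": "example.com/pricing", "retweets": 5},
]

retweets_by_link = defaultdict(int)
for t in tweets:
    retweets_by_link[t["link"]] += t["retweets"]

# Most-retweeted links first.
for link, count in sorted(retweets_by_link.items(), key=lambda kv: -kv[1]):
    print(link, count)
```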
If you are a smaller business, is there a similar opportunity?
Chen: I would say that all depends on whether you can get enough data. As long as you can, you can better understand how users interact with your business.
What new opportunities or technologies are there at Twitter?
Chen: One thing that I've been interested in starting to use at Twitter—there are a couple of people interested in this and trying to get it started—is a new big data system coming out of the University of California, Berkeley, called Spark.
Now, whenever you run a MapReduce or Hadoop job, it's a batch process: if I launch a Hadoop job, I have to wait a long time for the job to get started and for it to get back to me, because that's the way Hadoop is built. It's built as a batch system, so you can't really do things like interactive analysis on it. It's still valuable for evaluating massive amounts of data, though.
But with Spark, instead of the machines the job is running on having to reload the data all the time, the data is stored in RAM, so your jobs finish much faster, fast enough that you can very easily do interactive analysis.
I can write a query and it will process terabytes of data and return results in a couple of seconds. This is going to be very useful for more interactive analysis, if I want to quickly figure out what's going on, or if sometime in the future you were able to hook it up to an active visualization system and then visualize and manipulate massive amounts of data. That's going to be really cool.
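The speedup Chen describes comes from loading the data once and keeping it in memory, rather than rereading it for every job. The following is a conceptual sketch in plain Python, not actual Spark code, with a simulated disk delay standing in for a distributed load:

```python
# Conceptual illustration of in-memory caching for interactive analysis:
# the expensive load happens once, and each later query hits the cached copy.
import time

def load_from_disk():
    """Stand-in for an expensive distributed load (simulated I/O delay)."""
    time.sleep(0.1)  # pretend this is slow disk/network I/O
    return list(range(1_000_000))

# Batch style: reload the data for every query.
def query_without_cache(predicate):
    data = load_from_disk()
    return sum(1 for x in data if predicate(x))

# Spark-style: load once, keep in RAM, query repeatedly.
cached = load_from_disk()

def query_with_cache(predicate):
    return sum(1 for x in cached if predicate(x))

print(query_with_cache(lambda x: x % 2 == 0))  # -> 500000, no reload needed
```

In real Spark, the analogous move is marking a distributed dataset as cached so the cluster keeps its partitions in memory across queries.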