Hadoop is a major advance in storing and crunching large unstructured datasets, but its batch processing approach often isn’t right for projects that involve quick, exploratory analysis or advanced algorithms.
That weakness, the inability to reuse intermediate results across several simultaneous computations, is what researchers at the University of California, Berkeley’s AMP lab are trying to overcome with the open source project Spark. The project started in 2010, and its development team released version 0.6.0 on Oct. 15.
Spark pulls data from the Hadoop Distributed File System (HDFS) into an in-memory computing environment and allows quicker query times, more interactive data mining and iterative analytics that can lead to faster insight.
Matei Zaharia, one of the creators of Spark, is a Ph.D. student at Berkeley studying on a Google fellowship. He said the project began when some co-creators were trying to use Hadoop for machine learning algorithms and found the system far too slow.
Hadoop can be too slow for some algorithms when you’re looking for answers now, he said; Spark can make jobs that take hours in Hadoop run in several seconds.
“There is actually a lot you can do in the computing engine to improve performance for these iterative applications, an algorithm that makes multiple passes over the same data,” Zaharia said. “Most people are using [Spark] to run analytic reports faster, especially to do interactive applications. An analyst or an end user can launch new queries on the data and get answers very quickly.”
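The cost Zaharia describes can be illustrated with a minimal sketch in plain Python. This is not Spark's API; the names `load_points` and `fit_slope` are hypothetical, and the "expensive read" is only simulated with a counter. The sketch assumes a toy gradient-descent fit that makes one full pass over the data per step, and compares re-reading the input on every pass with loading it once and reusing it from memory.

```python
# Illustrative sketch only: plain Python, not Spark's API.
# load_points and fit_slope are hypothetical names for this example.

load_calls = 0  # counts how often the "expensive" read happens


def load_points():
    """Stand-in for an expensive read from disk; counts its invocations."""
    global load_calls
    load_calls += 1
    # Five points that lie exactly on the line y = 3x.
    return [(float(x), 3.0 * x) for x in range(1, 6)]


def fit_slope(get_points, passes=50, lr=0.01):
    """Fit y = w * x by gradient descent, one full pass over the data per step."""
    w = 0.0
    for _ in range(passes):
        pts = get_points()
        grad = sum(2 * (w * x - y) * x for x, y in pts) / len(pts)
        w -= lr * grad
    return w


# Batch style: every pass re-reads the input.
w_reload = fit_slope(load_points)
reloads = load_calls

# In-memory style: read once, then reuse the cached list on every pass.
load_calls = 0
cached = load_points()
w_cached = fit_slope(lambda: cached)
cached_loads = load_calls

print(reloads, cached_loads)
```

Both runs arrive at the same slope, but the first performs one read per pass while the second performs a single read. When each read means going back to disk across a cluster, that difference is the kind of overhead an in-memory engine aims to remove.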
Predictive Modeling Use Case
That was the case for Erich Nachbar, the CTO of analytics startup Quantifind, who found that Spark reduced the predictive modeling jobs his company does from about an hour in Hadoop to tens of seconds.
Quantifind helps marketing officers for large entertainment companies, like movie and video game studios, predict how successful their releases will be for certain demographics. Nachbar said the company uses machine learning algorithms and text mining and provides predictions based on several factors like number of screens, marketing budget and what people are saying on blogs, Facebook and Twitter.
Nachbar said he’s been using Hadoop since 2007 (he organized the first user meetups in the Bay Area), but his original prototypes in Spark were such a success that Quantifind was able to rebuild its system based on Spark’s in-memory data crunching.
“We found out that the jobs that run in RAM are so quick that we could actually drive our UI through it. Spark doesn’t have the Hadoop latency where a job starts in 20 or 30 seconds, it starts instantly,” he said.
Nachbar said Spark is quick enough to apply brute force to batch processes of its data, running predictive modeling iterations about 100 times in seconds. It works well enough that Quantifind’s custom user interface allows users to compare several marketing options simultaneously.
Ion Stoica, a professor at Berkeley and an original contributor to Spark, is also the CTO of Conviva, a company that optimizes streaming video for major online content providers like ESPN and HBO.
Stoica uses Spark at Conviva to optimize video streaming for users. Running machine learning algorithms, Spark can quickly assess and decide how to offer the best streaming video experience for individual users, tweaking the video quality, transmission location, and other factors depending on the instantaneous analysis of the connection.
“The goal of implementing that loop is to improve the quality of video streaming as fast as possible,” Stoica said. “At the end of the day what you want to do is to optimize the engagement. If you have lower quality [videos], you are going to watch less and you’re not going to return the site.”
An Avenue to Stream Processing
Nachbar and Stoica said they’re excited for Spark’s next big step: stream processing.
Zaharia said the development team will add support for stream processing by the end of the year, so that the system folds new data into its calculations as it arrives in real time. That will allow users to write one piece of code and run it in Spark in either streaming or batch mode.
“We want to have a platform that can do both low latency streaming and low latency interactive or iterative algorithms,” Zaharia said. “The idea is you would get an interface to write a streaming query, so it would look a lot like a MapReduce or a Spark query. You use the same high-level operators, but as new data comes in the result gets updated over time.”
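The idea of one query serving both modes can be sketched in plain Python. This is an assumption-laden illustration, not Spark's streaming interface, which had not been released when this was written; `count_words`, `run_batch` and `run_streaming` are hypothetical names. The same function is applied once to a complete dataset, and then again to small batches as they arrive, with the result updated after each one.

```python
# Illustrative sketch only: plain Python, not Spark's streaming API.
# count_words, run_batch and run_streaming are hypothetical names.
from collections import Counter


def count_words(lines):
    """The 'query': identical logic for batch and streaming use."""
    return Counter(word for line in lines for word in line.lower().split())


def run_batch(lines):
    """Batch mode: apply the query once to the complete dataset."""
    return count_words(lines)


def run_streaming(batches):
    """Streaming mode: apply the same query to each small batch as it
    arrives and yield a snapshot of the running result."""
    total = Counter()
    for batch in batches:
        total += count_words(batch)
        yield dict(total)


lines = ["spark is fast", "hadoop is batch", "spark streams"]
batch_result = run_batch(lines)
snapshots = list(run_streaming([[line] for line in lines]))

print(snapshots[0])
print(snapshots[-1] == dict(batch_result))
```

The final streaming snapshot matches the batch result, while the earlier snapshots show the answer being refined as data comes in.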
Some of the engine improvements in the 0.6.0 release already make use of code for streaming, doubling performance for some jobs, according to the project’s website.
The project team is also working on Shark, a version of Hive, Hadoop’s data warehouse infrastructure, that runs on Spark and provides an SQL-like language for database queries.
“For every person who writes MapReduce jobs in Java, there are five or 10 people who don’t know how to do that but do know how to write SQL queries,” Zaharia said. “Having those run with the same kind of speed would be pretty cool.”
Shark will be available in the next couple of weeks, Stoica said. He has replaced his Hive system at Conviva with Spark and reduced the time for creating high-level interactive performance reports from 12 hours to half an hour.
Zaharia said a key design choice for building Spark was using Scala, which he described as a “high level language that allows concise, functional programming.” The project took off when Zaharia and his collaborators found that Scala’s interactive shell could be modified to run iterative data mining algorithms.
“Once we got that to work it was immediately clear that we could build a lot more than machine learning algorithms on top of Spark,” he said.