NEW YORK—Still a young technology, Hadoop has a lot of innovation going on around it. And this week at the Strata/Hadoop World conference, Cloudera, one of the veterans in the field, introduced an open-source, real-time query engine for the Hadoop Distributed File System called Impala that caught the attention of most attendees.
“What everyone seems to be talking about is how to build a real-time query language over HDFS,” said Chris Tynan, an analyst at Spotify, the digital music service. “Impala is an attempt at that.”
Impala, now in beta, is a real-time query engine for data stored in the Hadoop Distributed File System (HDFS) and in the tabular database HBase. Charles Zedlewski, vice president of product management at Cloudera, said the product borrows the SQL language and other elements from Hive, a SQL-like query and data warehouse engine. Impala bypasses the MapReduce distributed computing framework, skipping the batch job and querying the data stored in Hadoop directly.
Zedlewski said Impala treats HDFS and HBase, a distributed columnar NoSQL database, like a storage manager. “Impala makes better use of memory than MapReduce does; you can pass data from step to step without having to write to disk. Things like exploratory analytics or iterative queries, that’s where Impala is going to shine.”
At Spotify, Tynan said a technology like Impala could be valuable. The streaming music company is trying to make more data-driven decisions to better understand its users and how they respond to certain artists or to changes in the product. The faster he can arrive at those insights about customer behavior, the better, he said.
“All our data comes in via Hadoop,” Tynan said. “We get it at the other end and try to break it down quickly.”
Tim White, an associate at the technology and strategy consultancy Booz Allen Hamilton, said products like Impala show that Hadoop distributors are starting to plug holes in what’s still a developing technology.
“It’s maturing,” White said. “People have discovered the holes in a lot of the initial use cases, and now they’re trying to plug those holes to get to the next step. The notion of doing real-time big data analytics with something like Impala, that’s in its infancy.”
MapR Boosts HBase Performance
MapR, maker of another enterprise Hadoop distribution, made its own announcement: version M7, which fully integrates an enterprise-grade version of HBase into the technology stack.
Jack Norris, MapR’s vice president of marketing, said about 45 percent of Hadoop users also store data in HBase. MapR’s M7 distribution greatly improves HBase’s performance, provides a backup process and point-in-time recovery, and now allows for an unlimited number of tables.
“We simplified the HBase architecture significantly,” Norris said. “It’s part of the cluster, it’s integrated into common data architecture and common data management.”
Another company making an impression was Platfora, a maker of business intelligence and analytics software built specifically for Hadoop, which officially launched its product at the conference.
The Platfora software is built on top of Hadoop, takes advantage of in-memory processing and provides a Web-based user interface that’s intended to allow business users to explore data quickly and visualize the results.
Users can create what Platfora creator Ben Werther calls a “lens,” which is a special subset of data stored in Hadoop that can be pulled into an in-memory layer and crunched quickly over several iterations.
Werther said users can create one of these lenses in about an hour and perform interactive queries in seconds. He contrasts his product with traditional BI tools, which, he said, can require months to build the processes that uncover interesting data and deliver it to dashboards and visualizations.
“If you take the traditional view of the world, you have to have this structured data warehouse and you have to do all this work just to get going, and that’s painful. Hadoop turns things around completely,” he said.