Hadoop System Developers Carry on Quest for Real-Time Queries

by Ian B. Murphy   |   February 28, 2013

SANTA CLARA, Calif.—The whole idea behind implementing an analytical system in business is to deliver the right information to a decision-maker at exactly the right time. That information, based on data that’s been stored, analyzed and likely visualized, helps the decision-maker guide his or her business in a profitable direction.

So what if the exact right time is the moment the data is collected, known as real time? As it stands now, a system built on Hadoop alone isn’t equipped to produce analyzed data in real time. The technology at the heart of Hadoop, a distributed file system scanned by the batch computing process MapReduce, can’t deliver insights on data as it’s being collected.


At the Strata/Hadoop World conference in Santa Clara, it’s clear there is plenty of brain power pointed towards the problem of creating a real-time analytics and reporting system that can handle massive data sets.

“The world is not static,” said Nagui Halim, an IBM Research Fellow presenting his company’s take on real-time analytics. “What you’re looking at and what you’re analyzing could be changing almost instantaneously.”

That effort to build real-time capabilities has taken researchers down several different avenues. One is adding interactive SQL support to data stored in Hadoop, as Cloudera’s Impala, Hadapt and the Apache Drill project do; Impala and Drill are both loosely based on Google’s ad hoc query system, Dremel. These projects aim to query data stored in Hadoop quickly, allowing iterative questions to pull insights from the data.

Justin Erickson, a senior product manager at Cloudera, said Impala can take advantage of Hadoop’s flexibility, scalability and efficiency, a combination Cloudera believes is “an order of magnitude better” than any other data store’s. Hadoop scales, runs on commodity hardware and is fault tolerant.

SQL support also connects the millions of skilled SQL users to Hadoop. Few of them know MapReduce programming, but they understand how to develop SQL queries, and that opens the massive data sets stored in Hadoop to business intelligence veterans.

By providing a SQL connection with Impala, “I can take the hundreds of SQL users in my enterprise and point them at Hadoop with an interface that they’re used to and a response time they’re used to,” Erickson said.

Impala’s query response time is measured in seconds, instead of the minutes MapReduce takes. That won’t match the sub-second responses some systems are striving for, but it allows Impala to connect to the dozens of business intelligence tools on the market, like MicroStrategy or Pentaho, and return results in less than a minute. Impala is still in private beta.
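To make that concrete, here is a minimal sketch of what such an interactive query looks like from Python, using the impyla client’s standard DB-API interface. The host, table and column names are hypothetical, not drawn from Cloudera’s presentation:

```python
# A minimal sketch of an interactive SQL query against Impala, using
# the impyla client's DB-API interface. Host, port, table and column
# names are hypothetical placeholders.
from impala.dbapi import connect

conn = connect(host='impala-daemon.example.com', port=21050)
cursor = conn.cursor()

# The kind of ad hoc, iterative question a BI analyst would ask --
# no MapReduce job is written or submitted.
cursor.execute("""
    SELECT region, COUNT(*) AS orders, SUM(total) AS revenue
    FROM sales
    WHERE order_date >= '2013-01-01'
    GROUP BY region
    ORDER BY revenue DESC
""")

for region, orders, revenue in cursor.fetchall():
    print(region, orders, revenue)

cursor.close()
conn.close()
```

The point is the workflow: an analyst can refine the WHERE clause and re-run in seconds, rather than waiting on a batch job.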

Teaching Older Databases Some Big Data Tricks
Another way to get at information as quickly as possible is to update older database technologies to handle the new big data workloads that Hadoop specializes in, an approach taken by updated relational databases such as Drawn-to-Scale, NuoDB and TransLattice.

Tim O’Brien, an application developer and blogger, suggested that in the rush to adopt Hadoop or NoSQL alternatives, many people moved too quickly past relational databases without giving them a chance to evolve.

“Relational databases are seen as a design failure up front by certain kinds of developers at certain kinds of companies,” O’Brien said. “We’ve moved away from structure, but structure can be good for developers.”  O’Brien said that Google’s announcement of F1, a new relational database that’s fault tolerant and currently in development, shows he’s not alone in his belief.

The Open Source Route
Still others are cobbling together several open source technologies to work around Hadoop. Justin Langseth of Zoomdata and Byron Ellis of LivePerson showed off two systems built from open source ingredients, including the distributed messaging system Kafka, the stream processing system Storm and the visualization library d3.js.
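Neither company published its code, but the shape of the pipeline is common to both: consume a stream of events, maintain a sliding-window aggregate and expose snapshots for the visualization layer. Here is a rough, self-contained Python sketch of that middle stage, with a stubbed generator standing in for Kafka and Storm, and invented event names:

```python
# A rough sketch of the aggregation stage that sits between an event
# feed and a d3.js dashboard. The event stream is stubbed out with
# random data; in the systems shown at the conference, Kafka would
# supply the events and Storm would run logic like this at scale.
import json
import random
import time
from collections import Counter, deque

WINDOW_SECONDS = 60

def event_stream():
    """Stand-in for a Kafka consumer: yields (timestamp, event_type)."""
    while True:
        yield time.time(), random.choice(
            ['chat_started', 'chat_ended', 'page_view'])
        time.sleep(0.1)

window = deque()  # events currently inside the sliding window

for ts, event_type in event_stream():
    window.append((ts, event_type))
    while window and window[0][0] < ts - WINDOW_SECONDS:
        window.popleft()  # age out events older than the window
    # JSON snapshot of live counts -- the form a d3.js page could poll.
    print(json.dumps(Counter(event for _, event in window)))
```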

For Ellis and LivePerson, real time is the whole game. LivePerson serves as a customer service and engagement platform for thousands of companies, and according to Ellis: “Everybody wants to know about everything in real time. This is especially true about workforce management; this is your constrained resource in customer service.”

Zoomdata launched its new visualization and business intelligence product on the Apple App Store on Feb. 27. It visualizes any mix of real-time data streams on an iPad or on the Web. Langseth, Zoomdata’s CEO, said the ability to add any visualization, any data stream or any algorithm for analytics is what will set his company’s product apart. He said Zoomdata is aiming to be as simple and usable as Google Earth.

“We want people to be able to zoom deeper and deeper into data and allow them to explore the data in real time,” Langseth said.

Tim Palko and C. Aaron Cois of Carnegie Mellon University showed off an entirely open source system for monitoring and analyzing sensor data, built from the open source in-memory data store Redis, the Python programming language and the PostgreSQL database.

The sensors themselves come from an open source project at their university called Sensor Andrew. Palko and Cois explained their entire engineering process: There was no Hadoop involved, but both suggested Hadoop would be needed down the road to store the terabyte of data a large sensor network would create each week.
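As an illustration of the pattern they described, here is a minimal Python sketch of the ingest step, caching each sensor’s latest reading in Redis for live monitoring while appending history to PostgreSQL. The connection details, key scheme and table schema are assumptions, not the CMU team’s actual design:

```python
# A minimal sketch of a sensor-ingest step in the Redis/Python/
# PostgreSQL style Palko and Cois described. The key scheme and
# table schema here are invented for illustration.
import time

import psycopg2
import redis

r = redis.Redis(host='localhost', port=6379)
pg = psycopg2.connect('dbname=sensors user=monitor')

def ingest(sensor_id, value):
    now = time.time()
    # Hot path: latest value per sensor, readable instantly
    # by a live dashboard.
    r.hset('sensor:%s' % sensor_id,
           mapping={'value': value, 'ts': now})
    # Cold path: durable history for after-the-fact analysis.
    with pg:
        with pg.cursor() as cur:
            cur.execute(
                'INSERT INTO readings (sensor_id, value, ts) '
                'VALUES (%s, %s, to_timestamp(%s))',
                (sensor_id, value, now),
            )

ingest('temp-01', 22.4)
```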

At Etsy, Engineering Challenge Versus Practicality
The rush toward real time comes with caveats. Representatives from Etsy, the online handmade goods retailer, gave a presentation about the funnels it built to capture clickstream data on its website. The retailer tracks how shoppers find what they want in order to learn how users could better interact with Etsy’s systems.
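A funnel, in this context, boils down to counting how many sessions survive each step on the way to a purchase. As a toy illustration, and not Etsy’s code, a daily batch computation over a clickstream log might look like this, with invented event names and log format:

```python
# A toy daily-batch funnel count over logged clickstream data. The
# funnel steps and log format are invented for illustration; each log
# line is assumed to be "session_id<TAB>event_name".
from collections import defaultdict

FUNNEL = ['search', 'view_listing', 'add_to_cart', 'purchase']

def funnel_counts(log_path):
    events_by_session = defaultdict(set)
    with open(log_path) as log:
        for line in log:
            session_id, event = line.rstrip('\n').split('\t')
            events_by_session[session_id].add(event)

    counts = dict.fromkeys(FUNNEL, 0)
    for events in events_by_session.values():
        for step in FUNNEL:
            if step not in events:
                break  # session dropped out of the funnel here
            counts[step] += 1
    return counts

print(funnel_counts('clickstream-2013-02-27.log'))
```

Each session either advances to the next step or drops out, so the counts show where shoppers abandon the flow.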

The team’s original system was built on Hadoop, but they tried to improve it by developing real-time clickstream analysis in PHP and HTML. The benefit was that the system was easy to use for people who didn’t understand Hadoop, but it had several drawbacks.

The real-time data stream was difficult to reconcile with data stored elsewhere, leading to inconsistencies. The system also couldn’t look back more than a few hours; it could only monitor what was happening at the moment.

The system turned out to be more trouble than it was worth, and revealed a fundamental truth about Etsy’s workflow, according to Wil Stuckey, an engineer there.

“At Etsy, we don’t make [website efficiency] decisions in real time,” he said. “We’re willing to wait for more data so we can be confident in more decisions.”

Stuckey and his fellow engineers built out a newer system in Hadoop, providing more access to non-engineers, but still relying on a daily batch process instead of real time.

“It was a hard lesson learned,” Stuckey said. “We spent a lot of time and effort building something we didn’t end up needing.”

Stuckey said the question of how to create a real-time analytics system is a fascinating engineering problem, but real-time analytics aren’t necessary in a lot of situations.

“Oftentimes people let interesting engineering problems get in the way of practicality,” he said.

Email Staff Writer Ian B. Murphy at ian.murphy@wispubs.com.

Home page photo of Cloudera booth at the Strata conference by Ian B. Murphy.

