Hadoop promises a low-cost way to support developers building high-performance applications to analyze large datasets. But one of the biggest complaints about Hadoop is that it processes data in batches, treating information as if it were bursts of bits and bytes rather than a steady stream in need of constant crunching. The result is delayed processing and delayed access to data, which can hamper an organization’s ability to gain real-time insight into its business processes.
“Hadoop is essentially what you do when you sleep,” says Michael Driscoll, CEO of big data analytics company Metamarkets in San Francisco, offering a simple metaphor. “Traditionally, business intelligence has been something you do at night. And in the morning when you wake up, it tells you what happened with your business the day before. That’s batch mode and it’s really the way most business processes have worked.”
The situation only worsens when data processing is performed during a business’s off hours, thereby creating a backlog of overlooked data – and missed opportunities for powerful data analysis.
In response, a growing number of vendors are offering tools that can fill the gaps created by Hadoop’s batch-by-batch approach to data processing. Cloudera’s Impala; Storm, an open source project authored by Nathan Marz of Twitter; and Metamarkets’ Druid-based data engine are all technologies that can run on Hadoop but rely on a stream processing framework. Stream processing handles data as it arrives rather than logging it in batches to be crunched later.
Here’s another analogy: stream processing is like keeping an ongoing itemization of everything you purchased – a cup of coffee, a bagel, a cookie. If you want to know how much money you’ve spent, all you need to do is access the system to find out the total amount. Batch processing, on the other hand, requires keeping the day’s receipts, so that when you want to know how much money you’ve spent, the tally is only as recent as the last time the system added everything up. For some systems, delays can be as long as half a day.
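The receipts analogy can be sketched in a few lines of Python. This is a toy illustration of the two models, not any vendor’s API; the purchase amounts are made up.

```python
purchases = [3.50, 2.25, 1.75]  # coffee, bagel, cookie

# Stream processing: update the running total as each purchase arrives.
stream_total = 0.0
for amount in purchases:
    stream_total += amount          # the total is current after every event
    print(f"running total: {stream_total:.2f}")

# Batch processing: hold the receipts, then sum them when the batch runs.
receipts = list(purchases)          # nothing is tallied in the meantime
batch_total = sum(receipts)         # only as fresh as the last batch run
print(f"end-of-day total: {batch_total:.2f}")
```

Both approaches arrive at the same number; the difference is entirely in *when* the answer is available.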
Stream processing “represents the realization among a lot of companies that it’s not enough to process their data once a day or every few hours,” says Driscoll. “They’ve got to do it as it happens. In some ways, stream processing and big data go hand in hand. If you don’t operate or compute data as it happens, you’ll never have a chance to catch up; it’s moving at such a high velocity.”
Driscoll cites fraud detection as a perfect example of stream processing at work. Imagine an American Express card being charged $5 for a parking meter at noon in Baltimore, $3 for coffee around 1 p.m., and then $20,000 for stereo equipment an hour later in Los Angeles. Using batch processing, the fraudulent purchase wouldn’t be detected until hours later.
“With fraud detection, you want to know as soon as something happens,” says Driscoll. “That kind of credit card activity can trigger a fraud alert. The sooner you can catch that, the faster you can respond. That’s why all of the banks use stream processing for fraud detection.”
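The kind of rule Driscoll describes can be evaluated per event as the stream arrives. The sketch below is a deliberately minimal, hypothetical check – flag a charge made in a different city implausibly soon after the previous one – not how American Express or any bank actually scores transactions. Timestamps are minutes since midnight.

```python
transactions = [
    {"minute": 720, "city": "Baltimore",   "amount": 5},      # noon
    {"minute": 780, "city": "Baltimore",   "amount": 3},      # 1 p.m.
    {"minute": 840, "city": "Los Angeles", "amount": 20000},  # 2 p.m.
]

def detect_fraud(events, max_gap=120):
    """Flag any charge in a new city within max_gap minutes of the last."""
    alerts = []
    prev = None
    for event in events:              # each event is checked on arrival,
        if prev and event["city"] != prev["city"] \
                and event["minute"] - prev["minute"] < max_gap:
            alerts.append(event)      # so the alert fires immediately
        prev = event
    return alerts

print(detect_fraud(transactions))     # only the Los Angeles charge is flagged
```

In a batch system, the same rule could only run after the day’s transactions were collected, by which point the money is long gone.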
Nevertheless, stream processing does have its fair share of shortcomings. For one, Driscoll says, “Stream processing is more expensive. The infrastructure required to process data as it happens is often harder to build. It’s easier just to do things in batches.”
Another potential pitfall: stream processing can be prone to errors. For example, it’s not uncommon for stream processing to accidentally drop data packets. For this reason, some companies rely on a combination of both stream and batch-by-batch processing, says Driscoll. A commodities company, for instance, can process trades as they occur, and then recheck the data using batch processing every 48 hours. “Often they’ll find a 1 or a 1.5 percent error that was not caught with stream processing,” he says.
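The hybrid approach Driscoll describes amounts to treating the periodic batch recount as the authoritative tally and measuring the stream layer against it. A hypothetical reconciliation, with made-up counts chosen to match the error rate he quotes:

```python
# The batch recount is authoritative; the gap between it and the
# stream layer's running count estimates how much data was dropped.
stream_count = 985    # events the stream layer managed to process
batch_count = 1000    # events found by the periodic batch recount

error_rate = (batch_count - stream_count) / batch_count
print(f"stream layer missed {error_rate:.1%} of events")
```

When the gap is found, the stream-side totals can be corrected from the batch results, which is why the two approaches complement rather than replace each other.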
And sudden bursts of data can temporarily handicap even the best stream processing tools. “Imagine walking into a store and instead of buying one thing, you buy 500 things in a span of a minute,” says Driscoll. “It’s pretty difficult to add up 500 things at once in your head as it happens. Stream systems can be a little bit fragile sometimes.”
But as long as there’s a need for real-time processing of data, companies should expect to see more stream processing tools enter the Hadoop marketplace.
Cindy Waxer, a contributing editor who covers workforce analytics and other topics for Data Informed, is a Toronto-based freelance journalist and a contributor to publications including The Economist and MIT Technology Review. She can be reached at firstname.lastname@example.org or via Twitter @Cwaxer.
Home page photo of cheetah by Hein Waschefort via Wikipedia.