NEW YORK—Hadoop is a powerful technology that can store and sort unstructured data in a much more efficient way than the traditional enterprise data warehouse.
That doesn’t mean it’s the end of the data warehouse as we know it, according to David Jonker, SAP’s Director of Big Data. But it is time for a change in an enterprise’s thinking about how to blend Hadoop with other database technologies to take advantage of what each does well.
“The enterprise data warehouse isn’t dead,” Jonker said at the Strata/Hadoop World conference in New York City on Oct. 24. “But traditional data management approaches are dead.”
The new approach, Jonker said, should be a side-by-side framework in which data scientists run exploratory queries, form hypotheses and do research on the data stored in Hadoop, while business intelligence analysts get quick answers to everyday reporting questions using an in-memory system like SAP’s HANA.
“Today Hadoop isn’t ready for that kind of [interactive] environment,” Jonker said. “Hadoop and in-memory databases solve different but complementary problems.”
PayPal and its parent company eBay have created a blended Hadoop environment in which their data is stored in the cloud, according to Nagaraju Chayapathi, a data integration architect at PayPal.
As raw clickstream data comes in, it’s processed in Hadoop, where it goes through a “cleaning” phase. Hadoop spits out semi-structured data that PayPal can use in a series of predetermined BI and analytics projects, and stores it in the cloud where it can be accessed by PayPal’s employees all over the world. The company collects 14 terabytes of log data every day, and uses it for things like event analytics, sentiment analysis, real-time location-based offers, customer segmentation and scoring, and a recommendation engine.
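The article does not describe PayPal's cleaning code, so the following is only a minimal sketch of what such a "cleaning" step might look like. It assumes a hypothetical tab-separated log line (timestamp, user ID, URL, then optional key=value fields) and uses plain Python rather than any Hadoop API; the function names `clean_line` and `clean_stream` are illustrative, not PayPal's.

```python
import json


def clean_line(raw):
    """Turn one raw log line into a dict, or None if it is unusable.

    Assumed (hypothetical) layout: timestamp, user ID and URL separated
    by tabs, followed by any number of optional key=value fields.
    """
    parts = raw.rstrip("\r\n").split("\t")
    if len(parts) < 3 or not all(parts[:3]):
        return None  # drop truncated lines or lines with empty required fields
    record = {"ts": parts[0], "user_id": parts[1], "url": parts[2]}
    # Remaining fields are kept only if they look like key=value pairs.
    for extra in parts[3:]:
        key, sep, value = extra.partition("=")
        if sep and key:
            record[key] = value
    return record


def clean_stream(lines):
    """Yield one JSON string per usable input line, skipping bad ones."""
    for raw in lines:
        record = clean_line(raw)
        if record is not None:
            yield json.dumps(record, sort_keys=True)
```

Because each record keeps whatever extra fields it happens to carry, the output is semi-structured: every record has the same three required keys, but no fixed schema is imposed on the rest.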
“We need that information,” Chayapathi said. “It adds to the business, and we need that information quickly so we can respond quickly.”
Chayapathi said PayPal used to discard this data because it was too difficult to design a catch-all schema for it in a traditional database. Now the company runs everything through HBase and Hadoop, no matter what the format.
Moises Nascimento, PayPal’s director of data architecture, said the company has a close working relationship with Informatica, which helped it design and implement its 48-node Hadoop architecture.
But that architecture still stands next to PayPal’s Teradata relational database, because the millions of transactions processed through PayPal still need to be stored and analyzed.
“We’ll always need transactions, and structured data,” Nascimento said. “The system has run well for us. It’s working.”