Major data warehouse vendors have begun to market and ship Hadoop appliances, opening a bridgehead for the open source project and other new technologies to be a part of the enterprise data architecture of the future.
In the process, companies like IBM, Oracle and Teradata have had to move past their previous, very successful model of consolidated data storage. They’ve been pushed by customers who need the flexibility to deal with unstructured data and a fast iterative process for quicker results.
By embracing a new, more flexible and more modular architecture for data warehousing that includes Hadoop, the major vendors are starting to move into a market that until now has been populated by startups, niche players and open source initiatives. They are somewhat behind the curve—anyone interested in building their own Hadoop cluster has been able to for years—but the vendors bring credibility to Hadoop by trading on their trusted names in enterprise business settings.
Carl Olofson, an information management and data integration research analyst at IDC, said that by going the appliance route, companies like IBM, Oracle and Teradata assume the risk inherent in trying new technologies.
“They aren’t too late, because in reality business managers are conservative when it comes to these things. They don’t just jump on to these things because they’re new and cool; they actually have to have everything worked out and they have to have confidence in the vendors,” Olofson said. “Business people are looking for guidance on what they can use Hadoop for, and how they can use it as risk-free as possible.”
In October 2012, Teradata announced its Aster Big Data Analytics Appliance, featuring Hortonworks’ Hadoop distribution. At the Hadoop Summit in San Francisco in June, Teradata announced three more Hadoop offerings, ranging from a more straightforward Hadoop appliance to downloadable database software backed by Teradata and Hortonworks service and training.
Oracle and IBM take a similar approach to the market; both offer downloadable software to connect their other database and analytics products to data stored in Hadoop. Oracle offers customers the Cloudera distribution, while IBM ships its own flavor of Hadoop. Both companies also have turnkey Hadoop appliances. Oracle started shipping its Big Data Appliance for general availability in October 2012, and IBM is set to release its PureData for Hadoop appliance this year.
A Shift in Strategy
There are other companies with Hadoop appliances: EMC has the Greenplum HD Data Computing Appliance, and now its own Hadoop distribution called Pivotal HD. DataDirect Networks has a storage-focused appliance.
But for these traditional giants of the data warehouse, the break from the all-in-one database approach to a modular model is a shift in strategy. Just five years ago, according to Nancy Kopp, IBM’s director of the PureData appliances, the top database vendors were preaching a doctrine of consolidation, instructing customers to store every last drib and drab of data in a massive enterprise data warehouse.
“We had a vision where we could consolidate everything into one massive database where we could govern it and manage it,” Kopp said. “The challenge was that by doing so, we made it extremely complicated to manage that environment.”
Users all needed the same, single technology to do several different things, often all at the same time. Kopp, whose background is in business intelligence, said she realized that all her presentations to customers were more about workload management instead of getting value from their data. “We moved so far away from agile, it wasn’t even funny,” she said.
There are now several Hadoop appliances on the market:

DataDirect Networks: hScaler Appliance, running Apache Hadoop
EMC/Greenplum: Greenplum Data Computing Appliance, running Pivotal HD (EMC)
IBM: PureData for Hadoop, running IBM’s own distribution
Oracle: Big Data Appliance, running Cloudera
Teradata: Appliance for Hadoop, running Hortonworks
VMware: Project Serengeti, which works with several distributions

Source: vendors’ information.
Kopp and her colleagues at IBM realized that the consolidated enterprise data warehouse just wasn’t working. Its core purpose, recording and analyzing an enterprise’s historical transaction data, will likely always be necessary. But with the big data explosion—that is, lots of different kinds of data coming from lots of different sources—it became apparent that the EDW couldn’t handle an increasing amount of unstructured and semi-structured data.
That same realization was happening at Oracle and Teradata. The massive relational databases were encountering workloads that weren’t well suited for the consolidated environment.
Olofson said customers were demanding technologies that add flexibility to their warehouse products and serve as an ingestion point for the unstructured data that confounds rigid schema systems.
To take advantage of some of the exciting new data sources, like streaming data from sensors, social media or Web use logs, or to make better use of older sources, customers were taking data out of the data warehouse, and that made the EDW vendors take notice.
“There have always been times when people have used relational databases for things they’re not good at doing, and now they can use other technologies instead,” Olofson said. “The old relational database set-up where anytime you want to make a change to a table it’s a two-week project … that had to change.”
An Opening for Hadoop
The poster child for this big data phenomenon has been Hadoop, a horizontally scalable storage and compute cluster system that operates without a schema. The system stores and recalls data differently from traditional relational databases, so data can be added to the system without major changes and recalled quickly for analysis.
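The schema-less model described above is often called “schema-on-read”: raw records land in the cluster as-is, and structure is imposed only when the data is read for analysis. A minimal sketch of the idea, using hypothetical log lines and field names (not any vendor’s actual format):

```python
# Schema-on-read sketch: raw, heterogeneous records are stored as-is,
# and structure is imposed only at query time. The log format and field
# names here are hypothetical, for illustration only.
import json

raw_records = [
    "2013-06-12T08:01:44 login user=alice",
    "2013-06-12T08:02:10 click user=alice page=/home",
    '{"sensor": "t-17", "temp_c": 21.5}',      # a different source entirely
]

def read_as_events(records):
    """Parse each raw line into a dict at read time; no upfront schema."""
    events = []
    for line in records:
        if line.startswith("{"):
            events.append(json.loads(line))    # semi-structured JSON record
        else:
            ts, action, *pairs = line.split()  # space-delimited log record
            event = {"ts": ts, "action": action}
            event.update(p.split("=", 1) for p in pairs)
            events.append(event)
    return events

events = read_as_events(raw_records)
print(events[1]["page"])   # structure exists only after the read
```

A new data source can be added to `raw_records` at any time; in a relational warehouse, the same change would mean altering tables first.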
Each of the new Hadoop appliances shares several features and is meant to operate alongside the data warehouse. Each offers some type of SQL query layer on top of the Hadoop distributed file system to make the data stored in the Hadoop cluster available to a broader set of business analysts. Teradata’s is called SQL-H, Oracle connects through its SQL connectors, and IBM through BigSQL.
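The general pattern behind these SQL layers is projecting a table over files so analysts can query them with familiar SQL instead of writing Java. As a rough, vendor-neutral sketch, here `sqlite3` stands in for the SQL layer and a Python list stands in for records sitting in the cluster; the table and fields are hypothetical:

```python
# The SQL-on-Hadoop pattern in miniature: records sit in a cluster's file
# system, and a SQL layer (SQL-H, BigSQL, etc.) projects a table over them
# at query time. sqlite3 is a stand-in for that layer, not a real connector.
import sqlite3

cluster_records = [        # hypothetical hits sitting in the landing zone
    ("alice", "/home", 42),
    ("bob", "/pricing", 7),
    ("alice", "/docs", 12),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (user TEXT, page TEXT, ms INTEGER)")
conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", cluster_records)

# An analyst can now use plain SQL instead of writing MapReduce code.
rows = conn.execute(
    "SELECT user, COUNT(*) AS n FROM hits GROUP BY user ORDER BY n DESC"
).fetchall()
print(rows)   # [('alice', 2), ('bob', 1)]
```

The point of the sketch is the division of labor: the storage layer holds raw records cheaply, and the SQL layer makes them reachable by ordinary business-intelligence skills.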
George Lumpkin, Oracle’s vice president of product management for big data and data warehousing, said Oracle is on its second version of SQL connectors for its Oracle Big Data Appliance. Oracle was first to market, shipping the appliance at the beginning of 2012, he said.
Those SQL connectors also connect the Hadoop appliance to other products, like business intelligence tools, event processing tools or Oracle’s NoSQL database. Lumpkin said Oracle is still in the early phases of adapting its warehouse environments to big data. Hadoop is a piece of that environment, but just that.
“A big data solution is not a single product; big data is not Hadoop,” Lumpkin said. “For the types of problems that businesses want to solve, it’s going to be an architecture. That architecture will extend from the current data warehouse, add new technology, and then add new data sources.”
Right now, the Oracle architecture still moves a lot of data from system to system to be processed in specific environments. It works, Lumpkin said, but Oracle is trying to make it work better.
“Moving data is bad, and we do have environments where we do perhaps need to move data more than we want to,” Lumpkin said. “In the coming years, that will change and get much better, but today we focus on optimizing the data movement, and then reducing the data movement when necessary.”
IBM is operating on a similar premise, moving from the consolidated architecture to what it’s calling a “zone architecture,” Kopp said. The architecture is much more modular, meaning instead of one large data repository, data can be stored and analyzed in smaller, more specialized systems that are built for specific functions. IBM’s PureData system for Hadoop is scheduled to ship for general availability in the second half of 2013.
The goal, she said, is to put “the right workload in the right place for the right cost.”
More Targeted Uses for the Data Warehouse
Both Lumpkin and Kopp said a common use case for Hadoop is as a landing zone for unstructured data, which can be processed and then shipped to the data warehouse for analysis. Some analysis can be done in the landing zone, or “data factory” as Lumpkin calls it, if the process is suited to the batch environment of Hadoop’s MapReduce framework.
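That kind of landing-zone batch job can be sketched as a MapReduce pass. The sketch below mimics the Hadoop Streaming convention (a mapper emits key/value pairs, the framework sorts them, a reducer aggregates runs of equal keys) but runs the phases locally in plain Python; the log fields are hypothetical:

```python
# A minimal MapReduce-style batch job in the Hadoop Streaming idiom,
# simulated locally. On a real cluster the mapper and reducer would be
# separate scripts fed via stdin, with the framework doing the sort.
from itertools import groupby

def mapper(lines):
    """Emit (page, 1) for every hit in a hypothetical raw web log."""
    for line in lines:
        parts = line.split()
        if len(parts) == 3:            # drop malformed landing-zone records
            _, _, page = parts
            yield page, 1

def reducer(pairs):
    """Sum counts per page; pairs arrive sorted by key, as in Hadoop."""
    for page, group in groupby(pairs, key=lambda kv: kv[0]):
        yield page, sum(count for _, count in group)

raw_log = [
    "2013-06-12 alice /home",
    "2013-06-12 bob /pricing",
    "2013-06-12 alice /home",
    "corrupt-line",                    # filtered out by the mapper
]

counts = dict(reducer(sorted(mapper(raw_log))))
print(counts)   # prints {'/home': 2, '/pricing': 1}
```

Only the small aggregated result would then move on to the warehouse, which is the “pre-screening” role Kopp describes below.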
Kopp said another good use case for the appliance is “behind the warehouse” as an archival system, taking advantage of the relatively low cost of a Hadoop-based system and making “cold storage” still accessible to analysts if needed. By placing the data in a low-cost Hadoop cluster, data can still be recalled and isn’t shelved on physical media that isn’t readily accessible.
“We’re starting to see a lot more of these online or active archive capabilities that people are using to offload all the cold data from the warehouse, which makes a lot of sense,” Kopp said.
So instead of loading everything into the warehouse and keeping it there, businesses can pre-screen data in Hadoop to reduce what goes in, and then offload seldom used data but still keep it accessible. This frees up some computing power in the warehouse, Kopp said.
“The warehouse is becoming a lot more targeted and nimble in what it’s doing,” she said.
Hadoop is also capable of data analysis, although because it requires Java programming skills, it can be a more complicated proposition, even with SQL connectors. With the famed skills shortage in big data analysis, technology companies of every shape and size are trying to build tools that make querying and building business applications on Hadoop easier. IBM, Teradata and Oracle are tackling this in their appliances by building in analytical function accelerators as aids.
“Everybody talks about these analytics functions; it’s kind of like the great new analytics arms race. How many analytics functions do you have in your box?” Kopp said, adding: “What’s challenging is how you actually pull them together and make them useful. So what the accelerators do is they group these functions to provide a platform to accelerate specific kinds of capabilities.”
Varied Paths to Hadoop Integration
The big three data warehouse vendors take different approaches to integrating their appliances with Hadoop. Teradata touts an engineering partnership with Hortonworks. Oracle uses a stock Cloudera Hadoop distribution, and IBM offers its own distribution.
Teradata’s Aster Big Data Analytics Appliance features more than 70 analytic functions; its version of SQL for Hadoop, called SQL-H; connectors to Teradata’s other systems; and software bundles from partners like Informatica, Revelytix and Protegrity for data integration, metadata management and security. Teradata also offers an Appliance for Hadoop, which doesn’t include all the analytic functions.
Teradata and Hortonworks have also worked with Dell to offer a preconfigured commodity hardware stack for Hadoop without many of the added features of the two appliances, and Teradata will offer its Hortonworks-engineered software for download, for companies that want to buy their own hardware.
Each of the four offerings includes SQL-H, and perhaps more importantly, full service and support from Teradata and Hortonworks to help companies work their way through the difficulties of standing up their own system.
“We want to make sure we can provide the technology in the right form or fashion for them,” said Steve Wooledge, Teradata’s vice president of marketing for unified data architecture.
Wooledge said Teradata’s strategy of offering more than just turnkey appliances aims to reach businesses that haven’t traditionally bought its products.
“This is about growth, this is about incorporating Hadoop as part of our enterprise approach to data management and analytics,” Wooledge said. “There are Teradata customers that are doing commodity clusters, but by and large, they are primarily looking at the appliance offering. This is a new segment for us in a lot of ways.”
John Kreisa, Hortonworks’ vice president of strategic marketing, said the many flavors of Teradata’s Hadoop offerings help his company continue its mission of promoting enterprise Hadoop adoption. “This massively expands the ability for Hadoop to get out there into the market, and make this technology successful and help it reach its full potential,” Kreisa said.
Traditional Relational Databases in Need of TLC
Olofson said where the traditional enterprise data warehouse vendors have to get to work is rebuilding the relational databases that are still crucial to business operations. A new class of database companies is using in-memory technology and other approaches to make the relational database more elastic and flexible, and therefore better suited to big data analytics.
Amazon is trying to disrupt the market with its cloud-based data warehouse offering Redshift, which is priced to undercut traditional data warehousing. VMware also offers a virtual Hadoop environment called Serengeti, a private cloud-based analytics platform.
Just like with Hadoop, enterprises looking at new database technologies are going to go with companies they trust, Olofson said. But if a NewSQL company can prove its staying power, it could shake up the enterprise data warehouse market.
“These vendors are so new that most established enterprises are not going to place a big bet on them because they don’t know if they’re going to be around in a few years,” Olofson said. “That gives people like Teradata some time. At the end of the day, what business users want is a safe, well managed environment in which they can solve business problems.”