Software developed in the open source community earns its way into enterprises when it becomes easier to use, more reliable, and a better fit with incumbent technologies. Hortonworks, supplier of an Apache Hadoop distribution, has another item for the list — remaining thoroughly open source.
The Yahoo spinoff recently released its Hadoop package, Hortonworks Data Platform (HDP) 1.0. Hadoop is open source software for batch processing huge amounts of data — typically unstructured or multi-structured — on clusters of commodity hardware. Beyond the core Hadoop stack, HDP 1.0 includes open source software for managing Hadoop (Apache Ambari), pulling data into Hadoop (Talend Open Studio) and accessing data within Hadoop (Apache HCatalog).
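Hadoop's core programming model, MapReduce, splits a batch job into a map phase that emits key-value pairs, a shuffle that groups values by key across the cluster, and a reduce phase that aggregates each group. A toy sketch of that model in plain Python (not Hadoop's actual Java API, and with no distribution or fault tolerance) shows the shape of a classic word-count job:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # → 3
```

Hadoop's value is in running the map and reduce steps in parallel across commodity machines over data far too large for one node; the logic per key, however, is as simple as above.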
As big data fever sweeps the corporate world, enterprise IT shops are looking to Hadoop. Several companies are aiming to replicate the Linux model and make a business of offering enterprise-class Hadoop. Notable among these are Cloudera, which boasts Hadoop inventor Doug Cutting as its chief architect, and EMC’s Greenplum, which has the marketing and R&D power of the world’s top storage vendor behind it.
Hortonworks employs many former Yahoo staffers who worked on the Hadoop project with Cutting. The company is making the most of its open source bona fides. “Everything in our product is open source,” said Hortonworks CTO Eric Baldeschweiler. “There are no holdbacks or proprietary components.”
This approach has helped Hortonworks sign up some heavyweight partners: Microsoft and Teradata. Microsoft tapped Hortonworks to bring Hadoop to Windows Server and to Windows Azure, Microsoft’s cloud offering.
Teradata has teamed up with Hortonworks to integrate Hadoop with the company’s Teradata Aster analytics suite. Teradata acquired Aster Data last year.
A significant share of the world’s data will end up being stored, preprocessed and transformed in Hadoop, said Tasso Argyros, vice president of marketing and product management for Teradata Aster. “Our goal is to become a significant part of the Hadoop ecosystem,” he said. Teradata Aster aims to provide business analysts with innovative ways to analyze data stored in the Hadoop Distributed File System (HDFS), he said. “Hortonworks’ contributions to Hadoop are extremely valuable and in sync with our view of the Hadoop ecosystem.”
Key for Teradata was Hortonworks’ inclusion of HCatalog, a new open source metadata and storage management service, said Argyros. HCatalog makes it easier for Teradata Aster’s software to access data in Hadoop and interoperate with other projects within the Hadoop ecosystem, he said. This will help Teradata work with customers like eBay to develop new algorithms for mining data stored within Hadoop, he said.
Teradata also works with Cloudera. Cloudera links Hadoop to Teradata’s data warehouse systems so that data can be transferred between structured and unstructured data stores.
Reliability is a critical issue for enterprises, and Hadoop is not yet high-availability software. Version 2.0 of Hadoop addresses the vulnerability of a core component, the NameNode, which maintains the directory of all files stored in Hadoop. But Hortonworks has not deemed Hadoop 2.0 ready for prime time, so the company relies on an interim fix: it uses VMware’s vSphere virtualization software to eliminate the NameNode as a single point of failure, allowing the NameNode to fail over to a standby copy.
The vSphere integration is a workaround, said Tony Baer, a principal analyst at market research firm Ovum. “The technology for making Hadoop virtualization-friendly has yet to emerge, which means that, for now, developers will have to integrate Hadoop to vSphere manually,” he said. “High availability will be an ongoing saga for Hadoop.”
Even with Hadoop 2.0, client components could see interruptions of a minute or more, said Baldeschweiler. “All of the client software above HDFS needs to be made resilient to those kinds of interruptions.”
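The resilience Baldeschweiler describes is commonly built with retry logic: rather than surfacing an error the moment a call fails, a client retries with growing delays so that a brief outage, such as a NameNode failover, passes unnoticed. A minimal sketch of that general pattern in Python follows; the flaky service here is simulated, not any real HDFS client API:

```python
import time

def call_with_retry(operation, max_attempts=5, base_delay=1.0):
    """Retry a failure-prone call with exponential backoff.

    `operation` is any zero-argument callable; delays grow as
    base_delay * 1, 2, 4, ... so a transient interruption of the
    backing service does not surface as a client-side error.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except IOError:
            if attempt == max_attempts - 1:
                raise  # the interruption outlasted our retries
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky service: fails twice, then succeeds.
state = {"calls": 0}
def flaky_read():
    state["calls"] += 1
    if state["calls"] < 3:
        raise IOError("service temporarily unavailable")
    return "data"

result = call_with_retry(flaky_read, base_delay=0.01)
print(result)  # → data, after two retried failures
```

With minute-long interruptions, the backoff schedule would be tuned far longer than this example's; the point is that every layer above HDFS needs some equivalent of this loop.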
High availability promises to become even more important as Hadoop evolves to meet the needs of corporate users. According to a survey that Capgemini and The Economist released in June, 84 percent of North American executives surveyed said that the biggest issue is being able to analyze and act on data in real time.
Many enterprises that use Hadoop today are simply offloading work that they could do in traditional data systems except for the issue of scale, said Baldeschweiler. “They understand what they want to do; it’s fairly mechanical. It’s just that their system is breaking or their pocketbook is breaking,” he said.
Hadoop has typically been used to crunch tremendous amounts of data about customers and their activities, said Baldeschweiler. “You classify a set of user interests, and then you hand off that problem to a different system — maybe a database — to handle the real-time interaction.” That’s changing, he said. “Now what we’re seeing is more and more of that [real-time] serving workload is working its way into Hadoop.”
Beyond high availability, Hadoop in general is a work in progress.
In a recent blog post, Ovum’s Baer said that Hadoop has to conform to the enterprise to gain acceptance. “Hadoop won’t cross over to the enterprise if it has to be treated as some special island,” he wrote. In order to make it in the enterprise, Hadoop must “mesh with the practices and technology approaches that enterprises are using to manage their data centers or cloud deployments,” Baer wrote.
“Hadoop is young,” said Hortonworks’ Baldeschweiler. “It’s got plenty of rough edges, which is, of course, why there’s plenty of room for distribution companies like Hortonworks to get in there and help,” he said.
Eric Smalley is a freelance writer in Boston. He is a regular contributor to Wired.com. Follow him on Twitter at @ericsmalley.