What began as the data management brainchild and initial project from the likes of Google and Yahoo has turned into the data management catalyst for the entire big data era. We are talking, of course, about the Apache Hadoop project. One can’t think of Hadoop as a single open source project, however, for there are several related Apache projects and Hadoop sub-projects that, in a sense, make up the open source totality of Hadoop. The number and scope of those projects, and the reach of open source Hadoop, seem to expand continually.
The commercial ecosystem that has grown up around Hadoop, however, has expanded even more astoundingly than the open source project that started it. The number of vendors that now offer some kind of Hadoop connector, BI and analytics software platform based on Hadoop, analytic apps leveraging Hadoop, specialty analytics plug-ins for Hadoop, hardware and networking appliances and products designed for Hadoop, professional services for Hadoop, and Hadoop cloud services runs well into the hundreds, if not the thousands. The implications of such a movement for IT and line-of-business personnel are considerable, and it’s therefore important to understand where Hadoop is going.
Just as the supply side has bet heavily on Hadoop, customers have picked up the Hadoop baton to a large degree. There are all kinds of studies, surveys, and estimates out there, but I would offer a summary guesstimate that in the Global 2000 about 25 percent of organizations are doing something with Hadoop that is production-oriented, another 25 percent are playing with it, and about half really have not delved into it much yet, if at all. For such a wide-impacting set of open source-based technologies and services, that progress with customers ranks as very impressive uptake. But what is next?
To determine Hadoop’s future, it is important to look through both the open source lens and the commercial lens. In Hadoop’s early days, the open source community pushed the commercial side. Now the commercial side of Hadoop has gained so much momentum that it pulls the open source community.
I spoke with the CEO of MapR, John Schroeder, whose company offered one of the first commercial Hadoop distributions and so has roots very close to the open source project. At the same time, however, MapR has arguably run the hardest and fastest to make Hadoop plus its own technologies palatable for the most demanding of commercial enterprises, such as banks. Let’s consider Hadoop’s future, including Schroeder’s point of view, mixing the open source and commercial perspectives.
Next for Hadoop: Enterprise-Class and the Data Lake
Based on the primary focal points coming out of the Hadoop Summit in San Jose in late June, several areas of open source Hadoop reflect its enterprise-class maturation and momentum, such as:
• Development: YARN is the name for the new version of Hadoop’s cluster management capabilities, which have experienced “a complete overhaul.” YARN primarily shifts Hadoop from a single batch-processing model to an abstracted plug-in design that handles many more workloads, including batch, interactive, search, online and streaming. In a sense it opens up the Pandora’s box of use cases to MapReduce and beyond. Schroeder opined, “YARN is a significant enhancement and we will integrate YARN into all the major distributions, but ultimately customers want an even more general purpose abstraction layer to support SQL access, various databases such as Riak and Cassandra, and document models.” When asked for more specifics, Schroeder added, “We should be supporting self-describing objects like JSON, with APIs for SQL, MongoDB, other file systems, plus HBase and MapReduce.”
• System management, security and governance: Apache Ambari brings Web UI front ends for systems management, built on RESTful APIs, to Hadoop for the first time. (RESTful refers to representational state transfer, an architectural style for Web services.) The idea here is to expand the management software aspects of the commercial ecosystem to make it easier to provision, track, monitor and manage Hadoop. Other related Apache projects aim to make Hadoop more palatable to CISOs and auditors.
When I asked Schroeder about issues like security and infrastructure he was firm: “We recently spent time with several Fortune 100 and Web 2.0 companies, and what we heard was loud and clear: They want to deploy on an internal cloud, so they need multi-tenancy.”
He continued: “The idea is they want to span the organization, from enterprise to department to role or domain specific use cases and apps. They need to tie it into service levels. So in order for the data lake to reach fruition, it will need to deliver enterprise-class SLAs and security. That is one of the reasons our most recent release included a big security upgrade. We also need to do a better job on Hadoop with virtualization. It is tricky because the ‘data plus compute’ design of Hadoop is fundamental to Hadoop’s value proposition, but it is difficult to virtualize. Perhaps we could add some data locality intelligence to Hadoop; we are already doing something like that with VMware.”
• Data lakes and the future of the data warehouse: One idea that has caught fire is a Hadoop-based “data lake.” The idea is to use Hadoop to create a next-generation data warehouse. The data lake would include large volumes of semi-structured and structured data, and would run on the more modern, globally distributed infrastructure with which older-generation warehouses struggle.
Schroeder linked all of these latest Hadoop concepts together into an even bigger vision for Hadoop: “On a grand scale, the data lake is where we are going, or call it a data platform. The vision for MapR is not really about the next gen data warehouse platform but about the next data platform. The scope of the data lake is far larger than analytics using Hadoop. For example, if you were to compare the original Teradata data warehouse to the Oracle database, the Oracle database is more general purpose. The data lake concept may be used for BI and analytics, but also for operational purposes, for table stores, for Blobstores and it will even support transactional semantics.” (Blobstore refers to an API that allows applications to serve large data objects.)
• The need for more sophisticated metadata management: Moving from the strategic data lake to the more tactical here-and-now of Hadoop in a big data analytics context, Schroeder said he feels that Hadoop could do a better job with metadata.
“When I was at BRIO in 1995 we were building OLAP cubes that understood metadata,” he said. “If you look at some of the similar tools available running on top of Hadoop, they aren’t that sophisticated, and that is because Hadoop is not offering full metadata management. This would open up Hadoop to more SQL-oriented developers. We realize there is a large population of SQL programmers and familiar SQL products that need access to Hadoop. The SQL layer is only one component of the solution and the metadata is equally important. MapR supports HCatalog as a Hadoop data dictionary, but we also see great value in document model databases and self-describing objects such as JSON. Marrying those data sources with columnar functions and search for unstructured is the solution for the next decade.”
So what is next for Hadoop? Clearly the idea is to continue to make Hadoop more comfortable for enterprises to use, in terms of security, management and particularly new application use cases. The concept of moving Hadoop from a pure BI and analytics data platform to the more general use case of the “data lake” is arguably the most important takeaway for IT and line-of-business personnel to consider. But Hadoop-based “data lakes” will never come into being if the fundamentals of performance, management, security and developer reach are not met. As Schroeder summarized, “Supporting things is great, but having them work is even better.”
Evan Quinn is founder and principal analyst at Quinnsight Research, covering big data, business intelligence, analytics databases, integration and data-as-a-service. Reach him at firstname.lastname@example.org. Follow him on Twitter: @evanquinn.