We have watched Hadoop grow through two significant phases of maturity, defined by how companies first used it and what tools emerged to create a vibrant ecosystem in the process. Recently, we’ve started seeing another set of changes that clearly indicate that we are entering a third phase of Hadoop maturity – one that is more robust and characterized by new kinds of functionality and accessibility.
In the early days of Hadoop, it was a new tool that a few scattered groups explored for research projects. Users could run MapReduce and HBase, and early tools like Pig and Hive made Hadoop easier to use. During this initial phase, people still thought in terms of “writing jobs” and “will this job complete at all?” rather than in terms of applications, workflows, predictable run times, and operability.
When it became clear that Hadoop could provide real business value, departments started building workloads for business intelligence and reporting to extract meaningful insights. Suddenly, IT needed to care about things like predictable run times, running different kinds of workloads across a shared infrastructure (for example, running MapReduce and HBase together), efficiency and ROI, disaster recovery, and similar concerns that are typical of “real” IT projects.
As in the first phase, this second phase of Hadoop maturity was mirrored by continued growth of the Hadoop ecosystem as a whole. This is where innovations like YARN, Spark (for fast in-memory streaming and analytics), and Kudu entered the scene.
Enterprise Requirements for Hadoop
Today, it’s clear that we are entering a third phase of Hadoop maturity within enterprise environments. In this new phase, Hadoop is accessible to all business units, and we begin to see multi-departmental uses. IT organizations now must serve all of these business units, an arrangement that many call “Hadoop as a Service” or “Big Data as a Service.”
With this third phase comes a whole new set of requirements for Hadoop that are important to consider. For example, when multiple departments are using shared infrastructure, they demand SLAs – it simply doesn’t work if one group’s use of Hadoop slows down other projects beyond an acceptable limit. As Hadoop demands an increasing amount of a company’s IT capacity, it’s more important than ever that it be efficient. Couple that with the many departments and hundreds – or thousands – of users running jobs on the same cluster, and the operations group needs far more visibility into performance. It also becomes critical for IT to be able to allocate usage back to each department accurately.
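One common way operators make those SLAs concrete on a shared cluster is to carve capacity into per-department queues. The fragment below is an illustrative sketch of a YARN Capacity Scheduler configuration; the queue names and percentages are invented for this example. Each department gets a guaranteed share of the cluster, and a maximum-capacity cap keeps any one queue from monopolizing resources when it grows elastically.

```xml
<!-- capacity-scheduler.xml: illustrative per-department queues (names and
     percentages are hypothetical) -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>marketing,finance,research</value>
  </property>
  <!-- Guaranteed share for each queue, as a percent of the cluster -->
  <property>
    <name>yarn.scheduler.capacity.root.marketing.capacity</name>
    <value>40</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.finance.capacity</name>
    <value>40</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.research.capacity</name>
    <value>20</value>
  </property>
  <!-- Cap elastic growth so the research queue cannot starve the others -->
  <property>
    <name>yarn.scheduler.capacity.root.research.maximum-capacity</name>
    <value>50</value>
  </property>
</configuration>
```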
In addition, there are, of course, myriad other requirements that enterprises face once business units start sharing data and compute: granular access control, business continuity, regulatory compliance, and so on.
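Chargeback itself comes down to simple proportional arithmetic once per-job resource usage has been collected. The sketch below is a minimal illustration, assuming job records have already been reduced to (department, vcore-seconds) pairs; the department names, usage figures, and cost are invented for the example.

```python
# Illustrative chargeback sketch: split a cluster's cost across departments
# in proportion to the compute their jobs consumed. All figures are made up.

def chargeback(jobs, total_cost):
    """jobs: iterable of (department, vcore_seconds) pairs.

    Returns a dict mapping each department to its share of total_cost,
    proportional to its aggregate vcore-seconds.
    """
    usage = {}
    for dept, vcore_seconds in jobs:
        usage[dept] = usage.get(dept, 0) + vcore_seconds
    total = sum(usage.values())
    return {dept: total_cost * used / total for dept, used in usage.items()}

jobs = [
    ("marketing", 3_600),
    ("finance", 1_800),
    ("marketing", 1_800),
]
print(chargeback(jobs, 900.0))  # → {'marketing': 675.0, 'finance': 225.0}
```

In practice the usage records would come from the cluster's job history or metrics system, and memory-seconds or storage could be folded into the same proportional split.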
Although Hadoop has come a long way since it was introduced, it has not yet reached full maturity, and it won’t until it becomes fully ready for enterprise IT. This will require a number of additional elements. For example, we are seeing a blossoming of vendors trying to address different parts of Hadoop data security with tools similar to those that currently exist for databases and data warehouses. Beyond security, there are a number of things we still need to do to make Hadoop fully enterprise-grade. But, even more importantly, the potential for writing and running distributed applications on Hadoop has just begun to be realized.
Applications that don’t yet exist will one day run on Hadoop and become more consumable by the average business user, not just someone who knows SQL or wants to write code. Similar to what has happened with the emergence of tools that interface with databases (such as Salesforce, which runs on custom deployment software), non-technical line-of-business users will run applications on Hadoop without realizing it. In this way, it doesn’t matter how many phases Hadoop goes through or what new tools are built – it’s still going to be Hadoop at its core in the end, and the possibilities are endless.
Chad Carson is the cofounder of Pepperdata. At Microsoft, Yahoo, and Inktomi, Chad led teams using huge amounts of data building web-scale products, including social search at Bing and sponsored search ranking and optimization at Yahoo. Before getting into web search, Chad worked on computer vision and image retrieval, earning a Ph.D. in EECS from UC Berkeley. Chad also holds Bachelor’s degrees in History and Electrical Engineering from Rice.
Subscribe to Data Informed for the latest information and news on big data and analytics for the enterprise.