The first Hadoop cluster went into production in January 2006 and spent the next decade ushering in the era of big data analytics, fundamentally changing every industry and organization that values knowledge and insight. It quickly attracted an active and devoted community of open-source contributors, and even large enterprises that had once turned up their noses took notice.
In a conversation with Data Informed, Hadoop co-creator Doug Cutting, now Chief Architect at Cloudera, recalled the origins and pondered the future of the transformational data processing platform named for an elephant that lives in his sock drawer.
Data Informed: Could you have imagined a decade ago that Hadoop would become as widely adopted and important to industry as it is?
Doug Cutting: No, it wasn’t on my mind at all a decade ago. What was on my mind then was trying to build an open-source project that would survive. My goal was to work on software that would get used and keep being used, ideally. I learned through [the Apache] Lucene [project] that open-source was a great way to do this. It almost gave you an unfair advantage toward adoption. People would adopt it very readily and use the heck out of it because they didn’t have to pay anything, and they could even help fix it when they had problems.
The software we had in [the Apache] Nutch [project] wasn’t to the point that anyone could pick it up, easily use it, and see the value. It was pretty raw stuff and it needed help. That is why I joined Yahoo 10 years ago, renamed the core components that needed this work Hadoop, and tried to build the robustness of the software so that it would attract a community. It’s a bit of a chicken-and-egg problem, one that Yahoo helped us solve in 2006 and 2007, when we really made the software robust enough that other folks could get involved.
By 2008, 2009, we clearly had something that was going to succeed, was helping people, and was going to be a project with a life of its own. And that’s as far as I had imagined 10 years ago. So by that measure, we succeeded.
What are your thoughts about what this platform has become? What surprises you most about it?
Cutting: I think the surprising part is the cultural part. There was a culture of enterprise software, and people would only trust things that came from very establishment companies, the IBM database, the Oracle database, the Microsoft database. And I had always worked on this flaky, fringy software that didn’t have anything to do with that tradition. For the things I worked on, we didn’t use relational database software, nor did we expect people in the enterprise ever to use the software we worked on. To me, the biggest surprise is that, to a large degree, those two communities have merged. Big banks, insurance companies, railways, and retailers now accept that open source is a valuable source for technology and they are willing to bring it in-house. And the open-source community is now respecting big enterprises as a valuable destination and collaborating with these folks and helping deliver products that meet their needs, taking security and reliability much more seriously. That change I didn’t see coming, that these two communities of software development have come to accept one another and are now, to some degree, merged.
Cutting: It’s hard to declare that definitively. My standard answer is, “It’s an adolescent.” But, more realistically, as long as software is evolving, it’s alive. If software stops evolving, it’s dead. It’s now – what’s the polite word we use for old, dead software? Legacy. And I think it’s a long way from being legacy, although there are components of Hadoop that are already becoming legacy. MapReduce is on its way to being legacy. We’ve got Spark and new things that are replacing it. And I think we are going to continue to see that for a long time. This is the new world of enterprise software: a loose confederation of open-source projects, with some standards – either APIs or file formats or so on – to integrate them, so that you can replace components and add new ones to the stack, and the stack itself can evolve, without a real body that can apply the brakes and stop this evolution. In the past, you had these big companies that controlled these core platforms and sold them, and they had very little incentive to develop a very different platform, to evolve that platform fundamentally. Now the platform really is controlled by the community of users. Innovators develop new products that are essentially conjectures that the market might adopt. Then we see what gets adopted, and that becomes the new standard part of the platform. Any particular component may not last forever – some might last a long time, some might not – but this new style of creating software for enterprises and for all sorts of industries is, I think, here to stay.
Do you think that Hadoop is being utilized to its full potential? Are there uses you think Hadoop is well suited for to which it isn’t being widely applied?
Cutting: We are in the early days. Most institutions are just getting their feet wet, and it takes them a long time to get up to speed. Hadoop has been slow to add the security features that people need in various industries, the ease-of-use features, the integration with their existing tool sets so they can import data. All those things are prerequisites for a lot of the latent applications of the stack that are out there. There’s a lot of data generated out there, and the quantities are growing tremendously. The raw material, and the value of that material to organizations to improve their business and better compete, is incredible. Put that together and there’s a lot of potential for growth. We are now seeing certain applications start to take off pretty regularly around the world. Just about every bank needs to do better risk detection and management, other technologies really are unsuitable for it, and so they are in the process of deploying Hadoop-based solutions to manage their risk. Telecoms face churn challenges – customers are switching – and other industries have churn issues as well; this platform is incredibly good at addressing that problem, so they are rolling it out. There are other problems in these industries and others that aren’t really being addressed yet. Manufacturers do a lot of real-time monitoring of their production, but also of their products in the field. You have heavy-equipment manufacturers like Caterpillar sending data in real time from all their equipment in the field – working in a mine in Australia, streaming back to Peoria, IL – trying to predict when a truck is going to fail and get that maintenance scheduled proactively. That’s a big advantage for them, and I think we are going to see that sort of thing spread across those industries. And there are more of these [applications] to come. People are imaginative, and given the tools, the data, and the know-how, they are going to come up with more applications like this.
Which projects do you see as being the next big thing in the Hadoop ecosystem?
Cutting: If I knew with any certainty, I’d love to tell you. It’s hard to know what’s going to strike the right nerve, what people are going to find useful. I think Kudu is very exciting: a new storage engine that offers the low-latency, random-access capabilities that HDFS doesn’t, while still permitting the fast analytics you can do on flat files in HDFS. I think that’s going to be incredibly popular. It’s being embraced by a lot of different open-source projects already.
Continuing unification of security is really required. You want to be able to set capabilities and visibility of data in one system and have that respected as the data travels across to other systems. We launched a new thing called RecordService, which addresses this: it gives you a consistent view of data, managing access with a single set of ACLs across systems. That sort of thing is something we are going to focus on more and more. The management and glue layers – storage, formats like Avro and Parquet – aren’t necessarily the exciting, sexy stuff that people want to hear about, but they are the important things that make the ecosystem work. You need to be able to move things between components to extract the value. If you wanted a single, standalone system, you wouldn’t use this ecosystem. You’d probably use some past-generation technology. The real promise here is to be able to bring all data together into one system and explore it with a variety of tools that are going to give you a lot more power.
What do you think lies ahead for Hadoop over the next several years? What does the next decade hold for Hadoop?
Cutting: Some of the exciting trends we are seeing are in deployment: the packaging of software into containers is going to really help accelerate a lot of technology adoption, because you can much more easily deploy different tools. The cloud is really a place where more and more people are moving. It doesn’t seem new, but the majority of our customers are still running on premises. So really supporting a full stack in the cloud that can be managed and can take advantage of the elasticity there is something we are really focused on. And providing a vendor-neutral cloud – so you don’t have to be locked into an Amazon or a Microsoft or a Google, but retain the option to move your data and operations – we think is important.
I have a hard time predicting what the next hot software project will be, but it’s pretty clear what the next hot hardware area is: memory technologies that give you orders-of-magnitude faster access to persistent storage and, combined with that, hardware that lets you access that memory over a network without involving the remote CPU, so basically every machine in a cluster can have microsecond-level access to all the memory in that cluster. And that’s going to be a game changer. All the systems we have today are designed around the performance characteristics of existing hard drives and existing RAM. And when you change something that fundamental, the software systems are going to need to evolve. And maybe we’ll see some entirely new database systems built that really take advantage of this. So that’s a big area to watch.
What inspired naming Hadoop after the toy elephant?
Cutting: I had been in the software business for many years and I knew that naming things is really important. One of the most important things is that you have something that’s meaningless, because your mission might change and you don’t want to be stuck with a name that doesn’t match what you are doing. A descriptive name actually works against you. You want your project to provide the meaning to it. But you also need something that’s easy to pronounce and easy to spell. And we had seen various other names coined by children. So when my kid gave this elephant this name, I thought, “Wow, that would be a good software project name.” It’s cute, and it even comes with a mascot, because you also have to come up with a logo and a mascot for a project. Oftentimes, this takes a long time and there are a lot of things to argue about, and I thought, “Here’s a package deal: I have a yellow elephant with a name. The next time I need a project name, I am going to bring this out,” and we can avoid a lot of those arguments. That was around 2003. I sat on it for three years, and when, in 2006, we pulled what became HDFS and MapReduce out of Nutch into a project of their own, we needed a name. I said, “I got one.”