Two and a half quintillion bytes: that’s the amount of data generated every day across the globe. It’s a staggering number. Companies everywhere are grappling with data that’s bigger, faster, and more varied than anything they have encountered before. And many of them are increasingly relying on Hadoop to help manage this data and implement large-scale, business-critical data projects.
Hadoop has proven itself to be a critical piece of organizations’ data infrastructure across numerous industries, from retail to banking to social media and beyond. However, as companies expand their use of Hadoop for production applications, Hadoop’s unpredictability becomes an increasingly critical business issue. Currently, Hadoop lacks several capabilities that are essential for an enterprise-ready platform.
As one of the earliest adopters of Hadoop (I managed the Yahoo Search Technology team, the first production user of Hadoop in the world), I’m all too familiar with the limitations of Hadoop that affect businesses. Its most common operational challenges frequently create serious, and costly, obstacles for organizations.
Hadoop is an effective platform for processing massive volumes of data with unprecedented speed on low-cost commodity hardware, but it has some notable limitations. For example, Hadoop can usually ensure that a data job completes, but it cannot guarantee when the job will finish. Hadoop jobs often take longer to run than anticipated, making it risky to depend on their output in production applications. While a critical production job is running, other, lower-priority jobs can swallow the cluster’s hardware resources, such as disk I/O and network bandwidth, creating serious resource contention that can ultimately prevent critical production jobs from completing safely and on time.
In addition, the Hadoop ecosystem quite simply lacks the visibility and control that operators and administrators need to diagnose, fix, and prevent these problems. Because Hadoop does not continuously monitor and control each job’s use of key hardware resources, a single, low-priority job can dominate the cluster, slowing down more critical production jobs and causing missed SLAs. To deal with this problem, operators often end up building Hadoop clusters that are bigger than they need to be, or even maintaining separate clusters for critical workloads, and that extra capacity goes unused most of the time.
Another common workaround is just as problematic: because organizations and operators are worried about low-priority jobs interfering with high-priority ones, they often place strict limitations on users who want to run new jobs on the cluster. This can delay, or even eliminate, the real benefits that companies stand to gain from using data in their businesses.
A Hadoop Wish List
Individual Hadoop practitioners likely have their own wish lists that are specific to their company’s unique circumstances, but some desires are common to the majority of Hadoop administrators. For instance, all Hadoop administrators would benefit tremendously from greater overall visibility and control, including the ability to set priorities for specific users and jobs, and the ability to monitor cluster performance at both a granular and a macro level. Additionally, administrators would benefit greatly from the ability to automate the enforcement of priorities and efficiently allocate resources across the cluster. This increase in clarity and control would enable companies to get more capacity out of their existing clusters, and allow Hadoop administrators to ensure that business-critical Hadoop applications are not compromised. These tools, however, must be incorporated seamlessly: with no cumbersome installation process and no disruption to the existing cluster environment. They must work with existing schedulers, cluster managers, and resource managers, including YARN.
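To make the priority-setting point concrete: today, the closest built-in mechanism is carving out queue capacities in YARN’s CapacityScheduler via capacity-scheduler.xml. The sketch below assumes a cluster with two hypothetical queues, production and adhoc (the names and percentages are illustrative, not a recommendation):

```xml
<!-- Hypothetical excerpt from capacity-scheduler.xml: two queues under root -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>production,adhoc</value>
  </property>
  <!-- Guarantee the bulk of cluster resources to business-critical jobs -->
  <property>
    <name>yarn.scheduler.capacity.root.production.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
    <value>30</value>
  </property>
  <!-- Cap ad-hoc jobs so they cannot swallow the whole cluster -->
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.maximum-capacity</name>
    <value>50</value>
  </property>
</configuration>
```

Note that even with such queues in place, YARN schedules at the container level (memory and CPU), so contention on disk and network of the kind described above is not directly addressed. That gap is exactly why finer-grained monitoring and automated enforcement remain on the wish list.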
Hadoop has earned a reputation as a breakthrough technology for grappling with the staggering amount of data that companies generate and collect each day. Yet the Hadoop limitations that we experienced years ago at Yahoo remain relevant today and, in fact, motivated us to start Pepperdata.
We look forward to the day when companies can fully rely on Hadoop – with control, visibility, and predictability for the most important jobs. Enterprise-quality Hadoop has the potential to enable companies to take advantage of big data and grow their businesses in new and exciting directions that cannot even be imagined today.
Sean Suchter is the co-founder and CEO of Pepperdata.