Hadoop is the open source software framework at the heart of much of the Big Data and analytics revolution. It provides solutions for enterprise data storage and analytics with almost unlimited scalability. Since its 1.0 release in 2011, it has rapidly grown in popularity, and a strong ecosystem of distributors, vendors, and consultants has emerged to support its use across industry.
At its core, Hadoop is an open source system, which, among other considerations, means it is essentially free for anyone to use. However, the necessity for individual organizations to align it with their needs has resulted in the emergence of many commercial distributions. These generally come packaged with support or additional features designed to streamline its deployment or allow users to build additional analytics, security, or data handling into their framework.
Competition in this market is fierce, and the landscape is constantly shifting – for example, all the top distributions now include the Apache Spark parallel processing framework, whereas a few years ago this was not the case. The growing prominence of Spark has resulted in many vendors increasing the resources dedicated to Spark deployment and support.
One important factor to consider when choosing a Hadoop distribution is whether you want an on-premises or cloud-based solution. If there is no room to compromise when it comes to maintaining complete control and ownership of your data, an on-site solution still theoretically offers the highest level of security. In recent years, though, cloud solutions have become less expensive, more flexible, and easier to scale.
Most of the vendor products listed here can be installed either in the cloud or on-premises. However, some cannot be run on-site. These are generally products from web service providers, such as Amazon or Microsoft, running either Hadoop distributions from other, platform-focused vendors such as Hortonworks or MapR, or their own distributions.
Beyond that, all of the top distributions have subtle differences that could make them more or less suitable for your business. Here’s a non-exhaustive guide to some of the most popular ones on the market today.
Amazon Elastic MapReduce
Amazon offers a cloud-only Hadoop-as-a-Service platform through its Amazon Web Services arm. A key advantage of the pay-as-you-go model offered by cloud-only service providers is the scalability available, with storage and data processing able to be ramped up or wound down as demands change. Amazon has recently announced that customers can now use the Apache Flink stream processing framework for real-time data analytics on the platform, along with other popular tools such as Kafka and Presto. It also seamlessly connects (as you would expect) with Amazon’s other cloud services infrastructure such as EC2 for cloud processing, Amazon S3 and DynamoDB for storage, and AWS IoT for collecting data from Internet of Things-enabled devices.
Cloudera
Cloudera was the first vendor to offer Hadoop as a package and continues to be a leader in the industry. Its Cloudera CDH distribution, which contains all the open source components, is the most popular Hadoop distribution. Cloudera is known for acting quickly to innovate with additions to the core framework – it was the first to offer SQL-for-Hadoop with its Impala query engine. Other additions include a user interface, security features, and interfaces for integration with third-party applications. It offers support for the whole of the distribution through its Cloudera Enterprise subscription service.
Hortonworks
Hortonworks’ platform is entirely open source – in fact, the company is known for acquiring other companies with useful code and releasing that code to the open source community. What some have seen as the start of a trend towards consolidation in the market has prompted a growth in popularity of Hortonworks’ product. Just this year Pivotal stopped development of its own distribution, and both Amazon and IBM are now offering Hortonworks as an option on their own platforms alongside their own Hadoop distributions. Hortonworks’ platform is also at the core of the Open Data Platform Initiative – a group looking to simplify and standardize specifications in the Big Data ecosphere. In the long run this is likely to mean it will become even more widely supported.
Microsoft Azure HDInsight
Microsoft’s Azure HDInsight platform is a cloud-only service that offers managed installations of several open source Hadoop distributions, including Hortonworks, Cloudera, and MapR. It integrates them with its own Azure Data Lake platform to offer a complete solution for cloud-based storage and analytics. In addition to the core Hadoop framework, HDInsight provides Spark, Hive, Kafka, and Storm cloud services as well as its own cloud security framework.
MapR
Like Hortonworks and Cloudera, MapR is a platform-focused provider, rather than a managed service provider like Amazon or Microsoft. MapR integrates its own database system, MapR-DB, which it claims is between four and seven times faster than the stock Hadoop database, HBase, running on competing distributions. Due to its power and speed, MapR is often seen as a good choice for the biggest of Big Data projects.
Altiscale
Acquired this year by SAP for $125 million, Altiscale is another company offering cloud-based, managed Hadoop-as-a-Service. It continues to offer its Altiscale Data Cloud product, which includes additional operational services like automation, security, scaling, and performance-tuning alongside the core Hadoop framework. Data Cloud also provides managed Spark, Hive, and Pig services – like most of the other products here. But unlike the other as-a-service offerings, it uses its own Hadoop distribution instead of that of one of the platform-focused vendors such as Hortonworks or MapR.
As with the entire big data ecosystem, things are constantly evolving, and I will keep a close eye on the developments over the coming months. In the meantime, I hope that this article has provided some clarity about the current state of commercial Hadoop distributions.
Bernard Marr is a bestselling author, keynote speaker, strategic performance consultant, and analytics, KPI, and big data guru. In addition, he is a member of the Data Informed Board of Advisers. He helps companies to better manage, measure, report, and analyze performance. His leading-edge work with major companies, organizations, and governments across the globe makes him an acclaimed and award-winning keynote speaker, researcher, consultant, and teacher.