With data production accelerating to unprecedented rates, many organizations have turned to Hadoop as an inexpensive way to store and process that data. But those new to Hadoop often find themselves confronting a technology as inscrutable as its name. What is Hadoop? What does it do? What’s with the elephant?
To put it simply, Hadoop is an open-source distributed platform for storing and processing large data sets. Inspired by research papers published by Google, Doug Cutting and Mike Cafarella created Hadoop in 2005, when Cutting was working at Yahoo on its search platform.
Named after a toy elephant owned by Cutting’s son, Hadoop helps control big data processing costs by distributing computing across commodity servers instead of using expensive, specialized servers. Commodity hardware refers to machines that are inexpensive, already available or easily obtainable, and interchangeable with similar hardware. Commodity computing is seen as a more scalable solution because purchasing additional high-cost, high-performance servers to keep up with the growth of big data quickly becomes prohibitively expensive.
After creating Hadoop, Yahoo turned it over to the Apache Software Foundation, where it is maintained as an open-source project with a global community of users and contributors. The Apache Hadoop framework is made up of the following modules:
- Hadoop Distributed File System (HDFS), a Java-based distributed file system designed to store data on commodity servers. HDFS stores files by dividing them into smaller blocks and replicating them on three or more servers.
- MapReduce, a batch programming model that uses parallel processing – basically, using more than one CPU to execute a task simultaneously – to make processing big data faster, cheaper, and more manageable.
- Hadoop YARN, a resource-management and scheduling platform. YARN removes the resource management and scheduling responsibilities from MapReduce, optimizing cluster utilization and allowing MapReduce to focus on data processing.
- Hadoop Common, which consists of the libraries and utilities required by other Hadoop modules.
Beyond these core components, Hadoop includes an entire ecosystem of technologies based on HDFS, such Apache HBase (a key-value data store inspired by Google’s Big Table), Apache Hive (a SQL-based, analytic query engine), Apache Pig (procedural language), Apache Spark (a fast, in-memory engine for large-scale data processing), and even commercial technologies such as Splice Machine (a Hadoop RDBMS with joins and transactions).
A typical Hadoop environment comprises a master node and several worker nodes. Most Hadoop deployments consist of several master node instances to mitigate the risk of a single point of failure. A Hadoop environment can include hundreds or even thousands of worker nodes.
How It Works
Hadoop differs from a traditional relational database in that it is not, strictly speaking, a database but rather a storage and batch data processing system. In a traditional relational database, data queries are conducted using Structured Query Language (SQL) queries. Hadoop, on the other hand, originally did not use SQL or Not Only SQL (NoSQL) queries, but required Java MapReduce programs for data processing.
MapReduce speeds data processing by bringing the processing software to the data as opposed to moving massive amounts of data to the processing software, as is done in traditional data processing architectures. Data processing, therefore, is distributed, or “mapped,” to each node.
MapReduce, in name and function, is a combination of two processing steps: Map and Reduce.
In the Map step, records from the data source are split into bundles for each Map server and are fed into the map() function on that server. Each map function produces a list of results. Each result set is then sent to one reduce server. In the Reduce step, each Reduce server runs a reduce() function over the results lists sent from a subset of map nodes. Then the reducers combine the results of the reduce() runs into a final result.
Using MapReduce is more complex than using a standard database query because it requires the abstraction of tasks into map() and reduce() functions in Java, but tools like Hive can help reduce that complexity by converting a SQL-like query language into MapReduce jobs.
How It’s Used
Use cases for Hadoop are as varied as the organizations using Hadoop. Companies have used Hadoop to analyze customer behavior on their websites, process call center activity, and mine social media data for sentiment analysis about themselves, their products, and their competition. Based on this data, companies can make decisions in real time to understand customer needs, mitigate problems, and ultimately gain a competitive advantage.
Hadoop has been used to bolster data security by being applied to server-log analysis and processing machine data that can be used to identify malware and cyber-attack patterns. Banks use it to store customer transaction data and search for anomalies that could identify fraud.
Other organizations might simply use Hadoop to reduce their data storage costs, as commodity hardware is much less expensive than large, specialized servers. Hadoop can be used to cut storage costs and speed up analysis time in just about any data-rich environment.
As mentioned above, Hadoop is an open-source platform. But using the free, open-source version can be a challenge for anyone who isn’t a Java programmer. There are commercial distributions of Hadoop (e.g., Cloudera, Hortonworks, MapR) that add functionality and improve reliability. There are also Hadoop-based tools that blur the once-sharp line between Hadoop and traditional RDBMS by providing Hadoop with the capabilities of a RDBMS (e.g., Splice Machine) to run business-critical operational applications.
Hadoop’s ability to store, process, and analyze large data sets in a cost-effective manner has made it the de facto architecture for today’s big data environments.
Subscribe to Data Informed for the latest information and news on big data and analytics for the enterprise.