Understanding the Big Data Stack: Hadoop’s Distributed File System

by   |   July 15, 2013 4:22 pm   |   0 Comments

Editor’s note: This article is part of a series examining issues related to evaluating and implementing big data analytics in business.

Most, if not all big data applications achieve their performance and scalability through deployment on a collection of storage and computing resources bound together within a runtime environment. In essence, the ability to design, develop, and implement a big data application is directly dependent on an awareness of the architecture of the underlying computing platform, both from a hardware and, more importantly, from a software perspective.

Related Stories

A look at Hadoop’s future with MapR’s John Schroeder.
Read the story »

Hadoop project Falcon: Data lifecycle management for app developers.
Read the story »

Hadoop sandboxes provide low-risk entry for new programmers.
Read the story »

More on Hadoop.
Read the story »

One commonality among the different appliances and frameworks is the adaptation of tools to leverage the combination of collections of four key computing resources:

  • Processing capability, often referred to as a CPU, processor, or node. Generally speaking, modern processing nodes often incorporate multiple cores that are individual CPUs that share the node’s memory and are managed and scheduled together, allowing multiple tasks to be run simultaneously; this is known as multithreading.
  • Memory, which holds the data that the processing node is currently working on. Most single-node machines have a limit to the amount of memory.
  • Storage, providing persistence of data – the place where datasets are loaded, and from which the data is loaded into memory to be processed.
  • The network, which provides the “pipes” through which datasets are exchanged between different processing and storage nodes.

Because single-node computers are limited in their capacity, they cannot easily accommodate massive amounts of data. That is why high performance platforms are composed of collections of computers in which the massive amounts of data and requirements for processing can be distributed among a pool of resources.

A General Overview of High Performance Architecture
Most high performance platforms are created by connecting multiple nodes together via a variety of network topologies. Specialty appliances may differ in the specifics of the configurations, as do software appliances. However, the general architecture distinguishes the management of computing resources (and corresponding allocation of tasks) and the management of the data across the network of storage nodes, as is seen in the figure below:

Typical organization of resources in a big data platform.

Typical organization of resources in a big data platform.

In this configuration, a master job manager oversees the pool of processing nodes, assigns tasks, and monitors the activity. At the same time, a storage manager oversees the data storage pool and distributes datasets across the collection of storage resources. While there is no a priori requirement that there be any colocation of data and processing tasks, it is beneficial from a performance perspective to ensure that the threads process data that is stored in a way that is directly local to the node upon which the thread executes, or is stored on a node that is close to it. Reducing the costs of data access latency through co-location improves performance speed.

To get a better understanding of the layering and interactions within a big data platform, we will examine aspects of the Apache Hadoop software stack, since the architecture is published and open for review. Hadoop is essentially a collection of open source projects that are combined to enable a software-based big data appliance. We begin with a core aspect of Hadoop’s utilities, upon which the next layer in the stack is propped, namely HDFS, or the Hadoop Distributed File System.

How HDFS Works
HDFS attempts to enable the storage of large files, and does this by distributing the data among a pool of data nodes. A single name node (sometimes referred to as NameNode) runs in a cluster, associated with one or more data nodes, and provides the management of a typical hierarchical file organization and namespace. The name node effectively coordinates the interaction with the distributed data nodes. The creation of a file in HDFS appears to be a single file, even though it blocks “chunks” of the file into pieces that are stored on individual data nodes.

More Articles in This Series

Market and Business Drivers for Big Data Analytics

To best understand what “big data” can mean to your organization, start by understanding the conditions that has led to its growing acceptance. In this article, the first in a series, David Loshin explains the economic drivers that make new analytics applications worth evaluating given today’s exploding data volumes, and the technology innovations that make such systems more accessible to more companies.

Business Problems Suited to Big Data Analytics

Enterprises need clear processes for determining the value proposition of a big data analytics project. In this article, David Loshin examines the applications that make sense for these projects and the criteria that enterprises should use to weigh the costs and benefits of such a strategic investment.

Achieving Organizational Alignment for Big Data Analytics

Numerous aspects of big data analytics hold appeal, and while individuals within an organization can “test drive” them, these new technologies need to win adoption in a broader enterprise setting. Managers need to answer: What is the process for piloting technologies to determine their feasibility and business value? And: What must happen to bring big data analytics into organization’s system development lifecycle?

Developing a Strategy for Integrating Big Data Analytics into the Enterprise

As with any innovative technology that promises business value, there is a rush to embrace big data analytics as a key source of business value. This article explains how to consider the challenges and issues involved in bringing big data analytics into production.

Data Governance for Big Data Analytics: Considerations for Data Policies and Processes

With emerging big data use cases, datasets created for one purpose can be used for an entirely different purpose—a dynamic that challenges traditional approaches to data governance. This article explores ways to manage this conflict and build new governance policies.

Considerations for Storage, Appliances and NoSQL Systems for Big Data Analytics Management

Big data management and analytics applications rely on an ecosystem of components that can be combined in a variety of ways to address application requirements. This article examines three aspects of this ecosystem and associated technologies: storage, appliances, and data management.

An Introduction to Big Data Application Development and MapReduce

For any target big data platform, you must have an application development framework that supports a system development lifecycle and provides a means for loading and executing the developed application. This article discusses the principles involved and how programmers use the MapReduce and ECL frameworks to analyze big datasets.

The name node maintains metadata about each file as well as the history of changes to file metadata. That metadata includes an enumeration of the managed files, properties of the files and the file system, as well as the mapping of blocks to files at the data nodes. The data node itself does not manage any information about the logical HDFS file; rather, it treats each data block as a separate file and shares the critical information with the name node.

Once a file is created, as data is written to the file, it is actually cached in a temporary file. When the amount of the data in that temporary file is enough to fill a block in an HDFS file, the name node is alerted to transition that temporary file into a block that is committed to a permanent data node, which is also then incorporated into the file management scheme.

HDFS provides a level of fault tolerance through data replication. An application can specify the degree of replication (that is, the number of copies made) when a file is created. The name node also manages replication, attempting to optimize the marshaling and communication of replicated data in relation to the cluster’s configuration and corresponding efficient use of network bandwidth. This is increasingly important in larger environments consisting of multiple racks of data servers, since communication among nodes on the same rack is generally faster than between server nodes in different racks. HDFS attempts to maintain awareness of data node locations across the hierarchical configuration.

In essence, HDFS provides performance through distribution of data and fault tolerance through replication, the result is a level of robustness for reliable massive file storage. Enabling this level of reliability should be facilitated through a number of key tasks for failure management, some of which are already deployed within HDFS while others are not currently implemented:

    • Monitoring: There is a continuous “heartbeat” communication between the data nodes to the name node. If a data node’s heartbeat is not heard by the name node, the data node is considered to have failed and is no longer available. In this case, a replica is employed to replace the failed node, and a change is made to the replication scheme.


    • Rebalancing: This is a process of automatically migrating blocks of data from one data node to another when there is free space, when there is an increased demand for the data and moving it may improve performance (such as moving from a traditional disk drive to a Solid State drive that is much faster or can accommodate increased numbers of simultaneous accesses), or an increased need to replication in reaction to more frequent node failures.


    • Managing integrity: HDFS uses checksums, which are effectively “digital signatures” associated with the actual data stored in a file (often calculated as a numerical function of the values within the bits of the files) that can be used to verify that the data stored corresponds to the data shared or received. When the checksum calculated for a retrieved block does not equal the stored checksum of that block, it is considered an integrity error. In that case, the requested block will need to be retrieved from a replica instead.


    • Metadata replication: The metadata files are also subject to failure, and HDFS can be configured to maintain replicas of the corresponding metadata files to protect against corruption.


  • Snapshots: This is incremental copying of data to establish a point in time to which the system can be rolled back. This is not currently supported.


Hadoop Resources

Some very good information is available in Jeff Hanson’s article “An Introduction to the Hadoop Distributed File System,” accessed via the IBM website.

These concepts map to specific internal protocols and services that HDFS uses to enable a large-scale data management file system that can run on commodity hardware components. The ability to use HDFS solely as a means for creating a scalable and expandable file system for maintaining rapid access to large datasets provides a reasonable value proposition from an Information Technology perspective: decreasing the cost of specialty large scale storage systems, reliance on commodity components, the ability to deploy using cloud-based services, and even lowered system management costs.

David Loshin is the author of several books, including Practitioner’s Guide to Data Quality Improvement and the second edition of Business Intelligence—The Savvy Manager’s Guide. As president of Knowledge Integrity Inc., he consults with organizations in the areas of data governance, data quality, master data management and business intelligence. Email him at loshin@knowledge-integrity.com.

Tags: , ,

Post a Comment

Your email is never published nor shared. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>