Editor’s note: This article is part of a series examining issues related to evaluating and implementing big data analytics in business.
Most, if not all, big data applications achieve their performance and scalability through deployment on a collection of storage and computing resources bound together within a runtime environment. In essence, the ability to design, develop, and implement a big data application depends directly on an awareness of the architecture of the underlying computing platform, from both a hardware and, more importantly, a software perspective.
One commonality among the different appliances and frameworks is the adaptation of tools to leverage a collection of four key computing resources:
- Processing capability, often referred to as a CPU, processor, or node. Modern processing nodes typically incorporate multiple cores, individual CPUs that share the node’s memory and are managed and scheduled together, allowing multiple tasks to run simultaneously; this is known as multithreading.
- Memory, which holds the data that the processing node is currently working on; any single node can hold only a limited amount of memory.
- Storage, which provides persistence of data: the place where datasets reside and from which data is loaded into memory to be processed.
- The network, which provides the “pipes” through which datasets are exchanged between different processing and storage nodes.
Because single-node computers are limited in their capacity, they cannot easily accommodate massive amounts of data. That is why high-performance platforms are composed of collections of computers in which the massive amounts of data and the requirements for processing can be distributed among a pool of resources.
A General Overview of High Performance Architecture
Most high performance platforms are created by connecting multiple nodes together via a variety of network topologies. Specialty appliances may differ in the specifics of the configurations, as do software appliances. However, the general architecture distinguishes the management of computing resources (and corresponding allocation of tasks) and the management of the data across the network of storage nodes, as is seen in the figure below:
In this configuration, a master job manager oversees the pool of processing nodes, assigns tasks, and monitors the activity. At the same time, a storage manager oversees the data storage pool and distributes datasets across the collection of storage resources. While there is no a priori requirement for colocation of data and processing tasks, it is beneficial from a performance perspective to ensure that threads process data stored locally on the node where the thread executes, or on a node close to it. Reducing data access latency through colocation improves performance.
To get a better understanding of the layering and interactions within a big data platform, we will examine aspects of the Apache Hadoop software stack, since its architecture is published and open for review. Hadoop is essentially a collection of open source projects that are combined to enable a software-based big data appliance. We begin with a core aspect of Hadoop’s utilities, on which the next layer in the stack rests: HDFS, the Hadoop Distributed File System.
How HDFS Works
HDFS enables the storage of large files by distributing the data among a pool of data nodes. A single name node (sometimes referred to as the NameNode) runs in a cluster, associated with one or more data nodes, and manages a typical hierarchical file organization and namespace. The name node effectively coordinates the interaction with the distributed data nodes. A file created in HDFS appears to be a single file, even though HDFS breaks it into blocks, or “chunks,” that are stored on individual data nodes.
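This chunking can be observed from the client side by asking the name node where each block of a file resides. The following is a minimal sketch using the standard Hadoop Java client (org.apache.hadoop.fs.FileSystem); the path /data/example.csv is a hypothetical example, and the cluster address is assumed to come from the Hadoop configuration files on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        // Cluster settings (fs.defaultFS, etc.) are read from the Hadoop
        // configuration on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path used for illustration.
        Path file = new Path("/data/example.csv");
        FileStatus status = fs.getFileStatus(file);

        // Ask the name node which data nodes hold each block of the file.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```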
The name node maintains metadata about each file as well as the history of changes to file metadata. That metadata includes an enumeration of the managed files, properties of the files and the file system, as well as the mapping of blocks to files at the data nodes. The data node itself does not manage any information about the logical HDFS file; rather, it treats each data block as a separate file and shares the critical information with the name node.
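Clients can query this metadata directly. The short sketch below, again assuming the standard Hadoop Java API and a hypothetical directory /data, lists the files the name node knows about along with some of the per-file properties it maintains (size, block size, replication factor).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListFileMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical directory; the name node returns one FileStatus
        // entry per file or subdirectory it manages under this path.
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.printf("%s  size=%d  blockSize=%d  replication=%d%n",
                    status.getPath(), status.getLen(),
                    status.getBlockSize(), status.getReplication());
        }
        fs.close();
    }
}
```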
Once a file is created, data written to the file is actually cached in a temporary file. When the amount of data in that temporary file is enough to fill a block in an HDFS file, the name node is alerted to transition that temporary file into a block that is committed to a permanent data node, which is then incorporated into the file management scheme.
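From the application’s point of view, none of this block handling is visible: the client simply opens an output stream and writes bytes. A minimal sketch, assuming the standard Hadoop Java API and a hypothetical output path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class WriteToHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // The client sees a single logical file; HDFS buffers the writes
        // and commits full blocks to data nodes behind the scenes.
        Path out = new Path("/data/example-output.txt");  // hypothetical path
        try (FSDataOutputStream stream = fs.create(out)) {
            stream.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```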
HDFS provides a level of fault tolerance through data replication. An application can specify the degree of replication (that is, the number of copies made) when a file is created. The name node also manages replication, attempting to optimize the marshaling and communication of replicated data in relation to the cluster’s configuration and corresponding efficient use of network bandwidth. This is increasingly important in larger environments consisting of multiple racks of data servers, since communication among nodes on the same rack is generally faster than between server nodes in different racks. HDFS attempts to maintain awareness of data node locations across the hierarchical configuration.
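The replication degree can be supplied by the application when the file is created, or adjusted afterwards. The sketch below uses the standard Hadoop Java API; the path and the replication factor of 3 are illustrative assumptions (3 is also the usual HDFS default).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/important.dat");  // hypothetical path

        // Request three copies of every block when the file is created.
        short replication = 3;
        try (FSDataOutputStream stream =
                     fs.create(file, true,
                               conf.getInt("io.file.buffer.size", 4096),
                               replication,
                               fs.getDefaultBlockSize(file))) {
            stream.writeBytes("replicated payload\n");
        }

        // The replication factor of an existing file can also be changed later.
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}
```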
In essence, HDFS provides performance through distribution of data and fault tolerance through replication; the result is a level of robustness suited to reliable storage of massive files. Enabling this level of reliability depends on a number of key failure management tasks, some of which are already deployed within HDFS while others are not currently implemented:
- Monitoring: There is continuous “heartbeat” communication from the data nodes to the name node. If a data node’s heartbeat is not heard by the name node, the data node is considered to have failed and is no longer available. In this case, a replica is used in place of the failed node’s data, and the replication scheme is adjusted accordingly.
- Rebalancing: This is a process of automatically migrating blocks of data from one data node to another when free space is available, when increased demand for the data means that moving it may improve performance (such as moving from a traditional disk drive to a much faster solid-state drive that can accommodate more simultaneous accesses), or when more frequent node failures call for increased replication.
- Managing integrity: HDFS uses checksums, which are effectively “digital signatures” associated with the actual data stored in a file (often calculated as a numerical function of the values of the bits in the file) that can be used to verify that the data stored corresponds to the data shared or received. When the checksum calculated for a retrieved block does not equal the stored checksum for that block, it is considered an integrity error, and the requested block must be retrieved from a replica instead (a simplified sketch of this check appears after this list).
- Metadata replication: The metadata files are also subject to failure, and HDFS can be configured to maintain replicas of the corresponding metadata files to protect against corruption.
- Snapshots: This is incremental copying of data to establish a point in time to which the system can be rolled back. This is not currently supported.
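To illustrate the integrity check described above, the sketch below computes and compares a CRC32 checksum for a block of data. This is only a simplified model of the idea; HDFS’s actual implementation maintains its own checksum metadata (by default, one checksum per small chunk of data) and falls back to a replica automatically when verification fails.

```java
import java.util.zip.CRC32;

public class ChecksumCheck {
    /** Compute a CRC32 checksum over a block of bytes. */
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block of data".getBytes();

        // Checksum recorded when the block was originally written.
        long stored = checksum(block);

        // Simulate corruption of the stored block.
        block[0] ^= 0x01;

        // Recompute on read; a mismatch signals an integrity error,
        // and the block would be fetched from a replica instead.
        long recomputed = checksum(block);
        if (recomputed != stored) {
            System.out.println("Integrity error: fetch block from a replica");
        }
    }
}
```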
These concepts map to specific internal protocols and services that HDFS uses to enable a large-scale data management file system that can run on commodity hardware components. The ability to use HDFS solely as a means of creating a scalable and expandable file system that maintains rapid access to large datasets offers a reasonable value proposition from an information technology perspective: lower cost than specialty large-scale storage systems, reliance on commodity components, the ability to deploy using cloud-based services, and even lower system management costs.
David Loshin is the author of several books, including Practitioner’s Guide to Data Quality Improvement and the second edition of Business Intelligence—The Savvy Manager’s Guide. As president of Knowledge Integrity Inc., he consults with organizations in the areas of data governance, data quality, master data management and business intelligence. Email him at email@example.com.