Editor’s note: This article is part of a series examining issues related to evaluating and implementing big data analytics in business.
There are misconceptions about the presumed ease of developing big data applications, especially now that open source availability has lowered the barriers to acquiring big data system software. While it may be straightforward to download and install the core components of a big data development and execution environment like Hadoop and MapReduce, designing, developing, and deploying analytic applications still requires skill and expertise.
As noted in previous articles on application development and distributed file systems, one must distinguish among the different big data management tasks: architecting the big data system, selecting and connecting its components, building the actual big data environment, developing and implementing applications that run on the big data platform, system configuration, system monitoring, and continued maintenance.
There is a need to expand the ecosystem to incorporate a variety of additional capabilities, such as configuration management, data organization, application development and optimization, as well as additional capabilities to support analytics. This article examines the prototypical big data platform using Hadoop, and how Hadoop-related projects address these pieces of the puzzle:
• Synchronization and coordination of process and object namespace across different applications and assets.
• Data management, which has a number of alternatives. This article focuses on Hive and HBase; a future article will discuss a variety of big data management schemes.
• Programming ease of use, and methods to simplify MapReduce programming.
• Analytics, or rather some implementations of algorithms that can be used to develop analytical models.
Zookeeper: Synchronization and Coordination of Multiple Tasks
Whenever there are multiple tasks and jobs running within a single distributed environment, there is a need for configuration management and synchronization of various aspects of naming and coordination. Zookeeper is an open-source Apache Foundation project that “is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.”
Zookeeper manages a naming registry and effectively implements a system for managing the various statically- and dynamically-named objects in a hierarchical manner, much like a file system. In addition, it enables coordination for exercising control over shared resources (such as distributed files or shared data tables) that may be touched by multiple application processes or tasks.
These controls are necessary because variations in the timing of individual tasks can alter an application’s expected results. One such situation is a “race condition,” which occurs when multiple tasks attempt to read from and/or write to the same data object. Without any assurance of the order in which the tasks access the object, certain writes will take place before certain reads during some executions, while the opposite may hold during others, making the results unpredictable. The same control methods are used to prevent what is called “deadlock.” This happens when multiple tasks vie for control of the same resource, but when none can get complete access, they all effectively lock each other out of any ability to use the resource.
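The lost-update form of a race condition can be illustrated with a small, deterministic simulation (a sketch of the problem itself, not of Zookeeper): two tasks each read a shared counter and write back an incremented value, and depending on the interleaving, one update may be lost.

```python
# Deterministic simulation of a lost update (a classic race condition).
# Each "task" performs read -> compute -> write on a shared counter.

def run(schedule):
    """Replay a fixed interleaving of two read-modify-write tasks."""
    shared = {"x": 0}
    local = {}            # each task's privately read value
    for task, step in schedule:
        if step == "read":
            local[task] = shared["x"]
        else:             # "write"
            shared["x"] = local[task] + 1
    return shared["x"]

# Serialized: task A completes before task B, so both increments survive.
serialized = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]

# Interleaved: both tasks read 0 before either writes, so one update is lost.
interleaved = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]

print(run(serialized))    # 2
print(run(interleaved))   # 1  (B's write overwrites A's increment)
```

The coordination services described above exist precisely to rule out the second schedule by forcing each read-modify-write sequence to complete before another begins.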
Most computer science students will learn about these kinds of controls as part of a typical curriculum, but few individuals actually ever develop them, since they are typically implemented within the recesses of most modern operating systems. Shared coordination services like those provided in Zookeeper allow developers to employ these controls without having to develop them from scratch.
As an example of how Zookeeper can be used, consider a MapReduce application that shares a task queue among a collection of executing threads. That task queue is a shared resource that requires coordination, since simultaneous attempts to append a new task to the end of the queue could overwrite each other, potentially losing the overwritten tasks. The controls in Zookeeper can be used to establish a protective form of mutual exclusion for access to the task queue. If there are concurrent attempts to access the queue, the Zookeeper controls will serialize the accesses and ensure that every thread sees the appends take place in the same order, which guarantees that applications run consistently.
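Zookeeper exposes this coordination through its own client API; the underlying mutual-exclusion pattern can be sketched locally with Python’s threading primitives, with a `threading.Lock` standing in for a Zookeeper-managed distributed lock:

```python
import threading

task_queue = []                  # the shared resource
queue_lock = threading.Lock()    # stand-in for a Zookeeper-style distributed lock

def worker(worker_id, n_tasks):
    for i in range(n_tasks):
        # The lock serializes appends, so no two threads interleave an update.
        with queue_lock:
            task_queue.append((worker_id, i))

threads = [threading.Thread(target=worker, args=(w, 100)) for w in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(task_queue))    # 400: no appended task was lost
```

In a real cluster the lock must live outside any single process, which is exactly the role a coordination service like Zookeeper plays.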
HBase: Data Management for the Hadoop Framework
HBase is an example of a non-relational data management environment that distributes massive datasets over the underlying Hadoop framework. HBase is modeled on Google’s BigTable, and its column-oriented data layout, layered on top of Hadoop, provides a fault-tolerant method for storing and manipulating large data tables.
HBase is not a relational database, and it does not support SQL queries. Instead, it offers a handful of basic operations: Get (which accesses a specific row in the table), Put (which stores or updates a row in the table), Scan (which iterates over a collection of rows in the table), and Delete (which removes a row from the table). HBase’s approach to data organization includes optimizations associated with preferred data layouts (such as columnar data alignment, which speeds data access and creates opportunities for data compression), making it a reasonable choice for persistent data storage and management for MapReduce applications.
A good example of an application that uses a data management framework like HBase is a health care provider collecting clinical results associated with its patients. There may be a large number of clinical tests that might be performed – blood work, imaging scans, endoscopies, for example. However, attempting to create a relational data schema that could accommodate any clinical exam that might be performed would result in a very wide, yet sparsely populated table. HBase provides a method for schema-less modeling, allowing the data practitioner to attach clinical test results (such as lab work, X-ray images, or magnetic scan images) to a unique patient’s virtual electronic health history in a way that enables rapid storage and rapid access, while still allowing analysts to run limited types of queries over the entire collection of clinical results to support diagnostics and to look for optimal plans for delivering care.
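The clinical scenario above can be sketched with the four basic operations using a toy in-memory table. This sketch only mimics HBase’s sparse, column-family data model; a real HBase client exposes Get, Put, Scan, and Delete through its own API, and all of the row keys and column names below are hypothetical.

```python
class SketchTable:
    """A toy, in-memory stand-in for an HBase-style table: each row is a
    sparse map from 'family:qualifier' column names to values."""

    def __init__(self):
        self.rows = {}    # row key -> {column name: value}

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        return self.rows.get(row_key, {})

    def scan(self, start, stop):
        # Iterate over rows whose keys fall within [start, stop), in key order.
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, self.rows[key]

    def delete(self, row_key):
        self.rows.pop(row_key, None)

patients = SketchTable()
patients.put("patient-001", "labs:glucose", 98)
patients.put("patient-001", "imaging:chest_xray", "img-4411")
patients.put("patient-002", "labs:glucose", 121)   # no imaging columns: rows stay sparse

print(patients.get("patient-001")["labs:glucose"])                 # 98
print([key for key, _ in patients.scan("patient-000", "patient-999")])

patients.delete("patient-002")
print(patients.get("patient-002"))   # {}
```

Because each row only stores the columns it actually has, the “very wide, sparsely populated” relational table never materializes.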
Hive: A Data Warehouse System for Hadoop
Even though MapReduce provides a methodology for developing and executing applications that use massive amounts of data, it does not provide any data management capabilities itself. Data can be managed in Hadoop Distributed File System (HDFS) files, but there are still needs for structured database tables to support business intelligence and reporting applications, and this was the motivation for the development of Hive. Another Apache project, Hive is a “data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.”
Hive was engineered as a platform for batch data warehouse querying and reporting, but it does not support the real-time query execution or row-level transaction semantics typically necessary for transaction processing. Hive is layered on top of Hadoop’s file system and execution framework, and it enables applications and users to organize data in a structured data warehouse and query that data using HiveQL, a query language similar to the SQL generally used for accessing modern relational database management systems.
Hive provides tools for extracting, transforming and loading (ETL) data into a variety of different data formats. Because it is tightly coupled with Hadoop, it enables native access to the MapReduce model. This means that programmers can develop custom Map and Reduce functions that can be directly integrated into HiveQL queries. Hive leverages Hadoop’s scalability as well as its fault-tolerance capabilities to enable batch-style queries for reporting over large (and expanding) datasets.
An example of a business intelligence and reporting application that relies on a traditional data warehouse architecture involves Web log statistics analysis. In some online retail environments, webpage requests, server requests, database requests, webpage deliveries, and user actions can all be combined in a data warehouse to allow assessment of optimal page placement, demographic user characteristics, behavior trends, and opportunities for increasing sales. However, the volume of these transactions often overwhelms a typical server-based platform; using Hive expands the potential storage footprint and enables a broad collection of transactions over longer time periods for better trend and behavior reporting and analysis.
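HiveQL reads much like standard SQL. The flavor of a batch reporting query over such web log data can be sketched with Python’s built-in sqlite3 module as a stand-in engine; the table, columns, and data here are hypothetical, and Hive itself would run the equivalent query over HDFS-resident files rather than an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, user_id TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("/home", "u1"), ("/home", "u2"), ("/checkout", "u1"),
     ("/home", "u3"), ("/checkout", "u2")],
)

# A HiveQL report query would read almost identically:
#   SELECT page, COUNT(*) AS views FROM page_views GROUP BY page ORDER BY views DESC;
report = conn.execute(
    "SELECT page, COUNT(*) AS views FROM page_views "
    "GROUP BY page ORDER BY views DESC"
).fetchall()

print(report)   # [('/home', 3), ('/checkout', 2)]
```

The point of the layering is that a query of this shape keeps working as the log data grows from megabytes to terabytes, with Hive compiling it into MapReduce jobs behind the scenes.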
Pig: A Means to Simplify the Development Process
Even though the MapReduce programming model is relatively straightforward, developers still need skill in both parallel and distributed programming and Java to build applications that effectively exploit the architecture. The Pig project is an attempt to simplify the application development process by abstracting some of the details away through a higher-level programming language called Pig Latin. According to the Pig website, Pig’s high-level programming language allows the developer to specify how the analysis is performed. In turn, a compiler transforms the Pig Latin specification into MapReduce programs.
Embedding a set of parallel operators and functions within a control sequence of directives applied to massive datasets provides a means for data manipulation similar to the way SQL statements are applied to traditional RDBMS systems. These parallel operators include generating datasets, filtering subsets, joining datasets, splitting datasets, and removing duplicates. For simple applications, using Pig provides significant ease of development, and more complex tasks can be engineered as sequences of applied operators.
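The flavor of such a pipeline of operators (filter, remove duplicates, join) can be sketched with ordinary Python over small in-memory datasets. The comments show hypothetical Pig Latin counterparts for each step; the relations and field names are invented for illustration.

```python
# Hypothetical datasets, standing in for relations LOADed from HDFS.
clicks = [("u1", "/home"), ("u2", "/home"), ("u1", "/buy"), ("u1", "/buy")]
users = [("u1", "gold"), ("u2", "silver")]

# FILTER clicks BY page == '/buy';
buys = [(user, page) for user, page in clicks if page == "/buy"]

# DISTINCT buys;  -- remove duplicate records, preserving first-seen order
distinct_buys = list(dict.fromkeys(buys))

# JOIN distinct_buys BY user, users BY user;
tiers = dict(users)
joined = [(user, page, tiers[user]) for user, page in distinct_buys if user in tiers]

print(joined)   # [('u1', '/buy', 'gold')]
```

In Pig, each of these steps would be a one-line Pig Latin statement, and the compiler, not the programmer, would decide how to parallelize them as MapReduce jobs.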
In addition, the use of a high-level language also allows the compiler to identify opportunities for optimization that might have been ignored by an inexperienced programmer. At the same time, the Pig environment allows developers to create new user-defined functions (UDFs) that can subsequently be incorporated into developed programs.
Mahout: A Library of Machine Learning Algorithms
A big data platform would be of limited value for analytics without implementations of the analytical algorithms themselves. Mahout is a project to provide a library of scalable implementations of machine learning algorithms on top of MapReduce and Hadoop. Mahout’s library includes numerous well-known analysis methods, including:
• Collaborative filtering and other user- and item-based recommender algorithms, which are used to make predictions about an individual’s interests or preferences through comparison with a multitude of others who may or may not share similar characteristics.
• Clustering, including algorithms to look for groups, patterns, and commonality among selected cohorts in a population.
• Categorization algorithms used to place items into already-defined categories.
• Text mining and topic modeling algorithms for scanning text and assigning contextual meanings.
• Frequent pattern mining, which is used for market basket analysis, comparative health analytics, and other patterns of correlation within large data sets.
The availability of implemented libraries for these types of analytics frees the development team to consider the types of problems to be analyzed and, more specifically, the types of analytical models that can be applied to seek the best answers.
As an example, the collaborative filtering algorithms implemented in the Mahout libraries can be used to build a recommendation engine. The programmer defines the attributes of individuals that characterize similarity (such as income, age, and geographic location). That similarity scoring mechanism can be plugged into a Mahout-based application to compare individuals across a large dataset and find “nearest neighbors,” or sets of individuals that share similar characteristics. Each of the individuals will have specified some product preferences (for example, by purchasing an item or reviewing it favorably). To generate recommendations for a specific customer, the model is searched for that customer’s nearest neighbors. The product preferences of those neighbors are surveyed, scored by level of preference, and then sorted in descending order, producing a list of recommended products to present to the customer.
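The recommendation steps just described (score similarity, find the nearest neighbors, tally their preferences) can be sketched in a few lines of Python. The similarity measure, attribute values, and customer data are all hypothetical, and Mahout’s own recommender APIs differ; this only illustrates the logic of the approach.

```python
# Hypothetical customer attributes: (income in $k, age).
profiles = {
    "alice": (55, 34),
    "bob":   (52, 31),
    "carol": (120, 58),
    "dan":   (50, 36),
}
# Products each customer has purchased or rated favorably.
preferences = {
    "bob":   ["tent", "stove"],
    "carol": ["watch"],
    "dan":   ["tent", "lantern"],
}

def similarity(a, b):
    # Inverse Euclidean distance: closer profiles score higher.
    return 1.0 / (1.0 + sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5)

def recommend(target, k=2):
    # Rank the other customers by similarity and keep the k nearest neighbors.
    others = [c for c in profiles if c != target]
    neighbors = sorted(
        others, key=lambda c: similarity(profiles[target], profiles[c]),
        reverse=True,
    )[:k]
    # Tally the neighbors' preferred products, weighted by similarity.
    scores = {}
    for c in neighbors:
        for product in preferences.get(c, []):
            scores[product] = scores.get(product, 0.0) + similarity(
                profiles[target], profiles[c]
            )
    # Return products sorted by descending score.
    return [p for p, _ in sorted(scores.items(), key=lambda kv: -kv[1])]

print(recommend("alice"))   # ['tent', 'stove', 'lantern']
```

For alice, the nearest neighbors are bob and dan, so the distant customer carol’s watch never enters the recommendations; what Mahout adds is the ability to run this same logic at scale across millions of customers.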
The need to design and deploy business solutions suggests that transitioning from an experimental “laboratory” system into a production environment demands more than just downloading software and enabling access to a system comprising computing, memory, storage, and network resources. Fortunately, the Hadoop framework provides these and the other additional components that would typically be necessary as part of the big data ecosystem.
David Loshin is the author of several books, including Practitioner’s Guide to Data Quality Improvement and the second edition of Business Intelligence—The Savvy Manager’s Guide. As president of Knowledge Integrity Inc., he consults with organizations in the areas of data governance, data quality, master data management and business intelligence. Email him at firstname.lastname@example.org.