An Introduction to NoSQL Data Management for Big Data

By David Loshin   |   September 4, 2013 3:24 pm

Editor’s note: This article is part of a series examining issues related to evaluating and implementing big data analytics in business.

As with any emerging yet complex application development framework, successful implementation relies on an ecosystem of different components that can be combined to address the development of the appropriate solution. For big data, that ecosystem revolves around some key architectural artifacts, including scalable storage, parallel computing, and data management paradigms that are united through an application development platform. And while it is important to review the ways these different components are blended together, from the perspective of application development, the interdependence between analytic algorithms and the underlying data management framework warrants a more in-depth review of big data storage paradigms.

The reason is straightforward: when applications already depend on a traditional relational database management system (RDBMS) model and/or a data warehousing approach to data management, it may be sufficient to port the RDBMS tools as a way of scaling performance on a big data appliance. However, many algorithms that expect to take advantage of a high-performance, elastic, distributed data environment are not suited to consuming data held in traditional RDBMS systems. That means developers must consider different methods for data management.

These analytic algorithms can employ one of an array of alternative means for data management that are typically bundled under the term “NoSQL databases.” The term “NoSQL” conveys two different concepts. The first suggests a data management framework that is not SQL-compliant. The second, more generally acknowledged meaning (that is, the one more frequently presented) is that the term stands for “Not only SQL,” suggesting environments that combine traditional SQL (or SQL-like query languages) with alternative means of querying and access.

“Schema-less Models”: Increasing Flexibility for Data Manipulation
NoSQL data systems provide a more relaxed approach to data modeling often referred to as schema-less modeling, in which the semantics of the data are embedded within a flexible connection topology and a corresponding storage model. This provides greater flexibility for managing large data sets while simultaneously reducing the dependence on the more formal database structure imposed by the relational database systems.

The flexible model enables automatic distribution of data and elasticity with respect to the use of computing, storage, and network bandwidth in ways that don’t force specific binding of data to be persistently stored in particular physical locations. NoSQL databases also provide for integrated data caching that helps reduce data access latency and speed performance.

The loosening of the relational structure is intended to allow different models to be adapted to specific types of analyses. For example, some are implemented as key-value stores, which align nicely with big data programming models like MapReduce. Although the “relaxed” approach to modeling and management paves the way for performance improvements for analytical applications, it does not enforce adherence to strictly defined structures, and the models themselves do not necessarily impose any validity rules. This potentially introduces risks associated with ungoverned data management activities, such as inadvertent inconsistent data replication, reinterpretation of semantics, and currency and timeliness issues.

This article discusses four different NoSQL approaches:

  • Key-value stores
  • Document stores
  • Tabular stores
  • Object stores



Key-Value Stores
A key-value store is a schema-less NoSQL model in which data objects are associated with distinct character strings called keys, similar to the data structure known as a hash table. Many of the NoSQL architectures rely on variations on the key-value theme, in that unique keys are employed to both identify entities and to locate attribute information about those entities. This pervasive use of unique keys lends a degree of credibility to this basic approach to a schema-less model.

As an example, consider the data subset represented in Table 1, where the key is the name of the automobile make, while the value is a list of names of models associated with that automobile make.

Table 1: Example data represented in a key-value store.
Key   |   Value
“BMW”   |   {“1-Series”, “3-Series”, “5-Series”, “5-Series GT”, “7-Series”, “X3”, “X5”, “X6”, “Z4”}
“Buick”   |   {“Enclave”, “LaCrosse”, “Lucerne”, “Regal”}
“Cadillac”   |   {“CTS”, “DTS”, “Escalade”, “Escalade ESV”, “Escalade EXT”, “SRX”, “STS”}

The key-value store does not impose any constraints about data typing or data structure. It is the responsibility of the consuming business applications to interpret the semantics of the data organization.

The core operations performed on a key-value store include:

  • Get(key), which returns the value associated with the provided key.
  • Put(key, value), which associates the value with the key.
  • Multi-get(key1, key2, ..., keyN), which returns the list of values associated with the list of keys.
  • Delete(key), which removes the entry for the key from the data store.
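
The four operations above can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not a description of how any particular product implements them: the class name `KeyValueStore` and its method names are hypothetical, and the sample data is taken from Table 1.

```python
from typing import Any, Dict, List, Optional


class KeyValueStore:
    """Hypothetical in-memory key-value store, for illustration only."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        # Associates the value with the key, replacing any earlier value.
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        # Returns the value for the key, or None if the key is absent.
        return self._data.get(key)

    def multi_get(self, *keys: str) -> List[Optional[Any]]:
        # Returns the values for the keys, in the order the keys were given.
        return [self._data.get(k) for k in keys]

    def delete(self, key: str) -> None:
        # Removes the entry for the key; a missing key is ignored.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("Buick", ["Enclave", "LaCrosse", "Lucerne", "Regal"])
store.put("Cadillac", ["CTS", "DTS", "Escalade", "Escalade ESV",
                       "Escalade EXT", "SRX", "STS"])

buick_models = store.get("Buick")
both = store.multi_get("Buick", "Cadillac")
store.delete("Buick")
```

Note that the store treats each list of models as a single opaque value; it is the application that knows the value is a list of model names.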



When using a key-value store, ensuring that the values can be accessed means that the key must be unique. To associate multiple values with a single key (such as the list of car models in the example in Table 1), the developer must consider the representations of those values and how they are to be linked to the key.

Key-value stores are essentially long, “thin” tables (in that there are not many columns associated with each row), and they can be indexed by key value to speed data queries. The table’s rows can be sorted by the key value to simplify finding the key during a query. A query essentially comprises two steps: the first is to calculate the unique key, and the second is to use that key as an index into the table. Because the key must be calculated before any information about the entity can be accessed, it is difficult to execute general SQL-style queries such as “what are the most popular models of cars based on sales?” These kinds of questions would typically be answered using code, as opposed to a query engine.
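
To illustrate what answering a question “using code” can look like, the sketch below scans every entry to find the automobile make with the most models. It uses the Table 1 data held in a plain Python dictionary as a stand-in for a key-value store. The question and the function name `make_with_most_models` are chosen only for illustration; the sales question mentioned above would need sales data that the example table does not contain.

```python
from typing import Dict, List

# Table 1 data, with a dict standing in for the key-value store.
store: Dict[str, List[str]] = {
    "BMW": ["1-Series", "3-Series", "5-Series", "5-Series GT", "7-Series",
            "X3", "X5", "X6", "Z4"],
    "Buick": ["Enclave", "LaCrosse", "Lucerne", "Regal"],
    "Cadillac": ["CTS", "DTS", "Escalade", "Escalade ESV", "Escalade EXT",
                 "SRX", "STS"],
}


def make_with_most_models(kv: Dict[str, List[str]]) -> str:
    # No query engine is involved: the application visits every key
    # and interprets the stored values itself.
    return max(kv, key=lambda make: len(kv[make]))


# A direct lookup is a single step once the key is known.
cadillac_models = store["Cadillac"]

# An aggregate question requires a scan over all entries.
largest = make_with_most_models(store)
```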

While key-value pairs are very useful for both storing the results of analytical algorithms (such as the number of times specific phrases occur within massive numbers of documents) and for producing those results for reports, the model does pose some potential drawbacks. One weakness is that the model will not inherently provide any kind of traditional database capabilities (such as atomicity of transactions, or consistency when multiple transactions are executed simultaneously). Those capabilities must be provided by the application itself.
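
As a rough sketch of both points, the example below stores phrase counts as key-value pairs and has the application supply its own atomicity with a lock, since the store (here just a dictionary) offers none. The sample documents, the phrase list, and the function name `record_phrases` are invented for illustration.

```python
import threading
from typing import Dict, Iterable

counts: Dict[str, int] = {}   # phrase -> number of occurrences
lock = threading.Lock()       # the application, not the store, supplies atomicity


def record_phrases(document: str, phrases: Iterable[str]) -> None:
    text = document.lower()
    for phrase in phrases:
        occurrences = text.count(phrase.lower())
        if occurrences:
            # The read-modify-write must happen as one unit when
            # several threads update the same key.
            with lock:
                counts[phrase] = counts.get(phrase, 0) + occurrences


documents = [
    "Big data platforms rely on scalable storage.",
    "Scalable storage and parallel computing support big data.",
]
phrases = ["big data", "scalable storage", "parallel computing"]

threads = [threading.Thread(target=record_phrases, args=(d, phrases))
           for d in documents]
for t in threads:
    t.start()
for t in threads:
    t.join()
```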

Another potential weakness: as the volume of data increases, maintaining unique values as keys may become more difficult. Addressing this issue requires some added complexity in generating character strings that will remain unique among an extremely large set of keys. For example, a global company may attempt to manage data associated with millions of customers, many of whom share the same or similar names. Duplication in the set of names means that the name alone is insufficient to differentiate entities. The upshot is that additional data attributes will need to be added to the composed character string used to generate a unique key.
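
One way to picture such a composed key is sketched below. The choice of attributes (country, birth date, and a customer number), the separator, and the function name `customer_key` are all assumptions made for illustration; a real system would choose attributes suited to its own data.

```python
from typing import Dict


def customer_key(name: str, country: str, birth_date: str,
                 customer_number: int) -> str:
    # Normalize the name so that trivial spacing and case differences
    # do not create distinct keys, then append further attributes so
    # that customers who share a name remain distinct.
    normalized = " ".join(name.lower().split())
    return f"{normalized}|{country.upper()}|{birth_date}|{customer_number:08d}"


customers: Dict[str, dict] = {}
customers[customer_key("John Smith", "us", "1970-01-15", 17)] = {"segment": "retail"}
customers[customer_key("John  Smith", "gb", "1982-06-30", 42)] = {"segment": "business"}
```

Both customers are named John Smith, yet each receives a distinct key because the added attributes differ.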

Document Stores
A document store is similar to a key-value store in that stored objects are associated with (and therefore accessed via) character-string keys. The difference is that the values being stored, which are referred to as “documents,” provide some structure and encoding of the managed data. There are several common, standard encodings, including XML (Extensible Markup Language), JSON (JavaScript Object Notation), and BSON (a binary encoding of JSON objects). Aside from these standard approaches to packaging data, other means of linearizing the data values associated with a data record or object for the purposes of storage may be employed.

Figure 1 shows examples of data values collected as a “document” representing the names of specific retail stores. Note that while the three examples all represent locations, the models used to represent them differ. The document representation embeds the structure of the model, allowing the meanings of the document values to be inferred by the application.

Figure 1: Example of document store.

One key distinction between a key-value store and a document store is that the latter embeds attribute metadata associated with the stored content, which essentially provides a way to query the data based on its contents. For example, using the example in Figure 1, one could search for all documents in which “MallLocation” is “Wheaton Mall.” The search would deliver a result set containing all documents associated with any “Retail Store” in that particular shopping mall.
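
A content-based search of this kind might look like the sketch below. Only the field name “MallLocation” and the value “Wheaton Mall” come from the text; the keys, the other field names, and the function `find_documents` are hypothetical, and a real document store would typically use indexes rather than decode every document.

```python
import json
from typing import Any, Dict, List

# Hypothetical documents, each stored as a JSON-encoded string under a key.
# The three documents deliberately do not share one structure.
documents: Dict[str, str] = {
    "store:1": json.dumps({"StoreName": "Retail Store A",
                           "MallLocation": "Wheaton Mall"}),
    "store:2": json.dumps({"StoreName": "Retail Store B",
                           "Street": "1 Main Street", "City": "Anytown"}),
    "store:3": json.dumps({"StoreName": "Retail Store C",
                           "MallLocation": "Wheaton Mall", "Level": 2}),
}


def find_documents(docs: Dict[str, str], field: str, value: Any) -> List[str]:
    # Decode each document and match on its embedded attribute name.
    # Documents that lack the field are skipped, since no common
    # schema is enforced.
    matches = []
    for key, encoded in docs.items():
        doc = json.loads(encoded)
        if doc.get(field) == value:
            matches.append(key)
    return matches


in_wheaton = find_documents(documents, "MallLocation", "Wheaton Mall")
```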

Tabular Stores
Tabular, or table-based, stores are largely descended from Google’s original BigTable design to manage structured data. Hadoop’s HBase model is an example of a NoSQL data management system that evolved from BigTable. (For background on the BigTable design, see the BigTable paper on Google’s research website.)

The tabular model allows sparse data to be stored in a three-dimensional table that is indexed by a row key (used in much the same way as the key in key-value and document stores), a column key that indicates the specific attribute for which a data value is stored, and a timestamp that may refer to the time at which the row’s column value was stored.

As an example, various attributes of a Web page can be associated with the Web page’s URL: the HTML content of the page, the URLs of other Web pages that link to it, and the author of the content. Columns in a BigTable model are grouped together as “families,” and the timestamps enable management of multiple versions of an object. The timestamp can be used to maintain history: each time the content changes, new column affiliations can be created with the timestamp of when the content was downloaded.
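
The three-dimensional indexing can be sketched with nested dictionaries, as below. This is a toy model under stated assumptions, not HBase’s or BigTable’s API: the function names `put_cell` and `get_cell`, the “family:qualifier” column names, the integer timestamps, and the example URL are all illustrative.

```python
from typing import Dict, List, Optional, Tuple

# row key -> column key -> list of (timestamp, value), oldest first.
Table = Dict[str, Dict[str, List[Tuple[int, str]]]]


def put_cell(table: Table, row: str, column: str,
             timestamp: int, value: str) -> None:
    # Only cells that are actually written occupy space, which is
    # what makes the table sparse.
    versions = table.setdefault(row, {}).setdefault(column, [])
    versions.append((timestamp, value))
    versions.sort(key=lambda tv: tv[0])


def get_cell(table: Table, row: str, column: str,
             as_of: Optional[int] = None) -> Optional[str]:
    # Returns the newest value, or the newest value written at or
    # before `as_of` when a timestamp is given.
    versions = table.get(row, {}).get(column, [])
    for timestamp, value in reversed(versions):
        if as_of is None or timestamp <= as_of:
            return value
    return None


table: Table = {}
put_cell(table, "http://example.com/", "contents:html", 100, "<html>v1</html>")
put_cell(table, "http://example.com/", "contents:html", 200, "<html>v2</html>")
put_cell(table, "http://example.com/", "meta:author", 100, "A. Writer")

latest = get_cell(table, "http://example.com/", "contents:html")
earlier = get_cell(table, "http://example.com/", "contents:html", as_of=150)
```

Keeping every version under its timestamp is what lets the history of the page content be read back later.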

Object Data Stores
Object data stores are essentially a hybrid approach to data storage and management; in some ways, object data stores and object databases seem to bridge the worlds of schema-less data management and the traditional relational models. Approaches to object databases can be similar to document stores, except that while document stores explicitly serialize the object so the data values are stored as strings, object databases maintain the object structures. That is because they are bound to object-oriented programming languages such as C++, Objective-C, Java, and Smalltalk.

As opposed to some of the other NoSQL models, object database management systems are more likely to provide traditional ACID compliance (that is, Atomicity, Consistency, Isolation, and Durability), characteristics that are bound to database reliability. Yet this is one of the few similarities to a traditional relational database, and it is important to remember that object databases are not relational databases and are not queried using SQL.

Considerations for Implementing NoSQL
The decision to use a NoSQL data store instead of a relational model must be aligned with business users’ expectations. The key question: How will the performance of a NoSQL data store compare to their experiences using relational models?

As should be apparent, many NoSQL data management environments are engineered for two key criteria:

  • Fast accessibility, whether that means inserting data into the model or pulling it out via some query or access method, and
  • Scalability for volume, so as to support the accumulation and management of massive amounts of data.



Both of these criteria are addressed through distribution and parallelization, and the NoSQL styles described above are amenable to extensibility, scalability, and distribution. Moreover, these characteristic features dovetail with programming models like MapReduce that effectively manage the creation and running of multiple parallel execution threads. The key is leveraging data distribution. Distributing a tabular data store or a key-value store allows many queries and data accesses to be performed simultaneously, especially when the hashing of the keys maps them to different data storage nodes. NoSQL methods are designed for high-performance computing for reporting and analysis, and smart data allocation strategies can enable linear performance scalability in relation to data volume.
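
The idea of hashing keys to storage nodes can be sketched as follows. The node names and the functions `node_for` and `group_by_node` are hypothetical, and the simple modulo scheme is only one possible allocation strategy; it reshuffles many keys when the number of nodes changes, which is why other schemes such as consistent hashing are also used.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical storage nodes


def node_for(key: str, nodes: List[str] = NODES) -> str:
    # Use a stable hash rather than Python's built-in hash(), which is
    # salted per process, so every client maps a key to the same node.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def group_by_node(keys: List[str]) -> Dict[str, List[str]]:
    # Keys that land on different nodes can be fetched in parallel.
    placement: Dict[str, List[str]] = defaultdict(list)
    for key in keys:
        placement[node_for(key)].append(key)
    return dict(placement)


placement = group_by_node(["BMW", "Buick", "Cadillac"])
```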

Many new companies have embraced the different NoSQL models and are bringing their customized versions to market. If you are interested in NoSQL, there is little risk in trying out the different approaches, and it may make sense to develop a simple “pilot” project model that can be deployed in different ways to explore the similarities and differences in terms of ease of use, space performance, and execution speed.

Yet while the performance behaviors for NoSQL data management systems are appealing, they will not completely replace a relational database management system. Choosing to use NoSQL is not necessarily an easy decision. One must weigh the business requirements as well as the skills needed to transition from a traditional approach to a NoSQL approach before committing to the technology.

David Loshin is the author of several books, including Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL and Graph, inspired by his articles at Data Informed. He is also author of the second edition of Business Intelligence—The Savvy Manager’s Guide. As president of Knowledge Integrity Inc., he consults with organizations in the areas of data governance, data quality, master data management and business intelligence. Email him at loshin@knowledge-integrity.com.
