It’s pretty easy to build a house if you have nails, wood, plaster, and insulation. It’s not so easy if you have nails, flour, yarn, and fertilizer.
In some ways, this is the challenge software developers face in today’s world of multi-source, multi-structured data. Innovating in an environment where data is exploding in its variety, size, and complexity is no simple task. Social data (comments, likes, shares, and posts), the Internet of Things (sensor data, motion detection, GPS), and advertising data (search, display, click-throughs) are just a few of the thousands—perhaps millions—of heterogeneous, dynamic, distributed, and ever-growing devices, sources, and formats that are driving the big data revolution.
It would be convenient to ignore the immense benefits of cross-utilizing all these disparate data sets. After all, like the raw materials needed to build a house, task-specific data is best suited for the job immediately at hand. Yet bringing together data from multiple sources can provide insights much more powerful than those from each source separately. For example, drawing data from satellites, city engineering data for buildings and roads, topographic information, user inputs, and so on makes accurate turn-by-turn navigation through maps applications possible. Similarly, e-commerce companies can leverage local customer, product, and store information as well as weather and location data to optimize inventory management and enable real-time, customer-specific offers.
Products and services leveraging multi-source, semi-structured data allow businesses to compete better and organizations to function more efficiently. They facilitate new business models and provide a deeper understanding of customers and constituents, problems, and phenomena. Most of all, they allow developers to innovate and uncover the possibilities of an interconnected world.
New Methods Needed
The challenges of analyzing this new amalgam of heterogeneous data are as complex and multi-layered as the data themselves. At its core, this is a software engineering problem: no one piece of software can do everything that developers will want to do. New methods are needed to overcome the size, bandwidth, and latency limitations of conventional relational database solutions.
The transmission, collection, and storage of semi-structured data in native formats are now possible, as is the ability to scale this infrastructure cost-effectively as volumes and formats grow, without upfront planning. But this brings us to the next fundamental challenge: how to enable today’s business analysts to analyze semi-structured data interactively, despite its complexity and diversity, in order to discover insights.
New Age of Analytics
The Internet is awash with hype around RDBMS vs. Hadoop vs. NoSQL DBMS, with no clarity about when one should be used over the other—and for what kinds of analytical workloads.
Hadoop, one of the best-known processing frameworks, has the power to process vast amounts of semi-structured as well as structured data. Its appeal lies in its versatility, its high aggregate bandwidth across clusters of commodity hardware, and its affordability (at least in its pure, open-source form). However, Hadoop was designed for batch processing in a programming language familiar only to developers—not interactive ad hoc querying using a declarative language like SQL.
For these reasons, there is growing interest in developing interactive SQL engines on top of the same Hadoop cluster. There are open-source projects that attempt to set up Hadoop as a queryable data warehouse, but these are just getting started. Their task is daunting—no less than trying to re-invent a database on top of Hadoop.
Such projects offer very limited SQL support (HiveQL) and typically lack SQL features such as subqueries and “group by” analytics. They rely on the Hive metastore, which requires defining table schemas up front for the semi-structured data attributes that you want to analyze, in order to allow an SQL-like language to manipulate this data. This is a self-defeating strategy: to explore and understand your multi-source data, you must first know it well enough to define its attributes and schema up front.
NoSQL databases like MongoDB have a built-in query framework to interrogate semi-structured data. But now you are burdening an operational database with the overhead of data access for longer-running analytical queries. This will cause conflicts as the data and usage grow. Additionally, Mongo’s query framework requires an understanding of how the data is physically laid out to avoid running into syntax, memory, and performance limitations on large data sets.
Things we take for granted in investigative analysis, such as joining data stored in two separate tables directly from within a query, or running queries with multiple values, conditions, or ranges not known up front, are simply not possible using Mongo’s native analytics capabilities.
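To make that kind of question concrete, here is a minimal Python sketch of an ad hoc join between two semi-structured collections, filtered by a range that is supplied only at query time. The sample documents, field names, and the helper `join_orders_with_customers` are hypothetical and exist only for illustration. This is not MongoDB's API, just the shape of query an analyst expects to be able to express.

```python
import json

# Hypothetical sample data: two separate collections of JSON documents.
customers = [json.loads(s) for s in (
    '{"id": 1, "name": "Ana", "city": "Lisbon"}',
    '{"id": 2, "name": "Raj", "city": "Pune"}',
)]
orders = [json.loads(s) for s in (
    '{"order_id": 10, "customer_id": 1, "total": 40.0}',
    '{"order_id": 11, "customer_id": 2, "total": 250.0}',
    '{"order_id": 12, "customer_id": 1, "total": 120.0}',
)]


def join_orders_with_customers(orders, customers, min_total, max_total):
    """Inner-join orders to customers, keeping order totals within a range."""
    by_id = {c["id"]: c for c in customers}  # hash index on the join key
    result = []
    for order in orders:
        customer = by_id.get(order.get("customer_id"))
        if customer is None:
            continue  # no matching customer: dropped, as in an inner join
        if min_total <= order.get("total", 0) <= max_total:
            result.append({
                "name": customer["name"],
                "order_id": order["order_id"],
                "total": order["total"],
            })
    return result


# The range is chosen at query time, not designed into the data layout.
rows = join_orders_with_customers(orders, customers, 100, 300)
```

In a SQL engine the same question is a single join with a `BETWEEN` predicate. The point of the sketch is that neither the join nor the range has to be anticipated when the data is stored.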
DWS for Semi-Structured Data
Ad hoc queries over multi-source, semi-structured data sets often require scanning and joining billions of records, and an advanced analytics platform for such workloads calls for a more sophisticated approach. In particular, SQL as a query language is a must, to support broad use and deliver insights to decision makers quickly.
The answer can be found in a new class of Data Warehouse Service (DWS) designed for fast, low-latency analytics. In these services, data is stored in its original format, with support for JSON, XML, and key-value pairs as native data types. This preserves the richness of the data and also removes the need for complex extract, transform, load (ETL) processing or any up-front modeling before analysis.
With this class of DWS, developers can create solutions that not only leverage multi-source data for current needs, but also support as-yet undiscovered solutions that involve data not yet leveraged. In fact, in the more high-performing DWS offerings, analysts can access and work with data directly, using their favorite SQL-based BI tool and user-defined functions. This is because such offerings speak ANSI SQL, and SQL:2011 online analytical processing (OLAP) operators work directly over JSON, allowing workers with knowledge of SQL-based BI tools, modeling techniques, and accompanying domain expertise to operate in the semi-structured world. “What if” questions take on a whole new dimension because data can be cross-referenced from literally dozens of sources to provide insights never before possible.
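As a rough illustration of SQL running directly over JSON, the sketch below uses Python's built-in `sqlite3` module as a stand-in. SQLite is not a DWS, and the table name `events`, the sample documents, and the user-defined function `doc_field` are all assumptions made for this example. It shows only the general idea: raw JSON is stored without a per-field schema, and an ordinary `GROUP BY` aggregate reads fields out of it at query time.

```python
import json
import sqlite3


def doc_field(doc, key):
    """Hypothetical UDF: return a top-level field of a JSON document, or NULL."""
    value = json.loads(doc).get(key)
    # SQLite stores scalars only, so nested values are returned as JSON text.
    return json.dumps(value) if isinstance(value, (dict, list)) else value


conn = sqlite3.connect(":memory:")
conn.create_function("doc_field", 2, doc_field)

# One text column holding raw JSON; no per-attribute schema is declared.
conn.execute("CREATE TABLE events (doc TEXT)")
conn.executemany(
    "INSERT INTO events (doc) VALUES (?)",
    [
        ('{"source": "sensor", "device": "a1", "reading": 20.5}',),
        ('{"source": "sensor", "device": "a1", "reading": 22.5}',),
        ('{"source": "clicks", "campaign": "spring", "reading": 3}',),
    ],
)

# Plain SQL aggregation over fields extracted from the JSON at query time.
summary = conn.execute(
    """
    SELECT doc_field(doc, 'source')       AS source,
           COUNT(*)                       AS n,
           AVG(doc_field(doc, 'reading')) AS avg_reading
    FROM events
    GROUP BY source
    ORDER BY source
    """
).fetchall()
```

A service with native JSON types would not need the helper function, but the analyst-facing query would look much the same.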
DWS offers multiple benefits. First, because it exists in the cloud, all the cost and manpower savings of a cloud-based service accrue, including a huge drop in hardware expenditures, lower administrative effort, and the ability to scale on commodity hardware as data and performance needs grow. DWS not only eliminates the need to create and maintain a separate analytics infrastructure alongside an organization’s operational systems, but also reduces impact on transactional stores such as MongoDB, thus helping the organization meet its dashboard and reporting SLAs.
Second, by storing JSON datasets directly, DWS takes away the need to write custom code to build, collect, and integrate data from various streaming and semi-structured sources. Fresh, detailed data is available immediately on load for quick analytics and discovery. Analytics is resilient to schema changes and provides the same application flexibility as NoSQL, while data is preserved in its native format—it doesn’t lose semantic value through conversion (for example, when converting JSON to text).
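A small Python sketch of what resilience to schema changes can mean in practice: the feed below is hypothetical, and its producers are assumed to have added a `region` field partway through. The helper `count_by` is likewise illustrative. Because the documents are kept in their native form, both old and new records can be counted without a migration step.

```python
import json
from collections import Counter

# Hypothetical feed: the third document carries a field the first two lack.
raw_events = [
    '{"user": "u1", "action": "click"}',
    '{"user": "u2", "action": "share"}',
    '{"user": "u3", "action": "click", "region": "EU"}',
]


def count_by(raw, field, default="unknown"):
    """Count documents by a field, tolerating documents that lack it."""
    return Counter(json.loads(line).get(field, default) for line in raw)


by_action = count_by(raw_events, "action")  # field present in every document
by_region = count_by(raw_events, "region")  # field added later by producers
```

Older documents simply fall into the `unknown` bucket, so a query written after the change still covers data loaded before it.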
Third, the ability to use SQL as the query language delivers powerful business advantages. It allows business users to do their own analysis, thereby freeing up developers from having to write Java or other code to get data for every question an analyst might have. Furthermore, the SQL capability of DWS lets organizations leverage the well-established ecosystem of SQL-based analytics tools.
The future of big data lies in the advent of tools and techniques that derive value out of multi-source, semi-structured data. Developers need to innovate with solutions that support ad hoc queries with multiple values, conditions, and ranges: the kind of intelligent questions made possible when “sum of all knowledge” systems have finally arrived. When speed and flexibility are paramount, developers must look to new solutions like advanced DWS to provide the answers. Developers assembling heterogeneous data from the new world of information are like an architect given the rare opportunity to create using a whole new set of raw materials: there’s no telling what amazing structures might arise in the future.
Harmeek Singh Bedi is CTO of BitYota. Harmeek brings 15+ years of experience building database technologies at Oracle and Informix/IBM. At Oracle, he was a lead architect in the server technology group that implemented partitioning, parallel execution, storage management, and SQL execution of the database server. Prior to BitYota, he spent 2+ years at Yahoo! working on Hadoop and big data problems. Harmeek holds 10+ patents in database technologies.