As your organization collects more data — and wants to make better use of the data it’s already acquiring — you now have options that stretch far beyond traditional databases. Pulling extracts from your data warehouse into business intelligence tools may no longer be enough; what you really want is to mine all the business data you can get. And at the point where that data is too big to analyze fast enough by conventional means, it’s time for a hard look at database alternatives.
Your choice of database technology will be dictated by your project’s parameters: how much data, how fast it arrives, the analysis you’ll do and whether you’ll use that analysis operationally or for making strategic decisions. You may want to mine business insight from retail point-of-sale transactions or a CRM (customer relationship management) program. You might be trying to speed up logistics, identify where inventory is in real time in your supply chain or track the wanderings of large numbers of customers on a website.
Which databases are appropriate for what sorts of projects? The following is a guide to help you evaluate your options.
Relational Databases

What it is: Traditional relational database system, with structured data organized in tables
Strengths: Uses standard SQL query language; well supported with tools and staff skills; guarantees consistency for results of queries
Use cases/examples: Conventional online transaction processing (OLTP)
Weaknesses: Requires complete table to be stored in file or memory; requires structured data and relational schema; canonical table must be replicated in order to perform analysis on parallel servers
If you’re dealing with structured, relatively simple transaction data, say, customer orders or point-of-sale records, and you want maximum flexibility in the questions you’ll be able to answer, the place to start is the familiar relational database. The advantages: You’re probably already using an RDBMS, so your IT staff includes database analysts who can create schemas and SQL queries. You just need to choose an enterprise-class database that will scale up to match your data analysis needs, such as Oracle, IBM InfoSphere Warehouse, Sybase or Teradata.
The downsides: The price for familiarity and flexibility in queries is that data is rigorously structured and is stored in a single canonical set of tables that have to be replicated if you use parallel processing to speed up performance. Every update has to be replicated to other copies, so if data changes a lot, there’s a lot of overhead, which negatively affects the response times for queries.
You can keep those tables in memory instead of on disk in order to speed up reads and writes, but at a higher cost. And as tables grow to millions of rows, response time for your analytical queries can quickly slide from near-real-time down to sometime-this-week.
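The relational approach described above can be sketched with Python's built-in sqlite3 module. The table and column names here are hypothetical, and amounts are stored as integer cents; the point is the flexibility of ad-hoc SQL over structured transaction data.

```python
import sqlite3

# In-memory relational database holding a hypothetical point-of-sale table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, store TEXT, product TEXT, cents INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (store, product, cents) VALUES (?, ?, ?)",
    [("NYC", "widget", 1999), ("NYC", "gadget", 500), ("LA", "widget", 1999)],
)

def revenue_by_store(conn):
    # An ad-hoc aggregate query -- the kind of flexibility an RDBMS buys you.
    cur = conn.execute(
        "SELECT store, SUM(cents) FROM orders GROUP BY store ORDER BY store"
    )
    return dict(cur.fetchall())

print(revenue_by_store(conn))  # {'LA': 1999, 'NYC': 2499}
```

Because the schema is declared up front, any question expressible in SQL can be asked later without restructuring the data, which is exactly the trade-off the article describes: rigor and flexibility in exchange for rigid structure.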
Columnar Databases

What it is: Relational database that stores data by column rather than by row
Strengths: All the strengths of a relational database, and optimized for better performance when analyzing data from a few columns at a time
Use cases/examples: Analyzing retail transactions by store, by state, by product, and other criteria
Weaknesses: Requires complete table to be stored in file or memory; requires structured data and relational schema
What if that data is highly structured but you know you’ll be doing a lot of analysis on data stored in just a few particular columns — say, analyzing retail transactions by store, by state or by product? Then a columnar database such as Sybase IQ, Microsoft SQL Server 2012, Teradata Columnar or Actian VectorWise makes more sense.
Despite the name, a columnar database is also a true relational database. The difference is under the covers. Instead of storing each data table as one row after another in a single file, a columnar database stores columns in separate files, so they can be indexed and searched separately, even on different servers in a cluster to improve performance.
As a result, columnar databases can be much faster than row-based relational databases for analysis that concentrates on only one or a few columns, and many staff skills can transfer directly to columnar. They also inherit many relational database drawbacks, including the staff effort and system processing time required to maintain canonical tables.
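The row-versus-column distinction can be illustrated with plain Python data structures. This is a toy sketch, not how any real columnar engine is implemented: the row store keeps whole records together, while the column store keeps one list per column, so an aggregate over a single column never touches the others.

```python
# Row store: each record kept whole, the way a row-based RDBMS lays out a table.
row_store = [
    {"store": "NYC", "product": "widget", "cents": 1999},
    {"store": "NYC", "product": "gadget", "cents": 500},
    {"store": "LA",  "product": "widget", "cents": 1999},
]

# Column store: the same table, but one list ("file") per column.
col_store = {
    "store":   ["NYC", "NYC", "LA"],
    "product": ["widget", "gadget", "widget"],
    "cents":   [1999, 500, 1999],
}

def total_row(rows):
    # Must read through every full record just to reach one field.
    return sum(r["cents"] for r in rows)

def total_col(cols):
    # Reads only the "cents" column; the other columns are never loaded.
    return sum(cols["cents"])

assert total_row(row_store) == total_col(col_store) == 4498
```

At real scale, each column's file can be compressed, indexed and scanned independently, even on separate servers, which is where the performance gain on column-focused analysis comes from.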
Object-Oriented Databases

What it is: Database that directly stores software objects (from object-oriented programming languages such as C++ and Java)
Strengths: Designed for querying complex data; much faster at storing and retrieving data in software objects
Use cases/examples: Storing and analyzing complex data from a customer loyalty program
Weaknesses: Some products only support a few programming languages; no standard query language; fewer tools and staff skills available
Suppose instead that your project requires analyzing data from a customer loyalty program that uses a highly complex customer model. Your developers might already know that their biggest bottleneck will involve packing and unpacking data from software objects so the data can be transferred to and from a database. That’s a good candidate for an object-oriented database such as Objectivity/DB, Versant Object Database and db4o, which store entire objects automatically.
Letting the database handle object storage by itself has the advantage of simplifying programming and thus reducing the chance for bugs, and it may improve performance. But object-oriented databases aren’t relational, so there’s none of the flexibility for ad-hoc queries that’s available with SQL. And dealing with data complexity becomes a programming problem, not a database management issue.
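The packing-and-unpacking bottleneck the article mentions can be made concrete with Python's pickle module. Pickle is only a rough stand-in for what an object database does, and the Customer class here is hypothetical; the point is that the whole object graph is stored and retrieved in one step, with no hand-written mapping between object fields and table columns.

```python
import pickle

class Customer:
    """Hypothetical loyalty-program object with nested structure."""
    def __init__(self, name, purchases):
        self.name = name
        self.purchases = purchases  # list of (product, points) pairs

# An ODBMS persists the object graph directly; pickle imitates the idea by
# serializing the entire object in one call instead of mapping each field
# to a table column and back (the usual source of bugs and overhead).
alice = Customer("Alice", [("widget", 120), ("gadget", 45)])
blob = pickle.dumps(alice)        # "store" the object whole
restored = pickle.loads(blob)     # retrieve it with its structure intact

assert restored.name == "Alice"
assert restored.purchases == [("widget", 120), ("gadget", 45)]
```

The convenience comes at the cost the article notes: with no relational tables underneath, there is no SQL-style ad-hoc querying over this data, only whatever traversals the program itself writes.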
NoSQL Databases

What it is: Non-relational database that stores data as transaction logs rather than as tables
Strengths: Faster and more compact when dealing with very large but sparse databases; supports unstructured or irregularly structured data; supports easily distributing analysis to parallel servers
Use cases/examples: Tracking all user actions on a website (not just purchase transactions)
Weaknesses: Immature technology; no standard query language or tools; does not guarantee consistency of query results
Suppose your project will deal with orders of magnitude more data that’s not nearly so structured — say, tracking inventory logistics in real time or analyzing how customers use your website. With conventional databases, that could require huge, multidimensional tables in which most of the cells would likely be empty, while others might have to hold multiple values at once.
That’s the kind of messy situation that NoSQL databases were created for. It’s a catch-all name for a class of non-relational databases that include Hadoop HBase, Hypertable, MongoDB and MarkLogic. With NoSQL, there’s no actual data table. Instead, the database consists of the list of all data writes; a query searches that list to find out what data was most recently written to a particular cell. (It’s as if you created an enormous table, logged every write, then threw away the table and just searched the log.)
There’s no need for a giant canonical table stored on disk or in memory. Individual cells are never written to a table, so “writes” are effectively instantaneous. And copies of the log-like database can be written to many servers within a cluster, so analysis can be split up in parallel. Compared to conventional data tables, NoSQL has a smaller footprint, performs much faster and is easier to parallelize.
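The log-instead-of-table idea described above can be sketched in a few lines of Python. This is a deliberately naive model, not how production NoSQL stores work internally (they index and compact their logs): a write is an append, and a read scans backward for the most recent write to a key.

```python
# Append-only write log, as the article describes: no table, just a list of writes.
log = []  # each entry: (key, value), newest appended last

def write(key, value):
    # A "write" is just an append -- effectively instantaneous.
    log.append((key, value))

def read(key):
    # Scan newest-to-oldest; the most recent write to the key wins.
    for k, v in reversed(log):
        if k == key:
            return v
    return None  # key was never written

# Hypothetical website-tracking cells, keyed by (user, attribute).
write(("user42", "last_page"), "/checkout")
write(("user42", "last_page"), "/receipt")  # newer write supersedes the old

assert read(("user42", "last_page")) == "/receipt"
```

Because the log is just an append-only sequence, copies of it can be shipped to many servers and scanned in parallel, which is the property that makes this design easy to distribute across a cluster.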
But as the name suggests, NoSQL offers no relational integrity or consistency guarantees, and currently no standard query language. It’s most suitable for specific, well-defined analysis on very large volumes of data, especially when results need to be delivered in real time.
There’s also no widespread skills support for NoSQL databases, and that could be a problem. However, there are software tools that support NoSQL work, such as Hadoop, which splits the processing across multiple servers in a cluster. And cloud-computing vendors are increasingly bundling those tools with their cloud server and storage offerings, so it’s not necessary to buy hardware in order to do NoSQL-based analysis.
Management Considerations to Keep in Mind
Whatever database you choose for a big-data project, you’ll probably need to acquire new staff skills and technology, so make sure those costs fit into the context of your business project. Unfamiliar technology needs proper vetting and a skills investment: your staff knows how to deal with conventional databases, but much of that knowledge won’t transfer directly. That means it’s a good idea to build up from small-scale experiments and prototypes, and to tap the experience of other enterprises.
And keep in mind that vendors are constantly working to address these limitations and add features to meet new big-data needs — but in doing so, they may also change the techniques your developers will need to use.
This kind of data analysis is still in its early days, and you’re likely to make mistakes. The less costly those are, the better. But the sooner you begin identifying the best database technology for your own analysis projects, the better off you’ll be.
Frank Hayes is a veteran technology journalist and freelance writer.