Over the past few years, the big data industry has focused on the promise of analyzing unprecedented volumes, variety, and velocity of data. While the early adopters were the world’s largest companies, research institutions, and government organizations, big data is now being embraced by companies of all sizes seeking insight into customer data to build brand loyalty and find new ways to monetize customer communities. To compete effectively on this new battleground, companies realize they must understand today’s data and move beyond the overnight batch analysis popularized by enterprise data warehouses. This is real-time analytics.
There are a number of new data management platforms available today for big data analytics, including variations of Hadoop and columnar databases. Columnar databases have had significant commercial success recently with the ability both to ingest large amounts of data quickly and to perform the complex read operations typical of analytics workloads. Hadoop has attracted interest from enterprise customers with the promise of storing vast amounts of unrelated data for analysis. A common trend across all of these platforms is a renewed focus on SQL query interfaces, which provide a powerful and well-understood way for developers to connect the dots across data without requiring extensive custom programming by data scientists. Non-relational databases, or “NoSQL,” are generally best at high-volume, simple transactions with very minimal analytic capabilities.
Both Hadoop and columnar databases are specialized big data platforms that provide modern alternatives to the enterprise data warehouse for historical batch analytics. With companies focused on immediate customer insight and developers embedding more analytics logic into their applications, a new generation of SQL databases is emerging to enable real-time analytic workloads directly on the live operational data. This is a potential industry game changer.
Why Is Real-Time Analytics Important?
It’s easiest to see the value of real-time analytics in a specific application like a modern e-commerce site. These sites have evolved greatly beyond basic product catalogs and order taking. They are much more customer specific, tracking an individual’s navigation, customizing real-time promotions and recommendations, building dynamic customer profiles, and integrating insights from social media activity—all with the goal of keeping customers engaged and maximizing per-customer profit.
This requires millisecond response times both in presenting the most relevant information to the customer and in capturing customer requests or activities. Most of these interactions are fairly simple read and write transactions, and the goal is consistently good performance across a large number of customers. Under the hood, the application is constantly connecting the dots, joining data from many fast-changing tables to create that customized experience. These operations are complex queries similar to data warehouse operations, except that they need to complete in seconds on live operational data.
In addition, marketing specialists constantly watch second-to-second trends across specific product areas or customer demographics. Seemingly simple questions, such as how many men between 30 and 40 years old bought golf clubs, whether they responded to a specific pricing promotion, and whether they were active on customer forums before buying, are generated through reports and analysis tools and often translate into very complex read queries. Historically, database administrators discouraged running those queries on live operational data because they could interfere with time-sensitive customer interactions. But in today’s hyper-competitive e-commerce market, understanding up-to-date customer data (as opposed to stale data) can make a significant difference in market share.
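To make the “seemingly simple question, very complex query” point concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The schema and data are entirely hypothetical and far smaller than any real e-commerce system; the point is that one marketing question turns into a multi-table join with several filters.

```python
import sqlite3

# Hypothetical toy schema for illustration only; a real e-commerce
# database would have far more tables, columns, and rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers  (id INTEGER PRIMARY KEY, gender TEXT, age INTEGER);
CREATE TABLE orders     (customer_id INTEGER, product_category TEXT, promo_code TEXT);
CREATE TABLE forum_posts(customer_id INTEGER, posted_at TEXT);
INSERT INTO customers  VALUES (1, 'M', 35), (2, 'M', 52), (3, 'F', 33);
INSERT INTO orders     VALUES (1, 'golf clubs', 'SPRING10'), (2, 'golf clubs', NULL);
INSERT INTO forum_posts VALUES (1, '2014-03-01');
""")

# "How many men aged 30-40 bought golf clubs, used the promotion,
# and were active on the forums?" -- one multi-join read query.
row = conn.execute("""
    SELECT COUNT(DISTINCT c.id)
    FROM customers c
    JOIN orders o      ON o.customer_id = c.id
    JOIN forum_posts f ON f.customer_id = c.id
    WHERE c.gender = 'M'
      AND c.age BETWEEN 30 AND 40
      AND o.product_category = 'golf clubs'
      AND o.promo_code IS NOT NULL
""").fetchone()
print(row[0])  # 1 matching customer in this toy dataset
```

On a live operational database, this kind of multi-join scan is exactly the workload administrators have historically pushed off to a separate warehouse.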
Larger e-commerce companies use either traditional data warehouses or newer approaches like Hadoop for more complex historical batch analytics to pinpoint customer trends and optimize inventory. That requires significant investment in data architecture, separate data repositories, and ETL (extract, transform, and load) connectors fed by many operational data sources such as inventory, billing, and point-of-sale data. These systems are expensive to maintain and introduce time lags in the data, so they are best reserved for high-complexity analysis supporting decisions with very high potential business impact.
New Distributed SQL Databases Enable Real-Time Analytics
The database industry is in the midst of a major change, from traditional scale-up databases built on ever-larger hardware platforms to scale-out databases built on cloud infrastructure. In particular, “NewSQL” databases support standard SQL queries with linear horizontal scale and a pay-as-you-grow approach for adding incremental compute and storage resources in modular building blocks. Generally, these systems promise full ACID compliance and target operational databases, which are system-of-record databases for customer data and transactions. All are based on new parallel-distributed database architectures.
Scale-out databases were first introduced almost 20 years ago in data warehouses as massively parallel processing engines optimized for the high-performance, read-only operations typical of analytics. The secret of these databases is parallel query planner technology that calculates the best way to execute complex queries across distributed resources. NewSQL databases are now combining the best lessons from these parallel query planners into their distributed architectures. The result is a single operational database that supports both massive transaction volume and real-time analytics on the same data, with virtually unlimited horizontal scale.
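The core idea behind a parallel query planner, scatter the work to the nodes that hold the data and gather the partial results, can be sketched in a few lines. This is a deliberately simplified toy: the “shards” and the revenue figures are invented, and real planners handle joins, re-shuffles, and cost models far beyond this.

```python
from collections import Counter

# Toy scatter-gather aggregation: each "node" holds a shard of an
# orders table; the planner pushes the aggregate down to each shard
# and a coordinator merges the partial results.
shards = [  # hypothetical (category, price) rows on three nodes
    [("golf clubs", 120.0), ("tennis", 80.0)],
    [("golf clubs", 95.0)],
    [("tennis", 60.0), ("golf clubs", 110.0)],
]

def partial_revenue(shard):
    """Runs 'on each node': SELECT category, SUM(price) ... GROUP BY category."""
    totals = Counter()
    for category, price in shard:
        totals[category] += price
    return totals

# Coordinator step: merge the per-shard partial aggregates.
merged = Counter()
for partials in map(partial_revenue, shards):
    merged.update(partials)

print(merged["golf clubs"])  # 325.0
```

Because each shard computes its partial sum independently, adding nodes adds both storage and aggregation throughput, which is the essence of the linear horizontal scale described above.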
A common concern about mixing transactional and analytics workloads on the same data is the potential for a single complex query to take over the system and create inconsistent performance for important customer transactions. That can be addressed by prioritizing transactional queries over analytics operations. For more intensive ongoing analytic workloads, a simple approach is to create a replicated database that serves all analytics queries and can double as a hot disaster-recovery system. While this requires an additional database instance, it avoids the complexity, expense, and time lags associated with ETL. These NewSQL databases also include significant automation, making them nearly self-managing.
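One common way to realize the replica approach is to route queries by workload type at the application or proxy layer. The sketch below is a minimal, assumption-laden illustration: the endpoint names and the keyword heuristic are invented for this example, and production routers classify queries far more robustly.

```python
# Minimal workload-aware routing sketch. Assumes two hypothetical
# endpoints: a primary for transactions and a read replica for
# analytics. The keyword heuristic below is deliberately crude.
ANALYTIC_KEYWORDS = ("GROUP BY", "JOIN", "SUM(", "COUNT(")

def choose_endpoint(sql: str) -> str:
    """Send writes and simple point reads to the primary; heavy reads to the replica."""
    upper = sql.upper().lstrip()
    if upper.startswith(("INSERT", "UPDATE", "DELETE")):
        return "primary"          # transactions must hit the system of record
    if any(kw in upper for kw in ANALYTIC_KEYWORDS):
        return "replica"          # complex reads go to the analytics replica
    return "primary"

print(choose_endpoint("UPDATE carts SET qty = 2 WHERE id = 7"))                   # primary
print(choose_endpoint("SELECT region, SUM(total) FROM orders GROUP BY region"))   # replica
```

Because the replica is kept current by replication rather than ETL, the analytics side sees near-live data without batch-window lag.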
Is In-Memory the Answer to Real-Time Analytics?
In-memory database methods are seeing a resurgence. Queries on datasets residing in memory cut I/O latencies from roughly 10 milliseconds per random access with hard disks to a few microseconds with RAM. With the declining cost and increasing capacity of RAM, there are many ways in-memory techniques can help with real-time analytics.
The most common way is to use pure in-memory databases to analyze streams of fast-changing data or events. Typically these datasets do not need to be persistent, so the stream is reasonably small and can be stored entirely in memory for very fast processing. The results of the analysis are then written to a separate persistent operational database.
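That pattern, aggregate in memory, persist only the result, can be sketched in a few lines. The click-stream events and table names here are hypothetical, and an in-memory SQLite connection stands in for the persistent operational store.

```python
import sqlite3
from collections import Counter

# Hypothetical bounded stream of (page, count) click events.
events = [("home", 1), ("product", 1), ("home", 1)]

# In-memory pass over the stream: fast, nothing persisted yet.
counts = Counter()
for page, n in events:
    counts[page] += n

# Write only the summarized result to the "persistent" operational
# store (an in-memory SQLite DB stands in for it in this sketch).
store = sqlite3.connect(":memory:")
store.execute("CREATE TABLE page_views (page TEXT PRIMARY KEY, views INTEGER)")
store.executemany("INSERT INTO page_views VALUES (?, ?)", counts.items())
store.commit()

print(store.execute("SELECT views FROM page_views WHERE page = 'home'").fetchone()[0])  # 2
```

The raw events are discarded once aggregated, which is why a purely in-memory stage is acceptable: only the summary needs durability.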
In-memory methods can also be used as an accelerator for traditional scale-up databases. While this can dramatically improve the performance of complex analytics, it comes at an extreme cost, since it requires high-end servers designed for many terabytes, or tens of terabytes, of RAM.
A smarter way is to use scale-out or distributed databases that exploit the distributed RAM in each server building block, gaining the economics of commodity infrastructure. However, even these configurations can cost more than 10 times as much as traditional disk for reasonable datasets of one terabyte or more.
The smart and cost-effective way to use in-memory techniques for real-time analytics is to combine distributed architectures with RAM and flash storage. Flash provides I/O latencies of around 50 microseconds, and costs for terabyte configurations are rapidly approaching those of high-performance hard disk drives. A smart database can cache the hottest, most frequently accessed data in RAM and keep “warm” operational data in flash, providing a substantial speed-up in query execution in all cases. Equally important, the data is persistently stored on flash, so it can survive day-to-day infrastructure failures, which is particularly critical for operational data.
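The hot/warm tiering idea can be illustrated with a small LRU sketch. Everything here is illustrative, not a real database API: an ordered dict with a size cap stands in for RAM, and a plain dict stands in for the persistent flash tier.

```python
from collections import OrderedDict

# Sketch of a two-tier store: a capped LRU cache stands in for RAM
# (hot data) and a dict stands in for persistent flash (warm data).
class TieredStore:
    def __init__(self, ram_slots: int):
        self.ram = OrderedDict()   # hot tier, bounded
        self.flash = {}            # warm tier, "persistent"
        self.ram_slots = ram_slots

    def put(self, key, value):
        self.flash[key] = value    # always persist to flash first
        self._promote(key, value)

    def get(self, key):
        if key in self.ram:        # RAM hit: fastest path
            self.ram.move_to_end(key)
            return self.ram[key]
        value = self.flash[key]    # flash hit: slower, but durable
        self._promote(key, value)  # frequently accessed data becomes hot
        return value

    def _promote(self, key, value):
        self.ram[key] = value
        self.ram.move_to_end(key)
        if len(self.ram) > self.ram_slots:
            self.ram.popitem(last=False)  # evict least-recently-used

store = TieredStore(ram_slots=2)
for k in ("a", "b", "c"):
    store.put(k, k.upper())
# "a" was evicted from RAM but survives in flash; reading it
# refetches and re-promotes it to the hot tier.
print(store.get("a"))  # A
```

The key property, mirrored from the text, is that eviction from RAM never loses data: every value also lives in the durable flash tier.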
The Big Picture
Demand for real-time analytics on live customer data will only increase as big data is adopted by mainstream companies of all sizes. There will also increasingly be a focus on simple approaches that don’t require highly-specialized programming or operational skills. Likely this will lead to collapsing of various specialized data management techniques into sophisticated platforms purpose-built for distributed cloud computing architectures.
The future data management architecture will include two types of consolidated distributed database platforms. One will be a SQL operational database that incorporates smart in-memory techniques for query speed-up and high speed stream processing and is increasingly sophisticated at executing both analytics and transactions at consistently high performance. The other will likely be a Hadoop platform for complex historical analytics on very large datasets with a SQL query interface, smart in-memory methods, and support for a wide variety of structured and unstructured data types. For large IT organizations, these new platforms will be most valuable for customer-facing and front-office applications, integrating with traditional operational databases and data warehouses for back office system-of-record data where needed.
The big data industry is shifting focus from dealing with vast amounts of data to real-time analytics. Engagement and monetization of customers based on up-to-date data is a key competitive advantage that promises real business benefits to companies of all sizes. Scale-out SQL databases are critical enablers for mainstream adoption of this trend, with the increasing desire to run real-time analytics on live operational data. Smart uses of in-memory methods together with flash storage promise high performance at commodity infrastructure economics. The result is a potential game changer in the data management landscape.
Robin Purohit is the CEO of Clustrix, maker of the ClustrixDB distributed SQL database. Previously, he was vice president and general manager of HP’s IT Management, Information Management and Application Security products division. He has also held senior executive positions at VERITAS and Mercury. He holds a Bachelor of Science degree in engineering and physics from the University of Waterloo. Follow him on Twitter: @RobinPurohit1.