The combination of simultaneous transactional and analytics workloads against the same data is a concept that is gaining traction among database vendors and industry analysts.
Gartner recently coined the term “HTAP,” meaning Hybrid Transaction/Analytical Processing.
HTAP is about spotting trends and being aware of leading indicators in order to take immediate action. For example, an online retailer could use HTAP to find out which products have been trending as best-sellers over the last hour. Seeing that snow shovels are selling well, the retailer could react by putting a snow shovel promotion on the home page to leverage the trend.
Much of big data analytics has focused on information discovery, essentially using Hadoop for batch-oriented storing and querying of large amounts of data, sometimes referred to as “data lakes.” Many data-driven businesses, however, need real-time discovery and analysis of transactional data so they can make meaningful changes immediately. Real-time discovery and analysis require repeatable analytics on very recent data, which rules out the traditional approach of offloading transactional data into a separate analytics database for offline analysis.
The Ideal HTAP Scenario
With this in mind, ideally a company would want a system that delivers uniform operational throughput even as an analytics workload is added to the mix. These characteristics are depicted in the chart below.
With a traditional SQL database, the two workloads would interfere with each other, degrading performance for both. To support HTAP, it would therefore be necessary to over-provision a traditional SQL database with hardware capable of handling the most extreme HTAP cases.
Why Traditional DBMS Fail to Deliver HTAP
Traditional client/server database management systems (DBMS) were not designed for demanding HTAP use cases. Their inability to adequately support HTAP can be attributed broadly to two factors: lock contention and an inability to scale (memory, I/O).
Conventional lock-based DBMS have trouble executing long-running transactions because readers block writers and writers block readers.
Analytical processing must simultaneously read large portions of the database, increasing the chances of a lock conflict. The natural consequence is that users experience noticeable performance drops when running analytics workloads on the same DBMS as operational workloads.
One solution to this problem is to use a database that supports multi-version concurrency control (MVCC). MVCC enables snapshot isolation, allowing a long-running reader (such as an analytics query) to operate on a historical snapshot of the database while newer versions of records are independently created or mutated by OLTP-style INSERT and UPDATE loads. However, even where analytics workloads can operate in parallel, the inability to scale becomes a factor. As the intensity of the analytics workloads increases, pressure on CPU, memory, network, and I/O resources will grow and begin to affect throughput for all workloads, which is unacceptable for business processes that depend on critical operational workloads.
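To make the snapshot-isolation idea concrete, here is a minimal in-memory sketch of how MVCC lets a long-running reader see a stable view of the data while writers keep appending newer versions. The class and method names are illustrative only, not any vendor's actual API or storage design.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal MVCC sketch: each key maps to a list of timestamped versions.
// A reader pins a snapshot timestamp and ignores versions created later,
// so writers never block it. Names are illustrative, not a real DBMS API.
public class MvccSketch {
    static final class Version {
        final long ts;      // logical commit timestamp of this version
        final String value;
        Version(long ts, String value) { this.ts = ts; this.value = value; }
    }

    private final Map<String, List<Version>> store = new HashMap<>();
    private long clock = 0;  // logical clock; incremented on every write

    // Writer: appends a new version instead of overwriting in place.
    public synchronized void put(String key, String value) {
        store.computeIfAbsent(key, k -> new ArrayList<>())
             .add(new Version(++clock, value));
    }

    // A reader captures the current clock as its snapshot timestamp.
    public synchronized long beginSnapshot() { return clock; }

    // Read as of a snapshot: newest version with ts <= snapshot, or null.
    public synchronized String get(String key, long snapshot) {
        List<Version> versions = store.get(key);
        if (versions == null) return null;
        for (int i = versions.size() - 1; i >= 0; i--) {
            if (versions.get(i).ts <= snapshot) return versions.get(i).value;
        }
        return null;
    }
}
```

An analytics query would call beginSnapshot() once and then read many keys at that timestamp; concurrent put() calls create newer versions that the snapshot simply never sees, so neither side waits on the other.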
To compound the challenge, optimizing an HTAP system also requires that users understand that cache line, paging, prefetch, and disk scan patterns are different for operational and analytics workloads. The traditional HTAP system cannot optimize for both at the same time, so thrashing occurs as the two patterns compete.
One response to these kinds of resource pressures is to run a traditional RDBMS on a very high-performance server configured with a large number of processing cores, huge memory, high-capability network, and I/O subsystem. This “scale-up” approach may provide additional headroom, but it is very expensive and ultimately limited to the capabilities of the largest server available.
In addition, scaling an HTAP system based on a traditional DBMS in cloud-based environments is yet another limiting factor, because these environments tend to be based on commodity hardware with fewer cores, less memory, and less-capable network and I/O. Inherently, the cloud is the antithesis of scale up. Rather, it’s about elasticity – scaling out and back in – providing on-demand, commodity resources at the lowest total cost of ownership.
To demonstrate HTAP capabilities, I tested a NewSQL database with the following three workloads, which were designed on the AWS cloud to simulate a real-world mix of operational and analytics workloads:
• Load workload: creates a database on which the operational and analytics queries can operate.
• Operational workloads: repeatedly query the database and update selected records. The database traffic for this workload is directed to a single Transaction Engine (TE) dedicated to this type of workload.
• Analytics workloads: query the database for recently updated records. The database traffic for this workload is directed to two TEs dedicated to this workload.
The “Load” workload is run once, resulting in a database containing a configurable number of rows in a single table. Multiple instances of the “Operational” workload are then started in parallel on a single host and their performance is measured as they repeatedly query and update the table. A short while later, multiple instances of the “Analytics” workload are stagger-started in parallel on two hosts, and their performance is monitored as they query for records recently updated by the operational workload. The analytics workloads are then halted while the operational workloads are allowed to continue, to examine how the operational workloads react and recover.
These workloads were implemented as a single Java application that can execute any of the three workloads (by specifying a command-line parameter).
The database contains a single table, “STATE_DATA,” which represents simulated update transactions that occur in any of the U.S. states. The table includes five columns, and indexes are created against two of the columns.
Each row records a simulated transaction with the following columns:
• NAME: a pseudo-randomly selected U.S. state in which the simulated transaction occurred
• ID: a pseudo-random ID for the transaction
• TS: a creation timestamp
• TSU: a last-updated timestamp
• VALUE: a pseudo-randomly generated numeric value
When performing the initial table load, the application, running in “Load” mode, inserts a pseudo-randomly selected U.S. state name into the NAME column, pseudo-randomly generated integers limited to a specified range into the ID and VALUE columns, and the current date/time into the TS and TSU columns. The number of rows to load is configurable.
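A minimal sketch of what the Load mode's row generation might look like follows. The state list is truncated and the ID/VALUE bounds are assumptions for illustration; this is not the benchmark's actual code.

```java
import java.sql.Timestamp;
import java.util.Random;

// Sketch of Load-mode row generation: pseudo-random state NAME, bounded
// integer ID and VALUE, and the current date/time for TS and TSU.
// The state list is truncated and the bounds are illustrative assumptions.
public class LoadRow {
    static final String[] STATES = { "Alabama", "Alaska", "Arizona", "Arkansas", "California" };

    final String name;
    final int id;
    final int value;
    final Timestamp ts;   // creation time
    final Timestamp tsu;  // last-updated time (initially equal to ts)

    LoadRow(Random rnd, int idRange, int valueRange) {
        this.name = STATES[rnd.nextInt(STATES.length)];
        this.id = rnd.nextInt(idRange);       // integer in [0, idRange)
        this.value = rnd.nextInt(valueRange); // integer in [0, valueRange)
        long now = System.currentTimeMillis();
        this.ts = new Timestamp(now);
        this.tsu = new Timestamp(now);
    }
}
```

The load loop would simply construct the configured number of these rows and INSERT each one.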
When running the operational workload, the code first performs a SELECT with a pseudo-randomly generated ID value specified in the WHERE clause. If the result set is not empty, an UPDATE is performed against all rows sharing this ID, setting the value in the TSU column to the current date/time. Consequently, values in the TSU column are constantly being updated, simulating a transactional workload.
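The select-then-update control flow can be sketched as follows. In the real workload these steps are a JDBC SELECT and UPDATE against the database; this in-memory stand-in just shows the logic, and the names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory sketch of one operational iteration: look up rows by a
// pseudo-random ID and, if any match, stamp their TSU with the current time.
public class OperationalStep {
    public static final class Row {
        public final int id;
        public long tsu;
        public Row(int id, long tsu) { this.id = id; this.tsu = tsu; }
    }

    // Returns the number of rows updated, mirroring what the workload logs.
    public static int selectThenUpdate(List<Row> table, int targetId, long now) {
        List<Row> matches = new ArrayList<>();
        for (Row r : table) {                 // SELECT ... WHERE id = ?
            if (r.id == targetId) matches.add(r);
        }
        if (matches.isEmpty()) return 0;      // empty result set: no UPDATE
        for (Row r : matches) r.tsu = now;    // UPDATE ... SET tsu = ? WHERE id = ?
        return matches.size();
    }
}
```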
When running in analytics mode, the code selects records in which the TSU column contains a value that has been updated within the last five minutes. This is expected to yield tens of thousands of results in a database of a million or more rows:
select * from state_data where tsu between ? and ?
where the query parameters are replaced at runtime to select for records that have been updated in the last five minutes.
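One way the two query parameters might be computed is sketched below; whether the application binds millisecond values or java.sql.Timestamp objects is an assumption, but the arithmetic of the five-minute window is the same either way.

```java
// Sketch of computing the analytics query's two parameters: the window's
// lower bound is "now minus five minutes" and the upper bound is "now",
// for binding to "select * from state_data where tsu between ? and ?".
public class AnalyticsWindow {
    static final long FIVE_MINUTES_MS = 5L * 60L * 1000L;

    // Returns { lowerBoundMillis, upperBoundMillis }.
    public static long[] window(long nowMillis) {
        return new long[] { nowMillis - FIVE_MINUTES_MS, nowMillis };
    }
}
```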
When in operational or analytics mode, the workload application emits a log statement on a configurable periodic basis. The statements look something like this for the operational workload:
OPERATIONAL 1414280660719 537.4
OPERATIONAL 1414280661719 516.8
OPERATIONAL 1414280662719 501.9
OPERATIONAL 1414280663719 490.8
OPERATIONAL 1414280664719 475.6
Or this, for the analytics workload:
ANALYTICS 1413679858461 78462.0
ANALYTICS 1413679859461 118087.5
ANALYTICS 1413679860461 157964.3
ANALYTICS 1413679861461 158312.3
The first column indicates the type of workload, to facilitate post-processing. The middle column is a time stamp, and the third column indicates the number of rows processed:
• Operational: number of rows updated in the last one-second interval
• Analytics: number of rows identified in the last n-second interval as having been updated in the last five minutes. For these tests, a row is emitted every five seconds, and this data is normalized to one second for the results reported below.
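The post-processing described above can be sketched as a small parser. The method name is illustrative; it assumes the log format shown earlier (type, timestamp, row count) and a five-second emission interval for analytics lines.

```java
// Sketch of post-processing one log line, e.g. "ANALYTICS 1413679858461 78462.0":
// split into workload type, timestamp, and row count, then normalize analytics
// counts (emitted every five seconds) to a per-second rate.
public class LogLineParser {
    public static double perSecondRate(String line) {
        String[] fields = line.trim().split("\\s+");
        String type = fields[0];
        double rows = Double.parseDouble(fields[2]);
        // Operational lines already report a one-second interval;
        // analytics lines report a five-second interval.
        return type.equals("ANALYTICS") ? rows / 5.0 : rows;
    }
}
```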
The following steps are taken to run the test:
1. The workload application is run in “Load” mode to load the desired number of rows. For this test, 5 million rows are loaded. This step is run once on the host configured to service operational workloads.
2. On the operational host, five instances of the workload application in “Operational” mode are started in parallel and allowed to run for several minutes.
3. On the two analytics hosts, 10 instances of the workload application in “Analytics” mode are stagger-started, as follows:
• On one of the two analytics hosts, an instance of the analytics workload is started.
• A short while later (about 30 seconds), a second instance of the analytics workload is started.
• This pattern is repeated until five analytics workloads are running on one analytics host.
• A short while later, a first instance of the analytics workload is started on the second analytics host. More analytics workloads are started, after short delays, until five are running on the second analytics host.
4. The full set of workloads – five operational and 10 analytics – is run in parallel for a while.
5. The analytics workloads are stopped, and the operational workloads are allowed to continue running on their own for a while longer.
The graph below shows the results of the test run.
The operational throughput remains relatively steady throughout the test, at about 300 updates per second for each of the operational workloads, for a total throughput of approximately 1,400 TPS.
When all 10 analytics workloads are running, the throughput of each is in the range of 80,000 to 170,000 updates found per five-second period, for a total throughput of about 1.4 million updates found per five-second period, normalized to about 280,000 per one-second period in the graph.
The CPU loading for the various hosts was monitored throughout the test. When all 10 analytics workloads are running, the two hosts running the analytics workloads and TEs are at about 90 percent CPU. The CPUs on the operational workload host and the two Storage Manager (SM) hosts are in the 10 percent to 25 percent range.
The key findings are as follows:
• The five operational workloads are unaffected by the analytics workloads, even as the number of those analytics workloads is increased to 10 parallel instances.
• The analytics workloads show a steady ramp in total throughput as the number of parallel instances is increased to a total of 10. The resulting total throughput indicates a linear horizontal scale-out model.
This example demonstrates the following:
• Multi-version concurrency control support: Analytical queries on recent transactional data won’t block ongoing OLTP workloads.
• Performance scale-out: dynamically scale out horizontally to accommodate additional simultaneous diverse workloads; burst out for short-lived analytics workloads and scale back down later when no longer needed.
• Isolated distributed caches: TEs avoid cache pollution because caches stay “hot” for specific assigned workloads.
It is this combination of characteristics that makes HTAP possible.
Dr. Michael Waclawiczek is Vice President, Marketing and Operations, at NuoDB. He directs the company’s marketing efforts across corporate communications, product marketing, online marketing, and lead generation. He also leads the company’s cross-functional initiatives in his role as vice president of operations.
Prior to his corporate experience, Michael was Assistant Professor at the Technical University of Vienna, Austria, where he also received his Master’s and Doctorate degrees in Mechanical Engineering.