For the past two years, numerous organizations in the big data community have been bragging about performance results based on the original TPC-DS 1.0 benchmark. Rather than formally running TPC-DS according to specification guidelines and publishing fully certified results, they cherry-picked those portions of the TPC-DS benchmark that made their particular brand of technology excel, ignoring the general use case TPC-DS tries to test. Many marketing publications used a subset of the schema and queries, executed the benchmark in a special way, and reported a metric that positioned their system in the best possible light. This kind of “benchmarketing” is not new to the industry, and it’s precisely what triggered the founding of the TPC 25 years ago.
Thus, the TPC was faced with a choice: either it could question the credibility of these highly customized claims and begin fining specific vendors for violating the TPC’s fair use policies, or it could invite the major players in the big data space to join the TPC and help design a new decision support benchmark.
The TPC chose the latter. This open call for participation – and the TPC’s Technology Conference Series on Performance Evaluation and Benchmarking (TPCTC) – brought a number of industry experts and academics together, establishing a common forum that ultimately resulted in the new TPC-DS 2.0 benchmark.
In a nutshell, TPC-DS 2.0 is the first industry standard benchmark for measuring the end-to-end performance of SQL-based big data systems. Building upon the well-studied TPC-DS 1.0 benchmark, Version 2.0 was specifically designed for SQL-based big data systems while retaining all key characteristics of a decision support benchmark. The richness and broad applicability of the schema, the ability to generate 100TB of realistic data on clustered systems, and the very large number of complex queries make TPC-DS 2.0 the top candidate for showcasing the performance of SQL-based big data solutions.
Following are the most important similarities and differences between Version 1.0 and Version 2.0, which explain big data vendor interest in crafting this benchmark and why we expect renewed interest in publishing formal results:
- TPC-DS 1.0 stresses the upper limits of hardware system performance in the areas of CPU utilization, memory utilization, I/O subsystem utilization, and the ability of the operating system and database software to perform the various complex functions important to decision support systems. Version 1.0 was designed to examine large volumes of data, efficiently execute queries of high complexity, support a large number of concurrent user sessions, and give answers to critical business questions.
- TPC-DS 2.0 increases the minimum raw database size to 1TB and allows benchmark publications of up to 100TB. Without doubt, big data solutions are applied to industry problems that deal with very large amounts of data, although the definition of “very large” continually evolves and is drastically different than it was even a few years ago. TPC-DS 1.0 allowed database sizes as small as 100GB – today a small fraction of a single disk. Maintaining that requirement would have allowed anybody to publish a result on what is now considered a very small data set and still call the system a big data solution. Because big data solutions are steadily improving and aiming to operate on even larger data sets in the future, the TPC is evaluating data set sizes of up to 1PB.
- Unlike traditional data warehouse systems, big data systems concentrate on the read-only aspect of decision support. Just-in-time data processing is traded for fast, read-only query processing, and many current big data implementations do not support true updates on their data sets. TPC-DS 1.0 essentially required vendors to implement efficient update mechanisms in order to perform trickle updates. Because big data systems currently cannot compete on this front, TPC-DS 2.0 eliminates update statements on dimension tables from the benchmark. The TPC believes that the remaining insert and delete operations on fact tables are sensible and adequate given the state of big data systems.
- Because they decouple the ownership of data from the processing of data, big data solutions are inherently not ACID (Atomicity, Consistency, Isolation, Durability) compliant. Most, however, are BASE (Basically Available, Soft state, Eventual consistency) compliant, meaning they guarantee some level of data accessibility through data mirroring. Instead of requiring ACID compliance, TPC-DS 2.0 requires the system under test (SUT) to continue executing queries and data-maintenance functions with full data access during and after a permanent, irrecoverable failure of any single durable medium. TPC-DS 2.0 also mandates, for the first time, a durability test on the SUT during the performance tests, making the durability test not only a functionality test but also a performance test.
- The metric in TPC-DS 2.0 has been changed from an arithmetic mean of the load, single-user, multi-user, and data-maintenance components to a geometric mean of the same components. This addresses concerns that the original metric could, for some implementations, be dominated by the data maintenance time: the additive nature of the original metric makes any slow component the dominant factor. Some big data players voiced concerns that their systems might be at a disadvantage if one portion of the metric dominated the benchmark. Hence, the TPC-DS 2.0 subcommittee changed the metric from a simple average-based metric to a product (geometric mean)-based metric.
- TPC-DS 1.0 required that all defined constraints (for example, primary key, foreign key, and “not null” constraints) also be enforced. TPC-DS 2.0 allows both enforced and non-enforced constraints. Many big data implementations cannot control access to the data they query, which makes enforcing constraints impossible. However, constraints provide valuable information about the data that query compilers can use to generate reasonable query plans, and this information, if used by an implementation, should not come for free. Hence, if unenforced constraints are defined, the system must still prove that the data set adheres to them by running validation queries as part of the load test.
- TPC-DS 2.0 separates the querying of data from data maintenance. Because overlapping queries with data maintenance requires ACID compliance, TPC-DS 2.0 reverted to a simpler model in which queries and data maintenance are strictly distinct. This fits many big data use cases, which concentrate on analyzing data and provide dedicated time windows for refreshing the data set.
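The effect of the metric change described above can be sketched with a toy calculation. The component timings below are purely hypothetical, and the real TPC-DS 2.0 metric as defined in the specification involves more than a plain mean of four numbers; this sketch only illustrates why a geometric mean resists domination by a single slow component while an arithmetic mean does not:

```python
from math import prod

# Hypothetical component timings in seconds (illustrative only, not from
# any published result): load, single-user run, multi-user run, and data
# maintenance, with data maintenance deliberately made the slow outlier.
timings = [3600, 900, 1200, 14400]

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    return prod(values) ** (1.0 / len(values))

# The additive metric is pulled strongly toward the 14400 s outlier;
# the multiplicative metric weighs all four components evenly.
print(f"arithmetic mean: {arithmetic_mean(timings):.0f} s")
print(f"geometric mean:  {geometric_mean(timings):.0f} s")

# Halving only the data-maintenance time cuts the arithmetic mean by
# roughly a third, but the geometric mean only by a factor of 2**0.25,
# so no single component can dominate the overall score.
faster = [3600, 900, 1200, 7200]
print(f"arithmetic mean, maintenance halved: {arithmetic_mean(faster):.0f} s")
print(f"geometric mean, maintenance halved:  {geometric_mean(faster):.0f} s")
```

In other words, under the old additive metric a vendor could improve its score dramatically by optimizing only the slowest phase, whereas the geometric mean rewards balanced performance across all four components.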
TPC-DS 2.0 has now become the TPC’s second big data benchmark. In August 2014, the TPC introduced TPCx-HS – the industry’s first standard for benchmarking big data systems, designed to assess a broad range of system topologies and implementation methodologies. TPCx-HS was also the TPC’s first “Express” class benchmark, publicly available via downloadable kit. Looking ahead, we already are working on a third big data benchmark, TPCx-BB, an Express class benchmark that is open for public review at http://www.tpc.org/tpcx-bb/default.asp.
If you’d like to provide feedback, we’d love to hear from you.
Meikel Poess is chairman of the TPC-DS 2.0 committee and principal developer at Oracle Corporation. He is a software developer with 14 years of experience in performance tuning in all phases of software development and sizing of database systems.
Raghunath Nambiar is the chairman of the TPCTC, a distinguished engineer at Cisco, and chief architect of big data and analytics solution engineering. His current focus areas include emerging technologies, data center solutions, and big data and analytics strategy.
Subscribe to Data Informed for the latest information and news on big data and analytics for the enterprise.