As many enterprises start focusing on gaining better business insights with Big Data, it would be careless for them to overlook the advances that Hadoop is making with SQL.
SQL-on-Hadoop brings the most popular language for data access to the most scalable database framework available. Because SQL queries traditionally have been used to retrieve large amounts of records from a database quickly and efficiently, it is only natural to combine this standard interactive language with Hadoop’s proven ability to scale to dozens of petabytes on commodity servers. As a result of these converging technologies, there is plenty of optimism that the promise of Hadoop can be realized. Previous challenges of not being able to get data in and out of Hadoop are now mitigated through the flexibility of SQL.
But access and flexibility are only part of the SQL-on-Hadoop marriage. We’ve seen firsthand how customers interact with real-time, transactional SQL-on-Hadoop databases. They may be familiar with a variety of databases, from traditional relational database management systems (RDBMS) such as MySQL and Oracle to a new generation of highly scalable NoSQL options such as Cassandra or MongoDB, but SQL-on-Hadoop solutions offer a best-of-both-worlds approach.
Despite these potential benefits, it’s important to tread carefully because not all SQL-on-Hadoop solutions are created equal. Choosing the right one is a critical decision that can have a long-term impact on a company’s application infrastructure. In working closely with enterprises to solve their Big Data challenges, we have identified the top five issues enterprises need to be mindful of when choosing a SQL-on-Hadoop solution:
1. Supporting real-time applications. This includes real-time operational analytics and traditional operational applications such as web, mobile, and social applications as well as enterprise software. Like many companies, one of our customers had real-time applications that required queries to respond in milliseconds to seconds. While this demand can be handled by a traditional RDBMS, the client also faced growing data volume, which was making its Oracle databases expensive to maintain and scale. When real-time support cannot be compromised, companies must either scale up at great cost or try to re-create functionality while scaling out. In this instance, our SQL-on-Hadoop database allowed them to execute in real time with an almost 10x price-performance improvement.
2. Working with up-to-the-second data. This means real-time queries on real-time data. Some solutions claim to be real-time because they can do real-time ad hoc queries, but it is not real time if the data is from yesterday’s ETL (Extract, Transform, Load). For example, an e-commerce company we worked with evaluated many SQL-on-Hadoop solutions but found that many of them lacked the ability to update data in real time. This was a critical requirement as they needed to analyze real-time order, pricing, and inventory information to trigger real-time discounts and inventory replenishment orders. While it may not be mission-critical for all applications, up-to-the-second data streams can enable companies to derive maximum business value from their SQL-on-Hadoop investment.
3. Maintaining data integrity. Database transactions are required to reliably perform real-time updates without data loss or corruption. They are a hallmark of traditional RDBMS solutions, but we have heard of many enterprises that made the switch to NoSQL solutions and missed the reliability and integrity of an RDBMS. Working with a large cable TV provider, we discovered that transactions are important even in analytics, as data and secondary indices need to be updated together to ensure consistency. For its operational analytics applications, this customer found that it could not reliably stream updates in its SQL-on-Hadoop database without ACID (atomicity, consistency, isolation, durability) transactions.
4. Preserving SQL support. Many companies have made large investments in SQL over the years. It’s a proven language with published standards like ANSI SQL. This has led to many companies trying to retain standard SQL in their databases, causing them to forgo the NoSQL movement. However, even in some SQL-on-Hadoop solutions, the SQL provided is a limited, unusual variant that requires retraining and partially rewriting applications. One of our customers in the advertising technology space switched from Hive because its limited variant of SQL, known as Hive Query Language (HQL), could not support the full range of ad hoc queries that the company’s analysts required. More and more SQL-on-Hadoop vendors are moving to full SQL support, so it’s important to check SQL coverage when making a decision.
5. Supporting concurrent database updates. Many operational and analytical applications receive data from multiple sources simultaneously. However, not all SQL-on-Hadoop solutions can support concurrent database updates. This not only can interfere with the recency of the data, but also can lock up the database. One of our customers evaluated an analytic database that provided transactions, but any update or insert would lock the entire table. This meant that a table could support only one update at a time and made it impractical to do significant updates more than a few times a day. For applications with many streaming data sources (such as a website or a sensor array), a reduced frequency of updates can greatly hinder the value they can create for users.
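The data-integrity point in item 3, that a row and its secondary index must change together or not at all, can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module purely as a stand-in for a transactional SQL database; it is not a SQL-on-Hadoop product, and the `subscribers` and `region_index` tables, the `move_subscriber` helper, and the simulated failure are all hypothetical names chosen for illustration.

```python
import sqlite3


def move_subscriber(conn, subscriber_id, new_region, fail_midway=False):
    """Update a row and its hand-maintained index entry as one atomic unit."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE subscribers SET region = ? WHERE id = ?",
                (new_region, subscriber_id),
            )
            if fail_midway:
                # Hypothetical failure between the two writes.
                raise RuntimeError("simulated crash between the two writes")
            conn.execute(
                "UPDATE region_index SET region = ? WHERE subscriber_id = ?",
                (new_region, subscriber_id),
            )
    except RuntimeError:
        pass  # the transaction has already been rolled back


def snapshot(conn, subscriber_id):
    """Return (region in base table, region in index table)."""
    row = conn.execute(
        "SELECT region FROM subscribers WHERE id = ?", (subscriber_id,)
    ).fetchone()
    idx = conn.execute(
        "SELECT region FROM region_index WHERE subscriber_id = ?",
        (subscriber_id,),
    ).fetchone()
    return row[0], idx[0]


# Illustrative schema: a base table plus a separate table acting as an index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscribers (id INTEGER PRIMARY KEY, region TEXT)")
conn.execute(
    "CREATE TABLE region_index (subscriber_id INTEGER PRIMARY KEY, region TEXT)"
)
conn.execute("INSERT INTO subscribers VALUES (1, 'east')")
conn.execute("INSERT INTO region_index VALUES (1, 'east')")
conn.commit()

# A failure between the two writes leaves both tables unchanged.
move_subscriber(conn, 1, "west", fail_midway=True)
print(snapshot(conn, 1))

# A successful transaction changes both tables together.
move_subscriber(conn, 1, "west")
print(snapshot(conn, 1))
```

Without the enclosing transaction, the simulated failure would leave the base table reading `west` while the index still read `east`, which is the kind of inconsistency the cable TV example describes.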
SQL-on-Hadoop applications will play a large role in fueling the growth of the Big Data market. According to IDC, the Big Data industry will surge at a 27 percent compound annual growth rate and reach $32.4 billion in 2017. This increase is indicative of the universal demand to realize the promise of Big Data by becoming real-time, data-driven businesses. Companies able to leverage emerging solutions like SQL-on-Hadoop databases stand to gain the biggest benefits: better customer insight, improved operational efficiencies, and smarter business decisions. Companies that see SQL-on-Hadoop solutions as a critical component of their long-term data management infrastructure should ask the right questions to ensure that their chosen solution can adequately address all of their application needs, including real-time operational and analytical applications.
Monte Zweben is co-founder and CEO of Splice Machine. He founded Red Pepper Software and Blue Martini Software, and is currently on the Board of Directors of Rocket Fuel. He holds a B.S. in Computer Science/Management from Carnegie Mellon University and an M.S. in Computer Science from Stanford University.