Kudu and the Ongoing Evolution of Hadoop

By Rajan Chandras  |  January 5, 2016

Product trajectories and impact are notoriously difficult to predict, but two recent Hadoop market moves – simultaneously aggressive and reconciliatory – may be an indication of a seismic shift at Cloudera toward more openness.

In September, Cloudera released a new product (“project,” in open-source terminology) called Kudu, which provides updateable storage – a novelty in the straitlaced sequential, read-only world of Hadoop. Further, Cloudera has offered Kudu to the Apache Software Foundation for incubation as an Apache-sponsored open-source project. Kudu is not just another ingredient in the alphabet soup of the Hadoop ecosystem. It has the potential to be a game changer.

Then, more recently, Cloudera also turned over code for Impala, its popular interactive query engine, to the Apache Software Foundation.

To understand the significance of these events, let’s first take a look at Kudu.

A research paper from Cloudera shares intricate details of Kudu, and if you don’t have the time or patience – or the need – to dive into that depth, here is a good overview from Cloudera. Or just read on.

In a nutshell, Kudu is software that attempts to bring relational data management capabilities to Hadoop, managing its own storage and dispensing with HDFS altogether. This is a very loaded statement, so you may want to re-read it and mull it over for a minute.


Kudu is intended to bridge the gap between high-throughput sequential-access storage systems such as HDFS and low-latency, random-access systems such as HBase or Cassandra. Kudu would reduce some of the complexity of lambda architectures – for example, eliminating (or at least reducing) the need to develop parallel data flow streams for large-scale analytics and in-place updates on the same data set. Kudu does not purport to offer a “one size fits all” solution, but it should fit many use cases that require relatively fast read and the ability to perform in-place updates: a “happy medium” that can simplify development without sacrificing performance.

A Deeper Dive

Conventional database developers will find Kudu a simplified (or perhaps simplistic) implementation of standard relational databases such as Oracle or Microsoft SQL Server. A Kudu cluster consists of many tables; each table has columns with names and data types, and columns can be declared nullable. Each table also has a primary key – an ordered tuple, in database lingo – that enforces uniqueness among rows. The primary key provides the sole index to the table; Kudu does not allow secondary keys or unique indexes. Not yet, anyway. Kudu also does not offer multi-row transactions. Multiple rows may be batched together for efficiency, but each row operation remains a separate transaction. Reads are similarly spare: Kudu offers a scan operation to which the user may add any number of predicates (where clauses), but only two types of predicates are allowed: comparisons between a column and a constant value, and composite primary key ranges. The select clause of the query – the projection of the tuple – can include any number of columns. In fact, because Kudu is a columnar store, Cloudera recommends specifying only the needed columns in the select clause to improve performance.
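To make this data model concrete, here is a minimal in-memory sketch – in plain Python, not the real Kudu API – of the semantics described above: a primary key that enforces row uniqueness, per-row upserts in place of multi-row transactions, and scans restricted to column-versus-constant predicates plus a column projection. All class and method names are illustrative.

```python
import operator

# The comparison operators a predicate may use (column vs. constant)
OPS = {"=": operator.eq, "<": operator.lt, "<=": operator.le,
       ">": operator.gt, ">=": operator.ge}

class ToyKuduTable:
    def __init__(self, columns, key_columns):
        self.columns = columns          # e.g. ["host", "ts", "metric"]
        self.key_columns = key_columns  # ordered tuple forming the primary key
        self.rows = {}                  # primary-key tuple -> row dict

    def upsert(self, row):
        """Insert or update one row; each call is its own 'transaction'."""
        key = tuple(row[c] for c in self.key_columns)
        self.rows[key] = dict(row)      # same key overwrites in place

    def scan(self, predicates=(), projection=None):
        """predicates: (column, op, constant) triples -- the only predicate
        form allowed besides primary-key ranges."""
        projection = projection or self.columns
        out = []
        for row in self.rows.values():
            if all(OPS[op](row[col], const) for col, op, const in predicates):
                out.append({c: row[c] for c in projection})
        return out

t = ToyKuduTable(["host", "ts", "metric"], key_columns=["host", "ts"])
t.upsert({"host": "a", "ts": 1, "metric": 10})
t.upsert({"host": "a", "ts": 1, "metric": 99})   # in-place update, same key
t.upsert({"host": "b", "ts": 2, "metric": 20})
print(t.scan(predicates=[("host", "=", "a")], projection=["metric"]))
# -> [{'metric': 99}]
```

Note how the second upsert with the same key replaces the first row rather than failing – that, in miniature, is what makes updateable storage different from append-only HDFS files.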

Kudu partitions tables horizontally into data sets called tablets, and tablets are subdivided into rowsets that may reside in memory or on disk, an architecture that seems aimed at improving performance. Kudu uses a master node for metadata, which will need to be replicated for fault tolerance, and tablet servers that process the data.
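The partitioning idea can be sketched in a few lines: rows are assigned to a fixed number of tablets by hashing their primary key, so work on any one key touches only one tablet server. This is a simplification for illustration only – real Kudu supports both hash and range partitioning, and its hashing details differ.

```python
import hashlib

def tablet_for(key_tuple, num_tablets):
    """Map a primary-key tuple deterministically to a tablet index."""
    digest = hashlib.md5(repr(key_tuple).encode()).hexdigest()
    return int(digest, 16) % num_tablets

NUM_TABLETS = 4
tablets = {i: [] for i in range(NUM_TABLETS)}
for key in [("a", 1), ("a", 2), ("b", 1), ("c", 7)]:
    tablets[tablet_for(key, NUM_TABLETS)].append(key)

# Every key lands in exactly one tablet, and lookups are deterministic,
# so a client can route a read or write without scanning the cluster.
print({t: keys for t, keys in tablets.items() if keys})
```

The master node's metadata plays the role of this routing function in a real deployment, telling clients which tablet server holds which slice of the table.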

There are limitations on data concurrency and consistency. Kudu guarantees read-after-write consistency as long as both the write and the read come from the same client. It does not guarantee that a second client, writing and reading concurrently, will see the effects of the first client’s writes. Cloudera does not consider this much of a concern – the company’s experience indicates that internal consistency (i.e., within the same client) is sufficient for most cases. This makes Kudu less attractive for applications in which multiple users write simultaneously, but Kudu does offer a workaround, in the form of timestamps that may be propagated across clients (albeit with a performance penalty).
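The timestamp-propagation workaround can be sketched as follows: each write returns a logical timestamp, and a second client that must observe that write passes the timestamp along with its read. The names here are hypothetical; real Kudu uses a hybrid logical clock internally.

```python
class ToyServer:
    def __init__(self):
        self.clock = 0   # logical clock, bumped on every write
        self.data = {}   # key -> (value, write_timestamp)

    def write(self, key, value):
        self.clock += 1
        self.data[key] = (value, self.clock)
        return self.clock            # client A hands this token to client B

    def read(self, key, at_least=0):
        value, ts = self.data.get(key, (None, 0))
        # A read carrying a propagated timestamp must reflect that write;
        # a real server would wait for replication to catch up instead.
        assert ts >= at_least, "snapshot too old for propagated timestamp"
        return value

server = ToyServer()
ts = server.write("row1", "v1")           # client A writes, gets timestamp
print(server.read("row1", at_least=ts))   # client B reads with A's timestamp
# -> v1
```

Without the propagated timestamp, client B has no guarantee its read reflects client A’s write – which is exactly the cross-client gap described above.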

Kudu offers application programming interfaces (APIs) in Java and C++, with “experimental support” for Python.

I did not find any information on whether Kudu integrates with Sentry (for security) or YARN (for resource management), but time will tell. If Kudu is to have a future, however, it will need to integrate fully with these components.

If your eyes haven’t glazed over from all this technospeak, here comes the interesting part: Kudu is deeply integrated with Cloudera Impala and, in fact, does no SQL parsing on its own; SQL queries to Kudu must pass through Impala. This is a strategic masterstroke: in one fell swoop, Cloudera has provided a huge impetus for Impala and an instant value proposition for Kudu. Kudu also integrates with MapReduce and Apache Spark.

Measurements taken by Cloudera indicate that Kudu’s performance is comparable to that of immutable storage formats such as Parquet. This is a start, but “comparable” isn’t good enough. Kudu will surely need to improve its performance – as measured by someone other than Cloudera – if it is to be adopted widely.

Updateable storage is one reason that Kudu is a potential game changer. The other reason, mentioned previously, is that Kudu dispenses with HDFS altogether, by managing its own storage directly on top of the operating system. In other words, Cloudera has a full-fledged alternative to the Hive/MapReduce/HDFS stack with Impala/Kudu, offering new capabilities, improved storage, improved performance, and a simplified architecture.

This is huge, and yet another indication of how the world of Hadoop continues to evolve not just incrementally, but disruptively.

Cloudera also has moved quickly to defuse charges that Kudu would lead to greater vendor lock-in – blasphemy in the open-source world – by offering both Kudu and Impala to the Apache Software Foundation.

Kudu is an exciting development and one that I would urge you to look at closely, although it comes up short in some ways. If Cloudera is looking to capture the “analytical Hadoop” market – which seems to be the case – it will need to move rapidly to improve on some of the shortcomings and demonstrate to customers the usability and stability of Kudu.

When it comes to the product name, though, I am not so sure. Although the impala and the kudu are both kinds of antelope – and the graphic for Kudu is remarkably similar to that for Impala – it seems that, when threatened, the kudu often will run away rather than fight. Let’s hope that this isn’t an omen for Cloudera.

Rajan Chandras is director of data architecture and strategy at a leading academic medical center in the Northeast. He is a prolific contributor to well-known industry publications and has presented at industry and research conferences. He writes for himself and not for or on behalf of his employer. You can reach him at rchandras@gmail.com.




One Comment

  1. Posted January 18, 2016 at 10:19 am

     MapR has had updatable storage since its release more than four years ago. You don’t have to wait for Kudu … you can have this feature right now.
