Big Data Warehousing: Leaving the Data Lake Behind

by Rajan Chandras   |   May 6, 2015 5:30 am   |   2 Comments

Sun Rising on the term Data Lake

Ever since Pentaho CTO James Dixon purportedly introduced the term “data lake” in late 2010, the idea seems only to have grown in stature. Where it was once but a technology artist’s concept of how big data might be imagined, the data lake now is poised to become not only accepted nomenclature, but also an ultimate goal of big data strategy, as in a nonchalant, “So what’s your data lake strategy?” from your friendly neighborhood technology vendor.

Everyone, it seems, has taken to the term data lake. Cloudera uses it. Hortonworks uses it. CapGemini and Pivotal have collaborated on a “business data lake” solution, and Informatica seems to have signed on to it. GE speaks of “angling in the (industrial) data lake.” Fortune speaks of a “data lake dream” as a place with data-centered architecture, where silos are minimized and applications are no longer islands, existing (in a rather mixed metaphor) “within the data cloud.” Gartner would like us to “beware of the data lake fallacy.” (Another linguistic snafu – Gartner is not calling the data lake a fallacy, as the wording suggests, but seems to have bought into the concept and is merely pointing out that the concept carries risks.) In a column on Data Informed last week, Darren Cunningham of SnapLogic wrote of the recent Strata + Hadoop World conference held in San Francisco, where he found “genuine excitement about the data lake as a way to extend the traditional data warehouse and, in some cases, replace it altogether.”

James Dixon used the term “data lake” to refer to something that was not pre-formatted, contrasting it with a store of bottled water. The implications seem clear: Just as you are not limited to getting water in a pre-defined format, a data lake can be used to fill a container of any form or size – a small bottle or a full tanker. It’s a clever analogy. Notice the implicit “late binding” to consumption format. From a sourcing perspective too, the analogy does hold some water (no pun intended): Just as a real lake can be fed from all kinds of sources, so can a data lake.
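For readers who think in code, the “late binding” idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample events, the field names, and the `read_with_schema` helper are hypothetical and do not come from Dixon or from any particular product. The point is only that the raw records are stored as they arrived, and each consumer decides on the shape when reading.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (no schema enforced on write).
RAW_EVENTS = [
    '{"user": "a", "amount": "12.50", "channel": "web"}',
    '{"user": "b", "amount": "3", "device": "mobile"}',
    '{"user": "a", "amount": "7.25"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a consumer-chosen schema at read time (late binding).

    `schema` maps a field name to a (converter, default) pair. Fields that
    are missing from a record fall back to the default; fields not named in
    the schema are simply ignored by this consumer.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


# Consumer 1, the "small bottle": total spend per user.
spend_schema = {"user": (str, None), "amount": (float, 0.0)}
spend_by_user = {}
for row in read_with_schema(RAW_EVENTS, spend_schema):
    spend_by_user[row["user"]] = spend_by_user.get(row["user"], 0.0) + row["amount"]

# Consumer 2: a different shape drawn from the same raw data.
channel_schema = {"channel": (str, "unknown")}
channels = [row["channel"] for row in read_with_schema(RAW_EVENTS, channel_schema)]

print(spend_by_user)
print(channels)
```

Neither consumer required the data to be reshaped when it was stored, which is the sense in which the format is bound late. A conventional warehouse load would instead settle on one structure before the data lands.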

But on deeper examination, the analogy begins to weaken – or, shall we say, the water begins to get a little muddy. The essential weakness of the analogy is that when we draw water from a lake, we have no control over which kind of water (and from what source) we get. Analysts, bloggers, commentators, and detractors found amusing ways to extend and exploit the analogy, with terms such as “data swamp” or “data cesspool.” An imaginative friend of mine recently referred to it as a “data landfill,” a neat twist on the original concept.

But, all in all, it remains an elegant, if simplistic, analogy. Five years ago, simplicity was necessary to introduce a relatively new and complex concept. Five years later, the analogy appears dated and inadequate. Much water has flowed under the bridge since then (pun intended). We now have a much broader and deeper understanding of big data and its nuances. Technologies to address big data have matured and improved significantly over the last few years, and Hadoop 2.0 is turning out to be much more than a marketing gimmick for competitive positioning. Use cases, no longer confined to research labs and brilliant deviants, are now radiating across the industry spectrum and demonstrating the business value of big data. Associated architectural patterns are emerging that demonstrate what works and what doesn’t, simultaneously helping educate us and point out directions for future Hadoop product development.

To continue to use the term “data lake” today is to fail to understand or acknowledge how far we have progressed since Dixon felt the need to simplify the vision of a complex and revolutionary new concept. It would be interesting to learn if Dixon continues to think privately that “data lake” is the best way to describe a big data vision, but I fear it’s too late to ask. Even if he were to think so, it would be unfair to expect him to admit the limitations of a term, attributed to him, that is now practically de rigueur when discussing big data.

It seems much too late to put the data lake genie back in the bottle, but it is possible to wade beyond the concept.

I prefer to think of the big data paradigm in terms of two simple but powerful phrases that go beyond buzzwords and marketing-speak: big data strategy, and big data warehousing (Big DW). “Big DW” is admittedly far less imaginative and catchy than “data lake,” but it captures all the potential as well as the complexity of big data in a familiar paradigm while placing no constraints or limitations – except to the dogmatic, who may tend to constrain their concept of data warehousing in terms of implementation architectures (e.g., star schemas and conformed dimensions). Hadoop offers a different vision and set of capabilities – and, of course, scale – for the warehousing of data, but the fundamental value proposition for businesses remains the same: a mechanism to aggregate and process data to derive insight into the current state and the future.

Big DW goes beyond conventional data warehousing and represents an important next-generation tactic in the endless corporate struggle of survival of the fittest. And with the increasing maturation of Hadoop, related technologies, and use cases, we are at a very exciting point in time.

But, alas, until someone comes up with a sexier alternative to “data lake” (one that is also more apt and contemporary), I fear we have no option but to continue to tread water.

Rajan Chandras is director of data architecture and strategy at a leading academic health center in the northeast. He is a prolific contributor to well-known industry publications and has presented at industry and research conferences. He writes for himself and not for or on behalf of his employer. And, to his chagrin, he finds himself continuing to use the term “data lake” on occasion.


  1. James Dixon
    Posted June 24, 2015 at 10:15 pm | Permalink

    Hi Rajan,

    You bring up some interesting points. Here are my thoughts.

    I created the data lake analogy to help describe a general solution that could be used to solve a set of similar big data problems. I discovered these problems by talking to a number of big data early adopters. Not all of the problems I heard about fitted this solution, but many of them did.

    Since that date, a few people have poked holes at the analogy, typically without providing anything better. Some of these criticisms were just the result of expecting too much of an analogy – and treating it as a design or architecture template. Some people complained about the name, but that’s easy to do. Cloud computing, big data, NoSQL – all of these are bad descriptions of the technologies they represent.

    I agree that we have definitely progressed in the big data space in the last five years. We have seen the use cases become more well defined. But even so, the data lake concept is still useful because many of the new use cases can be categorized as specific implementations of a data lake. Hortonworks’ “Big Data Refinery” is a data lake use case. Pentaho’s “Streamline Data Refinery” is a data lake use case. Most of IBM’s big data use cases are data lake ones, etc. This is helpful because if you can categorize a technology as a data lake one, then you know which kinds of problems it can be used to solve. So the simplicity and generalness of the analogy is part of its power and usefulness.

    To my mind other high-level use cases include CEP/real-time/streaming and big data warehouse. I prefer the term Big Data SQL to data warehouse, because many people in the big data space don’t really know what a data warehouse is or how to build one. They think if you put a lot of data into the system and use SQL to query and aggregate it you have a data warehouse, but you do not, you just have a big table. I consider a data warehouse to be a highly structured and highly cleansed data repository that contains data from every operational system in the organization. Your description above is of a large table or data mart, not a data warehouse. You may consider this dogma, but from my perspective we have a set of well-defined and understood terms that have been around for decades. These terms apply directly to the big data space, and to misuse and abuse them just causes confusion. I am not a fan of traditional data warehouse solutions and, as you point out, big data warehouse is something different. But if we’re going to build something different, let’s give it a different name. I prefer big data SQL because it is a more general term that does not constrain the problem set in the same way that big data warehouse does.

    Anyway big data warehousing does not fit the data lake analogy, so I consider it to be a separate use case.

    Is Data Lake the “best way” to describe a big data vision? I contend that for certain use cases, it’s still as apt and contemporary as it was. “Big data warehousing” is good too, but it describes a different set of use cases.

    James Dixon
    CTO, Pentaho

  2. Posted June 11, 2016 at 9:56 pm | Permalink

    I am not sure how BigDW is realised “while placing no constraints or limitations”. A DW is all about constraints! I think BigDW will contribute endlessly to confusion & a premature return to DW. I completely agree with Dixon on the difference between DW & DM & the lake.
