Ever since Pentaho CTO James Dixon purportedly introduced the term “data lake” in late 2010, the idea seems only to have grown in stature. Where it was once but a technology artist’s concept of how big data might be imagined, the data lake now is poised to become not only accepted nomenclature, but also an ultimate goal of big data strategy, as in a nonchalant, “So what’s your data lake strategy?” from your friendly neighborhood technology vendor.
Everyone, it seems, has taken to the term data lake. Cloudera uses it. Hortonworks uses it. CapGemini and Pivotal have collaborated on a “business data lake” solution, and Informatica seems to have signed on to it. GE speaks of “angling in the (industrial) data lake.” Fortune speaks of a “data lake dream” as a place with data-centered architecture, where silos are minimized and applications are no longer islands, existing (in a rather mixed metaphor) “within the data cloud.” Gartner would like us to “beware of the data lake fallacy.” (Another linguistic snafu – Gartner is not calling the data lake a fallacy, as the wording suggests, but seems to have bought into the concept and is merely pointing out its risks.) In a column on Data Informed last week, Darren Cunningham of SnapLogic wrote of the recent Strata + Hadoop World conference held in San Francisco, where he found “genuine excitement about the data lake as a way to extend the traditional data warehouse and, in some cases, replace it altogether.”
James Dixon used the term “data lake” to refer to something that was not pre-formatted, contrasting it with a store of bottled water. The implications seem clear: Just as you are not limited to getting water in a pre-defined format, a data lake can be used to fill a container of any form or size – a small bottle or a full tanker. It’s a clever analogy. Notice the implicit “late binding” to consumption format. From a sourcing perspective too, the analogy does hold some water (no pun intended): As a real lake can be sourced from all kinds of sources, so can a data lake.
But on deeper examination, the analogy begins to weaken – or, shall we say, the water begins to get a little muddy. The essential weakness of the analogy is that when we draw water from a lake, we have no control over which kind of water (and from what source) we get. Analysts, bloggers, commentators, and detractors found amusing ways to extend and exploit the analogy, with terms such as “data swamp” or “data cesspool.” An imaginative friend of mine recently referred to it as a “data landfill,” a neat twist on the original concept.
But, all in all, it remains an elegant, if simplistic, analogy. Five years ago, simplicity was necessary to introduce a relatively new and complex concept. Five years later, the analogy appears dated and inadequate. Much water has flowed under the bridge since then (pun intended). We now have a much broader and deeper understanding of big data and its nuances, technologies to address big data have matured and improved significantly over the last few years, and Hadoop 2.0 is turning out to be much more than a marketing gimmick for competitive positioning. Use cases, no longer confined to research labs and brilliant deviants, are now radiating across the industry spectrum, demonstrating the business value of big data. Associated architectural patterns are emerging that demonstrate what works and what doesn’t, simultaneously helping to educate us and to point out directions for future Hadoop product development.
To continue to use the term “data lake” today is to fail to understand or acknowledge how far we have progressed since Dixon felt the need to simplify the vision of a complex and revolutionary new concept. It would be interesting to learn if Dixon continues to think privately that “data lake” is the best way to describe a big data vision, but I fear it’s too late to ask. Even if he were to think so, it would be unfair to expect him to admit the limitations of a term, attributed to him, that is now practically de rigueur when discussing big data.
It seems much too late to put the data lake genie back in the bottle, but it is possible to wade beyond the concept.
I prefer to think of the big data paradigm in terms of two simple but powerful phrases that go beyond buzzwords and marketing-speak: big data strategy, and big data warehousing (Big DW). “Big DW” is admittedly far less imaginative and catchy than “data lake,” but it captures all the potential as well as the complexity of big data in a familiar paradigm while placing no constraints or limitations – except on the dogmatic, who may tend to constrain their concept of data warehousing in terms of implementation architectures (e.g., star schemas and conformed dimensions). Hadoop offers a different vision and set of capabilities – and, of course, scale – for the warehousing of data, but the fundamental value proposition for businesses remains the same: a mechanism to aggregate and process data to derive insight into the current state and the future.
Big DW goes beyond conventional data warehousing and represents an important next-generation tactic in the endless corporate struggle of survival of the fittest. And with the increasing maturation of Hadoop, related technologies, and use cases, we are at a very exciting point in time.
But, alas, until someone comes up with a sexier alternative to “data lake” (one that is also more apt and contemporary), I fear we have no option but to continue to tread water.
Rajan Chandras is director of data architecture and strategy at a leading academic health center in the northeast. He is a prolific contributor to well-known industry publications and has presented at industry and research conferences. He writes for himself and not for or on behalf of his employer. And, to his chagrin, he finds himself continuing to use the term “data lake” on occasion.