Conference season is in full swing in the world of data management and business intelligence, and it’s clear that when it comes to the infrastructure needed to support modern analytics, we are in a major transition. To put things in paleontology terms, with the emergence of Hadoop and its impact on traditional data warehousing, it’s as if we’ve gone from the Mesozoic to the Cenozoic Era, and people who have worked in the industry for some time are struggling with the aftermath of the tectonic shift.
A much-debated topic is the so-called data lake. The concept of an easily accessible raw data repository running on Hadoop is also called a data hub or data refinery, although critics call it nothing more than a data fallacy or, even worse, a data swamp. But where you stand (or swim) depends upon where you sit (or dive in). Here’s what I’ve seen and heard in the past few months.
Hadoop love is in the air. In February, I attended Strata + Hadoop World in Silicon Valley. A sold-out event, the exhibit hall was buzzing with developers, data scientists, and IT professionals. Newly minted companies along with legacy vendors trying to get their mojo back were giving out swag like it was 1999. The event featured more than 50 sponsors, over 250 sessions on topics ranging from the Hadoop basics to machine learning and real-time streaming with Spark, and even a message from President Obama in the keynote.
One of the recurring themes at the conference was the potential of the data lake as the new, more flexible strategy to deliver on the analytical and economic benefits of big data. As organizations move from departmental to production Hadoop deployments, there was genuine excitement about the data lake as a way to extend the traditional data warehouse and, in some cases, replace it altogether.
The traditionalists resist change. The following week, I attended The Data Warehouse Institute (TDWI) conference in Las Vegas, and the contrast was stark. Admittedly, TDWI is a pragmatic, hands-on type of event, but the vibe was a bit of a big data buzzkill.
What struck me was the general antipathy toward the developer-centric Hadoop crowd. The data lake concept met with great skepticism, and even frustration, among many of the people I spoke with, who saw it being cast as an alternative to traditional data warehousing methodologies. IT pros from mid-sized insurance companies were quietly discussing vintage data warehouse deployments. An analyst I met groused, “Hadoopies think they own the world. They’ll find out soon enough how hard this stuff is.”
And that about sums it up: New School big data Kool-Aid drinkers think Hadoop is the ultimate data management technology, while the Old Guard points to the market dominance of legacy solutions and the data governance, stewardship, and security lessons learned from past decades. But Hadoop isn’t just about replacing data warehouse technologies. Hadoop brings value by extending and working alongside those traditional systems, bringing flexibility and cost savings, along with greater business visibility and insight.
Making the Case for Hadoop
It’s wise to heed the warnings of the pragmatists and not throw the baby out with the bath (lake) water. As one industry analyst said to me recently, “People who did data warehousing badly will do things badly again.” Fair enough. Keep in mind that, just like a data warehouse, a data lake strategy is a lot more than just the technology choices. But is the data lake half full or half empty? Will Hadoop realize its potential, or is it more hype than reality?
I believe that, as the market moves from the early adopter techies and visionaries to the pragmatists and skeptics, Hadoopies will learn from the mistakes of their predecessors, and something better will emerge: something more flexible, accessible, and economical. Traditional data warehousing and data management methodologies are being re-imagined today.
Every enterprise IT organization should consider the strengths, weaknesses, opportunities, and threats of the data lake. Hadoop will expand analytic and storage capabilities at lower costs, bringing big data to Main Street. There are still issues around security and governance, no doubt. But in the short term, Hadoop is making a nice play for data collection and staging. Hadoop is not a panacea, but the promise of forward-looking, real-time analytics and the potential to ask – and answer – bigger questions is too enticing to ignore.
Darren Cunningham is vice president of marketing at SnapLogic.