SAN JOSE, Calif. — The advantages of using data lakes as a way to corral big data by putting it all in one place, or “lake,” are well documented.
Ah, if only it were that easy. While it can be handy for data scientists, analysts, and others to pull disparate data sources from a data lake, finding what you need and reconciling file compatibility issues can make the work tricky.
At the recent Hadoop Summit here, data lakes were a controversial topic. Walter Maguire, Chief Field Technologist at HP’s Big Data Business Unit, said the data lake is a relatively young concept and noted that it has had its share of criticism, with some saying that, in practical terms, it would be more accurate to call it a “data barn” or “data swamp.” As part of a broader presentation on the topic, he talked up HP’s own Haven for Hadoop solution as a way to make murky data lakes “clear” for data scientists and others to get at the data they need.
Ron Bodkin, Founder and President of Teradata company Think Big, focused his keynote on data lakes, noting the pros and cons as well as customer examples of successful implementations.
“To us, a data lake is a place where you can put raw data or you can process it and refine it and provision it for use downstream,” said Bodkin. “That ability to work with a variety of data is really critical. We see people doing it fundamentally so they can manage all their data and can take advantage of innovation in the Hadoop community and beyond.”
He noted that new tools such as Presto (a SQL query engine), Spark (an engine for in-memory data processing at much higher speeds), and Storm (a distributed real-time computation system for processing large volumes of data) make working with Hadoop faster and more effective. He said that Teradata has a team of 16 people working on enterprise Presto development.
Bodkin offered the example of a high-tech manufacturer that keeps data about its manufacturing processes around the world in a data lake. Used effectively, the data lake lets the company trace its parts and improve the yield, leading to faster time to market and reduced waste. “Putting a Hadoop data lake into their manufacturing system has been a major accomplishment,” he said.
But not all data lake implementations are so successful. Acknowledging the “data swamp” issue, Bodkin said first-wave data lake deployments typically have been built for one use case, some specific thing that a company wanted to accomplish.
“What happened though is these systems have grown as companies are putting in thousands of tables, dozens of sources, and many more users and applications, and it isn’t scaling,” he said. “We have seen this dozens of times with companies that started working with Hadoop two years ago and now it’s a nightmare of a hundred jobs they have to monitor, run, and worry about.”
Building a Mature Data Lake
From Teradata’s enterprise customer engagements, Bodkin said, he’s come to some conclusions about what constitutes a “mature data lake” that can be a trusted part of the enterprise infrastructure.
“Underlying it, of course, is the ability to have security, to be able to secure data,” he said. “You need to have regulatory compliance, the ability to archive data so if you have deep history, the ability to store data in an efficient way that’s not used as often but when it is, in an active way.”
He also pointed to the need for a metadata repository so you can index and find what you need easily, and noted that efficient file transfers into the cluster are important because many of them can get cut off.
“Whatever the means of ingestion, you need to govern it,” he said. “You need to be able to trace as the data’s going into the data lake and what versions.”
The governed data is the trusted version that can be used downstream to, for example, create a data lab where data scientists can experiment with incoming data and combine it with other data in the repository using recently released tools.
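To picture the metadata repository and ingestion tracking Bodkin described, here is a minimal Python sketch. It is not Teradata's or Think Big's implementation; the names `MetadataCatalog`, `IngestRecord`, `register`, `latest`, and `find` are hypothetical, and the catalog lives in memory purely for illustration. Each ingest is recorded with its source, a version number, and a checksum, so a dataset can later be found by name and traced back to where it came from.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class IngestRecord:
    """One governed ingest event (hypothetical structure)."""
    dataset: str
    source: str
    version: int
    sha256: str
    ingested_at: datetime


@dataclass
class MetadataCatalog:
    """In-memory stand-in for a data lake metadata repository."""
    records: dict = field(default_factory=dict)

    def register(self, dataset: str, source: str, payload: bytes) -> IngestRecord:
        # Each new ingest of the same dataset gets the next version number.
        history = self.records.setdefault(dataset, [])
        record = IngestRecord(
            dataset=dataset,
            source=source,
            version=len(history) + 1,
            sha256=hashlib.sha256(payload).hexdigest(),
            ingested_at=datetime.now(timezone.utc),
        )
        history.append(record)
        return record

    def latest(self, dataset: str) -> IngestRecord:
        # The most recent version is the one downstream users should trust.
        return self.records[dataset][-1]

    def find(self, keyword: str) -> list:
        # Case-insensitive search over dataset names.
        needle = keyword.lower()
        return sorted(name for name in self.records if needle in name.lower())


if __name__ == "__main__":
    catalog = MetadataCatalog()
    catalog.register("plant_yield", "factory-a", b"part,yield\nA1,0.97\n")
    catalog.register("plant_yield", "factory-a", b"part,yield\nA1,0.98\n")
    newest = catalog.latest("plant_yield")
    print(newest.dataset, newest.source, newest.version)
```

A production data lake would keep this information in a durable, shared store and tie it to access controls, but the idea is the same: nothing enters the lake without a record of its source and version.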
Wrapping up, Bodkin noted that many companies are still trying to get their footing on how a data lake can help them. Teradata’s approach is to offer a road map designed to help companies plan out what they want to do, including showing how a data lake can be built in a scalable way that can be governed.
“There are best practices and patterns you have access to so you don’t incrementally work your way into a data swamp,” said Bodkin. “You can start off on a clean footing.”
Veteran technology reporter David Needle is based in Silicon Valley, where he covers mobile, enterprise, and consumer topics.