How to Keep Data Lakes Clean and Actionable

By Timo Elliott   |   June 29, 2017 5:30 am

Timo Elliott, VP, Global Innovation Evangelist, SAP

Data lakes are a big opportunity to store large amounts of data in an affordable way without having to decide upfront how it must be structured and used. They are typically used to complement traditional data warehouses, which are still better adapted for highly trusted, tightly governed data such as your financial figures, but there is some overlap between the two repositories.

Data lakes are to data warehouses what spreadsheets are to traditional business intelligence tools. One of the reasons spreadsheets remain so popular is that you can do whatever you like with the data, free of annoying restraints (like not being able to tweak the disappointing sales figures!). Like traditional data warehouses, data lakes bring information together from lots of different sources, but they offer more autonomy for users and far fewer constraints.

This independence has some big upsides. Data lakes provide the perfect environment for fast, iterative analytic experimentation with varied data sets. Huge amounts of information can be stored and used to uncover deep correlations to inform product or marketing strategies, and more. And anyone in the business can generate queries for themselves, without IT as a bottleneck.

But the data lake approach can also have some big downsides. Just like spreadsheets, too much freedom can lead to problems. Without governance, people have the flexibility to do things incorrectly, in ways that can be hard to detect and correct. The result can be “data dissonance” – multiple pools of duplicate or erroneous data. Different teams may end up needlessly recreating the same analytics from scratch, using different – and incompatible – definitions of key business terms.

There’s no magic solution to getting high-quality data. Data lakes need to be governed and maintained; if not, they can easily turn into data swamps full of stagnant data of dubious provenance – making it hard to glean useful insights.

Here are some key steps to ensuring your data lake meets your business goals:

Assign data owners. Any valuable resource is liable to be squandered unless ownership is clearly defined. Good governance comes down to people, not technology. It’s important to have a company-wide program to ensure that every important data source has an identified owner with the responsibility, incentives, and resources necessary to maintain high-quality information.

Keep track of what’s in there. Various solutions are emerging that allow organizations to have a clear understanding of what is available in the data lake. These insights provide not just metadata and technical information, such as which system it originated from and when it was uploaded, but also intelligence about the owner, a data quality rating and more. New collaborative solutions are emerging that allow for crowdsourcing – a sort of “Yelp for data,” where users can vote and make comments on a catalog of different data sources. Regular data curation is essential to ensure that the information remains relevant and up-to-date and that overlapping data from different teams is minimized.
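To make the catalog idea concrete, here is a minimal Python sketch of what one entry might record. Everything in it is an assumption for illustration: the `CatalogEntry` class, its field names, the 1-to-5 voting scale, and the curation thresholds are all hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class CatalogEntry:
    """One data set registered in a hypothetical data lake catalog."""
    name: str
    source_system: str   # technical metadata: where the data came from
    uploaded_on: date    # technical metadata: when it was loaded
    owner: str           # the person accountable for its quality
    ratings: List[int] = field(default_factory=list)  # user votes, 1 to 5

    def average_rating(self) -> Optional[float]:
        """Crowdsourced quality score, or None if nobody has voted yet."""
        if not self.ratings:
            return None
        return sum(self.ratings) / len(self.ratings)


def needs_curation(entry, today, max_age_days=180, min_rating=3.0):
    """Flag entries that are old or poorly rated so a curator can review them.

    The thresholds are illustrative assumptions, not recommendations.
    """
    too_old = (today - entry.uploaded_on) > timedelta(days=max_age_days)
    rating = entry.average_rating()
    poorly_rated = rating is not None and rating < min_rating
    return too_old or poorly_rated
```

A curator could run `needs_curation` over every entry on a regular schedule and send the flagged ones to their owners, which keeps the review step tied to a named person rather than to the tool.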

Establish clear data retention policies. Just because you can keep data forever doesn’t mean you should. As both lawyers and data scientists will tell you, more data is not necessarily better. Every industry and jurisdiction has different requirements for data retention, so be sure to double-check that you are maintaining compliance. However, while data storage costs continue to plummet, it can still be prohibitively expensive to store, for example, all the raw data from all your IoT sensors. Some sort of filtering and culling will always be necessary – and this decision must be regularly reviewed in the light of new technology options and changing costs.
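A retention policy is easier to review when it is written down as data rather than buried in scripts. The sketch below shows one possible shape in Python. The category names and the retention periods are invented for the example and are not legal or industry guidance; the function names are hypothetical too.

```python
from datetime import date, timedelta

# Hypothetical retention periods in days, per data category.
# These numbers are placeholders, not compliance advice.
RETENTION_DAYS = {
    "iot_raw": 30,
    "iot_aggregated": 365,
    "finance": 7 * 365,
}


def is_expired(category, created_on, today, policy=RETENTION_DAYS):
    """Return True when a data set has outlived its retention period.

    Unknown categories are never expired automatically: they should be
    reviewed by a person rather than silently deleted.
    """
    days = policy.get(category)
    if days is None:
        return False
    return today - created_on > timedelta(days=days)


def select_for_culling(datasets, today, policy=RETENTION_DAYS):
    """Return the names of data sets that are candidates for deletion.

    Each item in `datasets` is a (name, category, created_on) tuple.
    """
    return [
        name
        for name, category, created_on in datasets
        if is_expired(category, created_on, today, policy)
    ]
```

Because the policy is a plain dictionary, revisiting it when storage costs or regulations change means editing one table, and the function only proposes candidates, leaving the final deletion decision to the data owner.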

Implement strong data security and compliance. Barely a week goes by without another big cybersecurity breach. Companies must have strong data governance processes for their data lakes. This means there must be data usage policies that outline who can access, distribute, change, delete or otherwise manipulate the information that goes in and out of the lake. New regulations such as the European General Data Protection Regulation (GDPR) enforce strong legal requirements for data protection and data privacy with severe fines in the event of non-compliance (up to 4 percent of worldwide turnover!). Much of the technology underpinning data lakes is open source and typically harder to govern than traditional databases and data warehouses – be prepared to invest in ongoing improvements.
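A data usage policy of the kind described above can be expressed as a simple lookup that denies anything not explicitly granted. The following Python sketch assumes a made-up set of roles, lake zones ("raw" and "curated"), and actions; real platforms have their own access-control mechanisms, and none of these names come from a specific tool.

```python
# Hypothetical policy: which roles may perform which actions in which zone.
POLICY = {
    ("analyst", "curated"): {"read"},
    ("data_owner", "curated"): {"read", "change", "delete", "distribute"},
    ("data_owner", "raw"): {"read", "change", "delete"},
}


def is_allowed(role, zone, action, policy=POLICY):
    """Deny by default: an action is allowed only if the policy lists it."""
    return action in policy.get((role, zone), set())


def audit(requests, policy=POLICY):
    """Return the (role, zone, action) requests the policy would reject."""
    return [r for r in requests if not is_allowed(*r, policy=policy)]
```

The deny-by-default choice matters here: a role or zone that nobody thought to list gets no access at all, which is the safer failure mode when fines for non-compliance are severe.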

Link the data lake to other data sources. Data lakes are sometimes sold as a single source of all data needed for business analytics. But that’s never going to be the reality of a real environment. While it often makes sense to lighten the load of existing data warehouses, most organizations are continuing to invest in them. This is notably the case for finance data, where no-holds-barred experimentation is discouraged! Furthermore, research by Forrester[i] indicates that only 45 percent of the data that businesspeople use in their daily work actually comes from internal systems. The rest is from sources such as third-party data feeds and personal spreadsheets that are unlikely to ever make their way to the data lake. The result is that business processes will always need data from multiple systems, and new solutions are now emerging that allow the creation of governed “data pipelines” that leverage the advantages of each underlying technology. Until these technologies come to fruition, ensure that the data lake is connecting to the appropriate systems.

Act on the insights. Once the guidelines have been set for managing, retaining, and cleaning the data, enterprises need to be agile enough to actually act on the resulting insights. The Climate Corporation is a perfect example – it uses a combination of data lakes to collect massive amounts of agricultural information and applies machine learning to help farmers optimize their planting.

Recent innovations have made data lakes a promising source for companies to tap and extract actionable data to innovate. However, there are a lot of steps needed to ensure data lakes are implemented properly and kept clean. As with any data project, there are many barriers, but the outcome can outweigh them all. For inspiration, you can find data lake case studies here.

______________

[i] Forrester Research: Business-Driven Agile Enterprise Business Intelligence (BI), Transforming BI To Get The Best Of Both Worlds

 

Timo Elliott is an Innovation Evangelist for SAP. For the last 25 years, he has worked closely with companies around the world on the business impact of new analytic technologies. He has presented to business audiences in more than 50 countries on topics such as Digital Transformation, Big Data Analytics, the Internet of Things, and Artificial Intelligence. His articles appear regularly in publications such as Forbes, ZDNet, The Guardian, and Digitalist Magazine. He was the eighth employee of BusinessObjects, a world-leading analytics provider that was acquired by SAP in 2007. He has worked in the UK, Hong Kong, New Zealand, and the U.S., and currently lives in Paris, France. He graduated with a first-class honors degree in Econometrics from Bristol University, UK, and holds a patent in the area of mobile analytics.

 
