Top 4 Ways a Data Lake is Different from a Data Warehouse

by Kim Hanmark   |   August 25, 2016 1:30 pm   |   2 Comments

Kim Hanmark, Director of Professional Services EMEA, TARGIT

It’s rare to get far into a conversation about “modern BI” these days without someone bringing up the topic of the data lake. While some in the business intelligence world are excited about this relatively new style of data repository because of its big data capabilities, others lambast it for its drawbacks, most commonly the lack of data governance. Either way, ignorance is not bliss when it comes to data lakes – a data lake is not a data warehouse, and the differences are significant.

Here are the top four ways a data lake is different from a data warehouse:

Data Types

In a data warehouse, data is carefully considered and structured before being pulled in. This is known as a “schema on write” approach to data storage. A data lake, however, takes in all data in its original form. That includes data that would be useful to analyze today, in the future, or perhaps never at all.

Unlike a data warehouse, a data lake can support every data type, including non-traditional data types such as text, images, social media content, and web server logs. This is because a data lake maintains data in its raw format and only transforms it when it is ready to be analyzed. This approach is known as “schema on read.”
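The contrast between the two approaches can be sketched in a few lines of Python. This is a minimal illustration rather than any product's API: the `SCHEMA` dictionary, the sample records, and the helper names `write_to_warehouse` and `read_from_lake` are all hypothetical.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
SCHEMA = {"order_id": int, "amount": float}

def write_to_warehouse(table, record):
    """Schema on write: validate and shape the record before storing it."""
    row = {}
    for field, expected_type in SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        row[field] = expected_type(record[field])
    table.append(row)  # only the declared fields are kept

def read_from_lake(raw_lines, fields):
    """Schema on read: parse raw text and pick fields only at query time."""
    for line in raw_lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# The lake keeps every record exactly as it arrived, as raw JSON text.
lake = [
    '{"order_id": 1, "amount": "19.90", "comment": "gift wrap"}',
    '{"order_id": 2, "clickstream": ["home", "cart"]}',
]

warehouse = []
write_to_warehouse(warehouse, json.loads(lake[0]))  # fits the schema
try:
    write_to_warehouse(warehouse, json.loads(lake[1]))  # no "amount": rejected
except ValueError as err:
    rejected = str(err)

# A later question can use fields the warehouse schema never anticipated.
comments = list(read_from_lake(lake, ["order_id", "comment"]))
```

In this sketch the warehouse drops the `comment` field and rejects the second record outright, while the lake keeps both records untouched and leaves the choice of fields to whoever reads them later.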

A data warehouse only stores data. Data that needs to be analyzed is taken from the data models built on top of the data warehouse, which process it in a highly structured format. A data lake, however, processes data in its raw format. Whatever form the data arrives in is the form in which it is analyzed.

Speed

Processing, cleansing, and transforming data for a data warehouse solution design takes time. Because this step is eliminated in a data lake, users have instant access to the data they want to analyze. Information designers can quickly configure, re-configure, and otherwise experiment with data on the fly for powerful ad-hoc purposes.

This type of agility isn’t for everyone, though. Not everyone wants to get their hands dirty with data exploration, and not everyone has the skills to do so. And the very nature of raw data means that data governance in a data lake is essentially nonexistent. Data governance is the responsibility of the users, who should employ tactics such as a closed-loop system or sandbox analytics. Otherwise, the data lake could become a mess of disconnected silos and unusable data. Governance frameworks like Apache Atlas can also be deployed in scenarios where data-governance policies must be enforced.
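As a rough sketch of what user-driven governance might involve at its simplest, the snippet below keeps a small catalog that records who owns each data set, where it came from, and when it was last refreshed. It is an assumption-laden illustration only: the `DatasetEntry` class, the `register` function, and all field names are hypothetical, and none of this reflects the Apache Atlas API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DatasetEntry:
    """Hypothetical catalog record describing one data set in the lake."""
    name: str
    owner: str
    source: str
    filters_applied: str = "none"
    column_notes: dict = field(default_factory=dict)
    last_refreshed: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

catalog = {}

def register(entry):
    """Refuse undocumented data sets so every entry has an owner and a source."""
    if not entry.owner or not entry.source:
        raise ValueError(f"{entry.name}: owner and source are required")
    catalog[entry.name] = entry

register(DatasetEntry(
    name="web_logs_raw",
    owner="analytics-team",
    source="web server logs",
    column_notes={"status": "HTTP status code as logged"},
))
```

Even a record this small gives later users somewhere to look up who is responsible for a data set and whether a filter was applied before it landed in the lake.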

Usability 

Data warehouses are extremely powerful. They are designed to make it easy to link data across various dimensions. However, they also can be extremely cumbersome. Of the various types of users who utilize business intelligence on a daily basis, only the highly technical information designers can get under the hood and make changes to a data warehouse.

A data lake, however, is much more agile, making it ripe for modern BI systems. Information designers can fully immerse themselves in the large and varied data sets they need, while more casual business users can pick and choose from the more structured data sources within the data lake. The structured data is easily ordered and processed within the data lake, resulting in an output of analyzed data that users can sift through quickly to gain insight.

Flexibility

By definition, a data warehouse is highly structured. While this makes it a powerful storage option, it makes changes within the data warehouse difficult. Therefore, the biggest benefit of the de-normalized data warehouse is also its flaw. Any work done within a data warehouse falls to highly skilled IT staff. Ad-hoc analytics are impossible with just a traditional data warehouse structure, as any new data first has to be folded into an appropriate cube.

That’s why the increasing demand for self-service business intelligence and modern BI makes a data lake highly attractive. Users are empowered to utilize and experiment with data outside the data warehouse and don’t have to wait for IT to find time for their requests.

That’s not to say, however, that the flexibility of the ungoverned data lake doesn’t come with a toll. Don’t forget that unstructured data can quickly lead to chaos for those who don’t know what they are doing.

The Good News 

When considering data lakes and data warehouses, it doesn’t have to be an either/or decision. Why not go bimodal and harness the power of both?

Newer tools have emerged that make it possible to bridge the gap between the data warehouse and outside data stores such as a data lake. This enables users to blend data outside the data warehouse with data within it, making it possible to experiment and build prototypes with data that lives outside the data warehouse.
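The kind of blending described here can be sketched as a join between a structured table and raw records that were never loaded into it. Everything in the snippet is assumed for illustration: an in-memory SQLite database stands in for the data warehouse, a list of JSON strings stands in for files in the data lake, and the table, column, and field names are made up.

```python
import json
import sqlite3

# Stand-in for the warehouse: a structured, typed table (hypothetical schema).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT)"
)
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "EMEA"), (2, "APAC")],
)

# Stand-in for the lake: raw events kept in their original form.
lake_events = [
    '{"customer_id": 1, "page": "/pricing"}',
    '{"customer_id": 1, "page": "/demo"}',
    '{"customer_id": 2, "page": "/pricing"}',
    '{"customer_id": 3, "page": "/blog"}',
]

def page_views_by_region(conn, raw_events):
    """Count lake events per warehouse region; unknown customers are grouped apart."""
    regions = dict(conn.execute("SELECT customer_id, region FROM customers"))
    counts = {}
    for line in raw_events:
        event = json.loads(line)
        region = regions.get(event["customer_id"], "unknown")
        counts[region] = counts.get(region, 0) + 1
    return counts

views = page_views_by_region(warehouse, lake_events)
```

The warehouse table is never modified in this sketch. The raw events are parsed and matched against it only when the question is asked, which is what makes this style of prototyping quick.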

A data lake is a low-cost alternative for data storage for companies that want to utilize external data and can pull directly from hundreds, if not thousands, of external data sources. A data lake can serve as a dumping ground until that data is pulled into the front-end business intelligence system. This makes the process significantly faster. Data lakes also encourage self-service data discovery. All of this, combined with the structure and security of a data warehouse, makes for unrivaled access to actionable insight.

Whether you choose a data lake, data warehouse, or a combination of both, at the end of the day, the solution should promote increased use and sharing of data to best meet business goals.

Kim Hanmark is Director of Professional Services EMEA at TARGIT. He has worked within the technology space since the early ’90s, starting as a software developer building solutions for enterprise-class companies. After the IT bubble burst in 2001, he transitioned into business management – still with a technology perspective. He has spent the past five years helping large corporations increase their adoption of Information Management technologies.

2 Comments

  1. Kasun Rajapaksha
    Posted March 26, 2017 at 11:27 pm | Permalink

    I found the data lake concept very attractive, but at the same time it’s very hard to find a user-friendly exploration tool for a data lake. Power BI is a good option with GUI-based operations. Other than that, I could not find a good solution that empowers users yet is user friendly.

  2. Chowderhead
    Posted July 14, 2017 at 8:32 am | Permalink

    The Data Lake concept sounds like a lazy way for IT to provide analytics. Rather than do the upfront work needed to unify the data to provide true, enterprise-wide metrics – we just leave the user to do it instead.

    The result is (and I say this from firsthand experience) a data dumping ground of ungoverned, duplicated, and questionable data.

    When did this table get refreshed last? Who knows? Is the data complete or was there a filter applied? Again, who knows? What does the flag IsTpcMx on this table mean? Ditto.

    But the real fun doesn’t really begin until different people produce the same number with different values.
