It’s rare to get far into a conversation about “modern BI” these days without someone bringing up the topic of the data lake. While some in the business intelligence world are excited about this relatively new style of data repository because of its big data capabilities, others lambast it for its drawbacks, most commonly the lack of data governance. Either way, ignorance is not bliss when it comes to data lakes – a data lake is not a data warehouse, and the differences are significant.
Here are the top four ways a data lake is different from a data warehouse:
In a data warehouse, data is carefully considered and structured before being pulled in. This is known as a “schema on write” approach to data storage. A data lake, however, takes all data in its original form. That includes data that would be useful to analyze today, in the future, or perhaps never at all.
Unlike a data warehouse, a data lake can support every data type, including non-traditional data types such as text, images, social media content, and web server logs. This is because a data lake maintains data in its raw format and only transforms it when it is ready to be analyzed. This approach is known as “schema on read.”
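To make the contrast concrete, here is a minimal Python sketch of the two approaches. It is an illustration only, not how any particular product works: the schema, the function names, and the sample records are all hypothetical, and plain lists stand in for real storage.

```python
import json

# Hypothetical schema for a "sales" table: field name -> required type.
SALES_SCHEMA = {"order_id": int, "amount": float}


def write_to_warehouse(table, record):
    """Schema on write: validate and shape the record before storing it."""
    row = {}
    for field, expected_type in SALES_SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        row[field] = expected_type(record[field])
    table.append(row)  # fields outside the schema are dropped


def write_to_lake(lake, raw_line):
    """Data lake ingest: keep the raw text exactly as it arrived."""
    lake.append(raw_line)


def read_from_lake(lake):
    """Schema on read: parse and shape records only at analysis time."""
    rows = []
    for raw_line in lake:
        try:
            record = json.loads(raw_line)
        except json.JSONDecodeError:
            continue  # not JSON (e.g. a web server log line); skip for this analysis
        if isinstance(record, dict) and "amount" in record:
            rows.append({
                "order_id": record.get("order_id"),
                "amount": float(record["amount"]),
            })
    return rows


warehouse_table = []
data_lake = []

# Hypothetical incoming data: two order records and one web server log line.
incoming = [
    '{"order_id": 1, "amount": "19.99"}',
    '{"order_id": 2, "amount": 5, "note": "gift"}',
    '127.0.0.1 - - "GET /index.html" 200',
]

for line in incoming:
    write_to_lake(data_lake, line)  # everything is accepted as-is
    try:
        write_to_warehouse(warehouse_table, json.loads(line))
    except ValueError:
        pass  # rejected at write time: unparseable or does not fit the schema

lake_rows = read_from_lake(data_lake)
```

In this sketch the log line never reaches the warehouse table, while the lake keeps it untouched so that a later analysis can apply a different reading of the same raw data.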
A data warehouse only stores data. Data that needs to be analyzed is taken from the data models built on top of the data warehouse, which process it in a highly structured format. A data lake, however, processes data in its raw format. Whatever form the data arrives in is the form in which it is analyzed.
Processing, cleansing, and transforming data for a data warehouse solution design takes time. Because this step is eliminated in a data lake, users have instant access to the data they want to analyze. Information designers can quickly configure, re-configure, and otherwise experiment with data on the fly for powerful ad-hoc purposes.
This type of agility isn’t for everyone, though. Not everyone wants to get their hands dirty with data exploration, and not everyone has the skills to do so. And the very nature of raw data means that data governance in a data lake is essentially nonexistent. Data governance is the responsibility of the users, who should employ tactics such as a closed-loop system or sandbox analytics. Otherwise, the data lake could become a mess of disconnected silos and unusable data. Governance frameworks like Apache Atlas can also be deployed in scenarios where data-governance policies must be enforced.
Data warehouses are extremely powerful. They are designed to make it easy to link data across various dimensions. However, they also can be extremely cumbersome. Of the various types of users who utilize business intelligence on a daily basis, only the highly technical information designers can get under the hood and make changes to a data warehouse.
A data lake, however, is much more agile, making it ripe for modern BI systems. Information designers can fully immerse themselves in the large and varied data sets they need, while more casual business users can pick and choose from the more structured data sources within the data lake. The structured data is easily ordered and processed within the data lake, resulting in an output of analyzed data that users can sift through quickly to gain insight.
By definition, a data warehouse is highly structured. While this makes it a powerful storage option, it makes changes within the data warehouse difficult. Therefore, the biggest benefit of the de-normalized data warehouse is also its flaw. Any work done within a data warehouse falls to highly skilled IT staff. Ad-hoc analytics are impossible with just a traditional data warehouse structure, as any new data first has to be folded into an appropriate cube.
That’s why the increasing demand for self-service business intelligence and modern BI makes a data lake highly attractive. Users are empowered to utilize and experiment with data outside the data warehouse and don’t have to wait for IT to find time for their requests.
That’s not to say, however, that the flexibility of the ungoverned data lake comes without a cost. Don’t forget that unstructured data can quickly lead to chaos for those who don’t know what they are doing.
The Good News
When considering data lakes and data warehouses, it doesn’t have to be an either/or decision. Why not go bimodal and harness the power of both?
Newer tools have emerged that make it possible to bridge the gap between the data warehouse and a data set such as the data lake. This enables users to blend data outside the data warehouse with data within it, making it possible to experiment with and prototype data outside the data warehouse.
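As a rough illustration of what such blending might look like, the Python sketch below enriches governed warehouse rows with a count derived from raw lake records. The data sets, field names, and the `blend` function are hypothetical and stand in for whatever a real BI tool would do.

```python
# Hypothetical governed rows from the data warehouse.
warehouse_sales = [
    {"customer_id": 1, "revenue": 1200.0},
    {"customer_id": 2, "revenue": 450.0},
]

# Hypothetical raw records from the data lake (e.g. social media mentions).
lake_mentions = [
    {"customer_id": 1, "text": "great service"},
    {"customer_id": 1, "text": "fast delivery"},
    {"customer_id": 3, "text": "no account yet"},
]


def blend(sales, mentions):
    """Attach a per-customer mention count from the lake to each warehouse row."""
    counts = {}
    for mention in mentions:
        key = mention["customer_id"]
        counts[key] = counts.get(key, 0) + 1
    # Warehouse rows drive the result; lake data only enriches them.
    return [{**row, "mentions": counts.get(row["customer_id"], 0)} for row in sales]


blended = blend(warehouse_sales, lake_mentions)
```

Here the warehouse rows stay authoritative and the lake data is only layered on top, which is one way to prototype with outside data without changing the warehouse itself.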
A data lake is a low-cost alternative for data storage for companies that want to utilize external data and can pull directly from hundreds, if not thousands, of external data sources. A data lake can serve as a dumping ground until that data is pulled into the front-end business intelligence system. This makes the process significantly faster. Data lakes also encourage self-service data discovery. All of this, combined with the structure and security of a data warehouse, makes for unrivaled access to actionable insight.
Whether you choose a data lake, data warehouse, or a combination of both, at the end of the day, the solution should promote increased use and sharing of data to best meet business goals.
Kim Hanmark is Director of Professional Services EMEA at TARGIT. He has worked within the technology space since the early ’90s, starting as a software developer building solutions for enterprise class companies. After the IT bubble burst in 2001, he transitioned into business management – still with a technology perspective. He has spent the past five years helping large corporations become successful in increasing the adoption of Information Management technologies.