As big data becomes an increasingly integral part of business operations, companies are looking for ways to make data ingestion and analysis easier in order to deliver faster, more insightful results. The data lake has emerged as a popular big data tool thanks to its inherent ability to support the accumulation of data in the original format from a potentially infinite number of sources, such as social media, ticketing systems and sensors. Correlations across data sources are extremely valuable because of their potential to generate many more learnings than a single data source. Used wisely, the resulting insights empower professionals to make data-informed business decisions and build smarter automated processes.
Because the data lake is a relatively new concept, there is a significant amount of misinformation about what it is and how it works. My interactions with professionals across different industries and fields, along with dozens of articles on this topic, have made it obvious that a few clarifications are in order.
Here are the most common misconceptions about the data lake:
1) The data lake is a technology.
The data lake can be more precisely defined as a paradigm, rather than a technology. It stems from the realization that keeping and processing data of various types and formats in one place, especially unstructured data from an entire organization, enables the emergence of insight that cannot be derived from single data sources. At its core, the data lake supports the big data endeavors of organizations by paving the way for the discovery of brand new and actionable insights.
2) The data lake is interchangeable with the data warehouse.
Concentrating data from different sources for correlation and modeling purposes is not a new idea. It has been around for years and has long been put into practice by data warehouses. However, data warehouses and data lakes are very different concepts: they differ in how they treat data structure, how the data is used, and how much data they hold.
It’s relatively easy to create complex reports if all the available data can fit into columns with records that match each other, as is the case with a data warehouse. The complexity arises from data that might not be in a format the user knows beforehand or where there is no structure to speak of. Examples of unstructured data sources are social media sources, web articles, invoices, and sensor data. Compared to the data warehouse, the data lake is uniquely suited to analyze both structured and unstructured data.
3) The data lake delivers insight on its own.
The data lake must be put to use by professionals with an understanding of the realities behind business processes. It also works best when used in conjunction with a set of processing, interrogation, transformation, and visualization tools. This means organizations need access to the right software and hardware toolkit, as well as the right pool of talent, to fully tap into the benefits of the data lake.
As the complexity of the questions and queries increases, the required tools and technologies change. This is why it is important to approach the operation of the data lake with agile processes, from both a technical and a business perspective.
4) It’s very hard to work with several multi-format data sources in the data lake.
One of the major concerns about the data lake is that even if you can store data in multiple formats, most apps will not be able to operate on all of them simultaneously. Most apps support only a handful of formats, mostly structured ones. Less well known is that middleware layers can be inserted between the actual data and the apps to provide “data virtualization.” These layers convert the underlying formats into a unified representation, giving each app the data type it requires.
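To make the idea of a virtualization layer concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the reader functions, the `READERS` registry, and the sample data are all hypothetical, and real data virtualization products do far more than this.

```python
import csv
import io
import json
from typing import Dict, Iterable, Iterator, Tuple


def read_csv_records(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV row, keyed by the header line."""
    yield from csv.DictReader(io.StringIO(text))


def read_json_lines(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in text.splitlines():
        if line.strip():
            record = json.loads(line)
            # Normalize values to strings so both formats look alike to the app.
            yield {key: str(value) for key, value in record.items()}


# Hypothetical registry mapping a format name to the reader that handles it.
READERS = {"csv": read_csv_records, "jsonl": read_json_lines}


def virtual_view(sources: Iterable[Tuple[str, str]]) -> Iterator[Dict[str, str]]:
    """Present several (format, raw_text) sources as one stream of records."""
    for fmt, text in sources:
        yield from READERS[fmt](text)


# Made-up sample data in two different formats.
tickets_csv = "id,status\n1,open\n2,closed\n"
sensor_jsonl = '{"id": 3, "status": "open"}\n'

records = list(virtual_view([("csv", tickets_csv), ("jsonl", sensor_jsonl)]))
```

The app consuming `records` never needs to know which source was CSV and which was JSON; supporting a new format means adding one reader to the registry.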
At the same time, many organizations that build data lakes also create so-called data services. These can be virtual files that link to processes instead of actual files. These processes can regenerate expired data, for instance, or apply more complex security policies.
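A virtual file that links to a process can be sketched in a few lines of Python. This is an assumption-laden illustration, not a description of any particular product: the `VirtualFile` class, its `ttl_seconds` parameter, and the `build_report` producer are hypothetical names chosen for the example.

```python
import time
from typing import Callable, Optional


class VirtualFile:
    """A file-like entry whose content comes from a process, not stored bytes."""

    def __init__(self, producer: Callable[[], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._producer = producer      # process that (re)generates the data
        self._ttl = ttl_seconds        # how long generated data stays valid
        self._clock = clock            # injectable so expiry can be simulated
        self._cached: Optional[str] = None
        self._created_at = 0.0

    def read(self) -> str:
        """Return the data, regenerating it first if it is missing or expired."""
        expired = (self._cached is None
                   or self._clock() - self._created_at >= self._ttl)
        if expired:
            self._cached = self._producer()
            self._created_at = self._clock()
        return self._cached


# Usage with a simulated clock, so no real waiting is needed.
calls = []


def build_report() -> str:
    calls.append(1)
    return f"report v{len(calls)}"


now = [0.0]
report = VirtualFile(build_report, ttl_seconds=60, clock=lambda: now[0])
first = report.read()    # generated on first access
second = report.read()   # served from cache
now[0] = 61.0            # pretend 61 seconds have passed
third = report.read()    # expired, so regenerated
```

The same hook that regenerates data is also a natural place to apply a security policy before the content is returned.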
5) The data lake needs to be built on premises.
If you’ve just started to experiment with big data analytics or with building the data lake, choosing a cloud option would actually be the best way to test different environments, as well as different scales for workloads, before locking budget in on-premises deployments.
6) Cloud-based data lakes are not secure.
The data lake, especially a cloud-based one, is purposefully designed with security in mind. When building and running the data lake in the cloud, organizations can take a range of security measures, the most important of which have become industry best practices:
– Data anonymization: Since large amounts of data will be stored in one place, it is highly recommended that data is anonymized. However, data from sources such as social media, which is already in the public domain, may not need to be anonymized. Given that the anonymization requires compute power and data-specific implementation, it might be more efficient to have a set of security tiers for data, and only anonymize certain sets.
– Encryption: Recent versions of Hadoop (starting with 2.6) provide fairly simple ways to encrypt data in transit and at rest. This makes it possible to encrypt the entire data lake – a practice that is highly recommended.
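The tiered approach to anonymization mentioned above might look like the following Python sketch. The tier names, the `FIELD_TIERS` mapping, and the sample record are assumptions made for illustration. Note also that keyed hashing is strictly speaking pseudonymization, which may or may not satisfy your organization's anonymization requirements.

```python
import hashlib
import hmac
from typing import Dict

# Hypothetical mapping of field names to security tiers.
FIELD_TIERS = {
    "email": "restricted",
    "customer_id": "restricted",
    "tweet_text": "public",
}


def pseudonymize(value: str, key: bytes) -> str:
    """Replace a value with a keyed SHA-256 digest (HMAC) in hex form."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record: Dict[str, str], key: bytes) -> Dict[str, str]:
    """Pseudonymize only the fields whose tier requires it."""
    result = {}
    for field, value in record.items():
        # Fields with no declared tier are treated as restricted, to be safe.
        tier = FIELD_TIERS.get(field, "restricted")
        if tier == "restricted":
            result[field] = pseudonymize(str(value), key)
        else:
            result[field] = value
    return result


# Made-up sample record; in practice the key would come from a secrets store.
secret_key = b"example-key-do-not-use-in-production"
raw = {"email": "jane@example.com", "tweet_text": "Loving the new release!"}
safe = anonymize_record(raw, secret_key)
```

Public fields pass through untouched, so compute power is spent only on the data sets that actually need protection.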
Another often overlooked security risk is human error. Segregating access and setting different privileges for users should be a key consideration when deploying the data lake.
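Segregated access can be as simple as mapping roles to privileges and denying everything else. The sketch below is a minimal illustration in Python; the role names and the `ROLE_PERMISSIONS` table are hypothetical, and a production data lake would normally rely on the access controls of its platform rather than on application code like this.

```python
from enum import Flag, auto


class Permission(Flag):
    """Privileges a user can hold on a data set."""
    NONE = 0
    READ = auto()
    WRITE = auto()
    ADMIN = auto()


# Hypothetical roles; each gets only the privileges it needs.
ROLE_PERMISSIONS = {
    "analyst": Permission.READ,
    "engineer": Permission.READ | Permission.WRITE,
    "admin": Permission.READ | Permission.WRITE | Permission.ADMIN,
}


def is_allowed(role: str, needed: Permission) -> bool:
    """Return True only if the role holds every privilege in `needed`."""
    # Unknown roles get no privileges at all.
    granted = ROLE_PERMISSIONS.get(role, Permission.NONE)
    return (granted & needed) == needed


analyst_can_read = is_allowed("analyst", Permission.READ)
analyst_can_write = is_allowed("analyst", Permission.WRITE)
stranger_can_read = is_allowed("contractor", Permission.READ)
```

Defaulting unknown roles to no access means a misconfigured account fails closed rather than open, which limits the damage a simple mistake can cause.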
7) Only data scientists can use the data lake.
The lack of skilled personnel with knowledge in architecting and deploying open source big data projects is often stated as one of the most significant impediments for companies looking to leverage new paradigms and technologies. However, when you stop and examine the skills required for operating the data lake, you might find that your team already has these covered or can quickly learn them.
Building and maintaining the data lake requires a team that:
– Knows or has a mandate to find out about all the various sources of data available in the business;
– Has a mandate to acquire access to these data-sources and set up import routines;
– Has a good understanding of the available tool set for the analysis of the data at hand.
To operate on the actual data via the associated tools requires a team that:
– Has a fair understanding of the business processes across various departments;
– Knows the tools available for transforming, processing, and analyzing the data;
– Has a fairly good understanding of data science, statistics or mathematics;
– Has some algorithmic programming experience: Python, Java and Scala are the most used languages.
Let’s wrap up
The data lake is much more than a new piece of technology. It can pave the way to new insight, better decision making, better services, and accelerated business growth. It provides a competitive advantage and has the potential to revolutionize strategic initiatives within organizations.
Building the data lake is not a trivial endeavor, but given the transformational potential of having one in place, the rewards are surely worth the effort. With more and more success stories popping up in the market, it is likely that having the data lake will become standard practice, and the companies that benefit most will be those that adopt it soonest.
Alex Bordei is the Head of Product Management at big data platform-as-a-service provider Bigstep and has more than 10 years of experience developing infrastructure products. Before his current role, Alex was one of the core developers for Hostway Corporation’s provisioning platform. He then focused on defining and developing products for Hostway’s EMEA market and was one of the pioneers of virtualization in the company. After successfully launching two public clouds based on VMware software, he created the first prototype of the Bigstep platform in 2011. He is engaged in mapping out ever more useful perspectives on the big data paradigm in order to encourage exploration and innovation through big data. You can reach him at email@example.com.