Interest in data lakes continues to grow, but to say that the hype surrounding the data lake is perpetuating confusion throughout the industry would be an understatement. While concepts like data warehousing and big data have found their place, data lakes are still causing IT and business stakeholders to scratch their heads.
As the demand for clear definitions, use cases, and best practices continues to grow, IT professionals need a definitive guide to data lakes that answers the following questions: What are data lakes? How should we be using them? And how are they changing the big data game as we know it?
Definitions and Perspectives
As data lakes become a fast-growing part of the core data architecture, IT professionals often wonder whether the data lake is more of an architectural strategy or an architectural destination. The answer is not clear-cut, but there is a way to address the question. Defining the data lake as a centralized repository of enterprise data for various data workloads describes the end-state architecture, while the architecture decisions needed to bring the data lake to critical mass form the strategy for getting there.
Data lake adoption continues to increase, and occurs in four critical stages:
- Evaluate technology. By conducting big data pilot projects and focusing on specific business goals and outcomes, individuals using data lakes are able to test the technology while becoming familiar with managing an Apache Hadoop environment.
- React. In this stage, companies begin leveraging Hadoop to tackle existing architecture inefficiencies in order to determine clear and measurable business opportunities. Additionally, this part of the adoption process is crucial for increasing IT efficiency.
- Be Proactive. By taking steps to consolidate data for analytics projects and utilizing Hadoop for affordable scalability, companies can be ready to handle emerging data sources in mass quantities – such as Internet of Things data, social media data, and unstructured data – in a single, centralized repository.
- Establish Core Competency. As the data lake becomes a core component of an IT strategy, companies eventually come to a tipping point at which silos between operational applications and analytic applications are torn down and re-established as a single enterprise platform.
Data Lake Organization
Through the flexibility and scalability of Hadoop, more types of data than ever before can be maintained, cataloged, explored, and utilized. But what truly keeps data lakes from becoming data swamps is governance. The organization and security of data can arguably make or break successful data discovery. Clear and logical data organization – whether by classification or by data usage – can help Hadoop engineers make more sound technical decisions and help analysts and data scientists garner true insights from their data.
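To make the idea of organizing data by classification or by usage more concrete, here is a minimal sketch in Python. It is only an illustration: the dataset names, classification labels, usage labels, and paths are all hypothetical, and it does not represent any particular Hadoop cataloging tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One cataloged dataset in the lake (all fields are illustrative)."""
    name: str
    classification: str  # hypothetical labels: "internal", "restricted"
    usage: str           # hypothetical labels: "discovery", "bi", "data_science"
    path: str            # hypothetical storage location


def group_by(entries, attribute):
    """Group dataset names under each value of the chosen attribute."""
    groups = defaultdict(list)
    for entry in entries:
        groups[getattr(entry, attribute)].append(entry.name)
    return {key: sorted(names) for key, names in groups.items()}


# Hypothetical catalog entries, assumed purely for illustration.
catalog = [
    DatasetEntry("web_clicks", "internal", "discovery", "/lake/raw/web_clicks"),
    DatasetEntry("sales_orders", "restricted", "bi", "/lake/curated/sales_orders"),
    DatasetEntry("sensor_feed", "internal", "data_science", "/lake/raw/sensor_feed"),
]

by_classification = group_by(catalog, "classification")
by_usage = group_by(catalog, "usage")
```

The same catalog can be viewed either way, so engineers can reason about storage and handling by classification while analysts browse by usage.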
Unifying Data Discovery, Data Science, and BI
A major factor driving the deployment of data lakes is support for enterprise BI, data discovery, and data science, the last by providing raw data for machine learning algorithms and statistical functions. Because agile methodologies provide a responsive approach to enterprise BI, data lakes can supply more detailed business transactions, performance indicators, and metrics while also serving as a home for historical data.
With competitive business environments changing at a dizzying pace, companies must understand the crucial role of discovery and the importance of uncovering answers to questions they don’t yet know to ask. This prompts the need for analytics to work directly with the data to find insights that are meaningful and add value to the organization.
Critical Factors to Success
For companies to get the most from their data lake, a few factors should be taken into consideration:
- Think about data for the long term. Every data project should start with consideration for the data’s reusability in future applications. By understanding that future data needs are often unknown, companies can better prepare and use their data accordingly.
- Establish data governance first. Just as data governance has been applied to data and information policies throughout the enterprise, it should be extended to data lakes as well. Governance establishes a common understanding of how everyone within an organization should work within the data lake and helps minimize room for error and data mismanagement.
- Tackle security needs up front. A data-centric security approach provides a broad perspective for considering data throughout its lifecycle. The key is to take security head-on from day one, defining which data will be brought into the data lake and who has access rights to the various data within it.
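The security point above, deciding up front which data enters the lake and who may use it, can be sketched as a small deny-by-default policy. This is a hedged illustration in Python, not a real security product or Hadoop API: the `AccessPolicy` class, the role names, and the dataset names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    """Illustrative policy: which datasets are admitted and who may read them."""
    grants: dict = field(default_factory=dict)  # dataset name -> set of roles

    def admit(self, dataset, roles):
        """Approve a dataset for the lake and record the roles allowed to read it."""
        self.grants[dataset] = set(roles)

    def can_ingest(self, dataset):
        """Only datasets that were explicitly admitted may be brought in."""
        return dataset in self.grants

    def can_read(self, role, dataset):
        """Deny by default: unknown datasets and unlisted roles get no access."""
        return role in self.grants.get(dataset, set())


# Hypothetical datasets and roles, assumed purely for illustration.
policy = AccessPolicy()
policy.admit("sales_orders", roles=["bi_analyst"])
policy.admit("web_clicks", roles=["bi_analyst", "data_scientist"])

analyst_reads_sales = policy.can_read("bi_analyst", "sales_orders")
scientist_reads_sales = policy.can_read("data_scientist", "sales_orders")
unapproved_ingest = policy.can_ingest("hr_records")
```

Defining the policy before any data lands means a dataset that was never admitted, such as the hypothetical `hr_records` here, is rejected at ingestion rather than discovered later.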
While the data lake is still fairly new to the big data vernacular, it is an architecturally sound part of enterprise IT architecture and overall data strategy, and it goes well beyond the data science and machine learning analytics associated with cheap, commodity infrastructure. Understanding the key concepts surrounding data lakes can help organizations better use and protect their data and increase their ability to discover through data.
John O’Brien is Principal Advisor and CEO at Radiant Advisors. With more than 25 years of experience delivering value through data warehousing and BI programs, John’s unique perspective comes from the combination of his roles as a practitioner, consultant, and vendor in the BI industry. His knowledge in designing, building, and growing enterprise BI systems and teams brings real-world insights to each role and phase within a BI program. Today, through Radiant Advisors, John provides research and advisory services that guide companies in meeting the demands of next generation information management, architecture, and emerging technologies.
John will present a session titled “A Principles-Based Data Lake Approach” Oct. 18 at the Teradata 2015 PARTNERS Conference and Expo in Anaheim, CA.