Having spent many years building first- and second-generation data warehouses, I questioned why so many industry-leading companies are embarking on “next-generation” data warehouse programs. It turns out the answer is pretty compelling: to reduce the time-to-answer and the cost-per-answer; achieve limitless scalability; support a growing range of data types, including structured, semi-structured, and unstructured; integrate enterprise data with external digital data; and support IoT analytics. In a nutshell, legacy environments simply cannot keep up with the volume and variety of today’s data demands affordably.
The next-generation data management environment overlays two decades’ worth of data management lessons onto fit-to-purpose, scalable technologies and more tightly couples advanced analytics, dashboards, and enterprise databases.
Here are eight defining elements of the next generation enterprise data warehouse environment:
- Enterprise Data Lakes. Data lakes will replace multiple ETL hubs and landing areas. A single Hadoop-based enterprise data lake will be used for landing and pre-processing all data at the atomic level. The Hadoop ecosystem’s potential benefits for data management and analytics are staggering, offering breakthrough economics along with flexibility and scalability. Its low cost and indifference to data type or structure make it ideal for enterprise data as well as semi-structured and unstructured big data from IoT and digital sources, including social media. So that the data lake does not become a data swamp, organizations will rely on a mature, enterprise-strength Hadoop framework to provide a roadmap for data enrichment, servicing operational, analytical, information-delivery, and consumer-facing needs through well-defined stages of an integrated platform. This will ensure the security, manageability, reliability, and cost performance expected of enterprise-class systems.
- Self-Service Analytics. Self-service analytics will become table stakes, just as IT-supplied dashboards are today. Next-generation business intelligence (BI) apps will empower business users with data discovery and data visualization capabilities. To support the growing number of power users, self-service capabilities will break the pattern of dependence on IT. This will be achieved by tools that remove the traditional semantic definition layer and instead offer intuitive, rapid-response development environments. Enterprise-standard tools will support in-memory processing and embedded analytics to dramatically improve query performance, while business users will have the tools they need for rapid prototyping and storyboarding. The look and feel will be universal across desktop and mobile devices, and data scientists will have big data analytic sandboxes containing complete data sets rather than samples.
- A Data Governance Foundation. The policies, processes, standards, and tools that enable effective use of enterprise data assets will be rolled out early rather than as an afterthought. The office of the Chief Data Officer will become commonplace, defining data governance organizations and processes from the top down. Data ownership and issue-resolution processes will be established across the relevant business units and operations areas. The processes, organization, and tools for data quality management will be deployed from the outset.
- Pooled Compute Infrastructure. This will be the norm; dedicated boxes will be the exception. Pooled compute resources will take a variety of forms, such as grid computing clusters, high-performance specialized MPP platforms, virtualized private clouds, and public clouds. The advantage of pooled platforms is that enterprises can reduce equipment costs by managing utilization more effectively, and new projects can be supported simply by adding compute resources. As a result, the business can eliminate the time and cost of the infrastructure sub-project that is all too often baked into analytics projects.
- Data Ecosystems Leveraging ‘Best Fit’ Data Platforms. This will be the rule of thumb. To scale affordably with emerging data volumes and varieties, next-generation data warehouse environments will employ multiple fit-to-purpose database management systems (DBMSs) rather than a single-repository, one-size-fits-all strategy. For enterprise-class demands to support hundreds or even thousands of online users and background processes, row-oriented databases, particularly MPP databases that scale linearly, will continue to be the preferred solution. Hadoop solutions will be used for landing and staging large data volumes to achieve cost and scalability advantages. Column-store databases will be a tool of choice for many applications that require ultra-fast response with minimal compute resources, such as BI dashboards and many “write once/read many” workloads. NoSQL databases will be used for big data applications that require flexible schemas and constant-time retrieval.
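The flexible-schema, constant-time-retrieval pattern that makes NoSQL stores a good fit for such applications can be sketched with a toy in-process example. This is only an illustration, not a real NoSQL system: the `put`/`get` helpers, the `cust:*` keys, and the document fields are all invented for the sketch, with a Python dict standing in for the key-value store.

```python
import json

# Toy stand-in for a key-value NoSQL store: records are schemaless JSON
# documents, and retrieval by key is constant time (a hash lookup).
store = {}

def put(key, document):
    """Serialize and store a document under an application-chosen key."""
    store[key] = json.dumps(document)

def get(key):
    """Constant-time retrieval by key; returns the deserialized document."""
    return json.loads(store[key])

# Flexible schema: the two customer documents need not share fields.
put("cust:1", {"name": "Ada", "email": "ada@example.com"})
put("cust:2", {"name": "Lin", "loyalty_tier": "gold", "tags": ["iot"]})

print(get("cust:2")["loyalty_tier"])  # direct lookup, no table scan
```

A real deployment would swap the dict for a distributed store, but the access pattern, lookup by key rather than relational query, is the same.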
- Waterfall and Agile Methodologies. These will be employed on a “best fit” basis. Next-generation data warehouse development methodology will meld Agile and Waterfall SDLC approaches based on “Project” and “People” risk profiles. Waterfall can work well in many situations, particularly when things are cut and dried. The higher the risk profile, the more likely that Agile is the best fit. Project characteristics that drive up risk include environment instability (such as new hardware or tools) and uncertain business requirements. People characteristics that drive up risk include geographic distribution of team members, skills deficiencies, and the level of user engagement.
- In-Database Analytics. In-database analytics will be in wide use. Its advantages are manifold. First, keeping analytics in the database helps move toward a single version of the truth; divergent views result from redundant data sets and algorithms coded multiple times. Second, in-database analytics reduces the costs of developing redundant ETL code and maintaining redundant storage. Third, in-database analytics makes it seamless to use entire data sets rather than samples.
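The core idea, pushing computation to the database so raw rows are never copied out for analysis, can be sketched with Python’s built-in sqlite3 module. This is a minimal illustration under assumed names: the `sales` table, its columns, and the data are invented, and a production deployment would run the same pattern against an MPP or analytic database rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 300.0), ("west", 200.0), ("west", 400.0)],
)

# Push the analytics into the database: one query returns per-region
# aggregates, so only the summary (not the raw rows) leaves the DBMS.
# Variance is computed in SQL as E[x^2] - E[x]^2.
rows = conn.execute(
    """SELECT region,
              COUNT(*)    AS n,
              AVG(amount) AS mean,
              AVG(amount * amount) - AVG(amount) * AVG(amount) AS variance
       FROM sales
       GROUP BY region
       ORDER BY region"""
).fetchall()

for region, n, mean, variance in rows:
    print(region, n, mean, variance)
```

The alternative, extracting all rows and recomputing the statistics in every downstream tool, is exactly the redundant-ETL, multiple-versions-of-the-truth pattern the element above argues against.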
- Decommissioning Will Be a Journey, not a Destination. This is what we have said about data warehousing for 20 years. Now we have to take down much of what has been built. To save dollars, accelerate answers, and move toward a single version of the truth, next-generation data warehouse programs will move toward fewer BI and ETL tools and fewer siloed data warehouses. Fewer tools mean reduced license and infrastructure costs, combined with an enhanced ability to stretch skilled labor across more project needs. Reducing the number of data warehouse silos will promote more consistent answers and save money by reducing the labor needed for redundant ETL and DBMS edifices.
Subscribe to Data Informed for the latest information and news on big data and analytics for the enterprise.