How data marketplaces are superseding ETL as the right approach for data preparation
Recently, Gartner declared that the “shift from IT-led reporting to modern business-led analytics is now mainstream” (Magic Quadrant for Business Intelligence and Analytics Platforms, February 16, 2017). Gartner continued by saying that in the future BI tools will focus on helping business users and citizen data scientists drive the data discovery, visualization, and analytic process on their own – without help from IT or the need for predefined data models.
But data analysts and their IT partners accelerating down the road of end-user analytic empowerment should know that there is a roadblock just around the corner – namely reliance on traditional extraction/transformation/load (ETL) tools for data preparation.
Last Mile vs. Enterprise Scale Data Preparation
Modern BI tools, with built-in support for self-service data preparation or data wrangling, only solve part of the ETL problem. While they help users with the “last mile” of data preparation – building and retrieving custom data sets from relatively clean databases or flat files – they don’t tackle the more difficult upstream challenge of enterprise-scale data preparation.
Traditionally the sweet spot for ETL tools, enterprise-scale data preparation includes all of the tasks associated with moving, preparing and managing data from enterprise, legacy, mainframe or third-party systems into an enterprise-scale data management platform.
ETL tools remain a good fit for data preparation applications where the IT team can map out the data ETL process ahead of time, codify that into a relatively stable set of data flows and transformations and deliver new data into a relatively static, well-defined target schema and platform.
Modern Business-Led Analytics Demand a New Approach to Data Preparation
The challenge is that today many data preparation requirements don’t match this description. Business users and data analysts are looking to prepare data in new ways, often with no up-front definition of what data they need or how they plan to use it. With limited technical skills and a need to get new data in hours or weeks (versus months), these users need a different way to handle the enterprise-scale part of data preparation: one that is independent of the IT team and goes beyond the “last mile” data preparation capabilities included in modern BI tools. A closer look at this emerging requirement reveals key differences in five areas, each explored in more detail below.
1) Who is driving the data preparation process?
2) How quickly do they need the data?
3) What skills do they have?
4) What do they need to know about the data and when?
5) What kind of data do they need – raw, processed or both?
Who is driving the data preparation process?
Within most companies today, there are hundreds or thousands of business users who need access to new data quickly – in hours or days – to keep up with the expanding backlog of analytics projects on their plate. These people need to be able to flexibly and iteratively discover and define what data they need in each new data set and how it should be organized.
But, rather than empowering business users to prepare data in a distributed, agile, real-time way, ETL tools were designed to help IT teams drive the data preparation process centrally, with the goal of delivering specific data sets into a defined data model in a database, data warehouse or data mart.
How quickly do they need the data?
Additionally, with ETL tools, it typically takes weeks or months for new data to make it through the gathering, data modeling, data mapping, testing and deployment process before business users can work with it. This is too slow and too rigid to meet today’s needs.
What skills do they have?
ETL tools generally require programming skills and specialized knowledge of data and enterprise systems. This limits the pool of people who can use them to a relatively small group of IT specialists and ETL programmers. As long as ETL tools are the solution of choice for data prep within an organization, data scientists and data-savvy business users will be unable to build their own data sets quickly and flexibly.
What do they need to know about the data and when?
While ETL tools generally include some provision to view the metadata and lineage for data, that access is generally designed for technical people using specialized systems like metadata manager applications. Often, end users looking at data prepared and delivered through an ETL tool can’t see that metadata or lineage information, leaving them uncertain and skeptical about the data’s usefulness or trustworthiness.
What kind of data do they need – raw, processed or both?
Finally, the basic metaphor for an ETL tool is a pipe: Raw data goes in one end and scrubbed, structured data comes out the other end, having been transformed along the way. The intermediate stages of the data as it goes through that transformation are not preserved, or at least are not visible or accessible to people outside the ETL team.
This is at odds with the needs of today’s increasingly diverse data consumers, who differ significantly in terms of their analytics skills and data needs. While some people might want highly processed “gold standard” data for standardized reporting, others might need raw data for building advanced analytic models. Still others may need to mix and match raw and finished data to build a new fit-for-purpose data set. Because ETL tools only deliver finished data to the business, they don’t meet the diverse data needs of today’s data consumers.
If not ETL tools, then what?
So if ETL tools are no longer the right solution for data preparation in the modern era of business-led analytics and self-service data prep, what is? Data marketplaces are emerging as the modern alternative to ETL tools.
A data marketplace is an enterprise data management platform, built for the era of big data and pervasive analytics, that supports both “last mile” and enterprise-scale data preparation. Let’s look at the specific requirements of this new type of data preparation – and explore how data marketplaces meet the need.
Improved Accessibility, Reliability, and Transparency
What makes a marketplace different from ETL tools is that it is a turnkey application with a graphical interface that requires no technical skills to use. As a result, a much broader group of users can use a data marketplace to prepare data, either for simple “last mile” data wrangling or to drive more complex enterprise-scale data preparation.
Additionally, a marketplace is different because it creates a persistent set of data – an enterprise-scale data lake – that contains data at every point in the raw-to-ready process. Unlike ETL tools, which only give users access to finished, processed data, the data marketplace allows users to access and use data from any point in the data prep process, depending on their specific needs.
Finally, in a data marketplace users always have direct access to lineage and other metadata describing how each piece of data has been prepared or transformed. This helps users understand, trust and ultimately use data more effectively.
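The contrast between a pipe and a persistent, lineage-aware store can be illustrated with a small sketch. This is not how any particular data marketplace product is implemented; the class and method names (Stage, StagedDataset, derive, lineage) and the sample record are hypothetical, chosen only to show the idea of keeping every raw-to-ready stage available alongside a description of how it was produced.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


@dataclass
class Stage:
    """One snapshot of the data plus the metadata describing how it was made."""
    name: str                    # e.g. "raw", "cleaned", "ready" (illustrative names)
    rows: List[Row]              # the data as it looked at this stage
    derived_from: Optional[str]  # parent stage name; None for the raw stage
    step: str                    # human-readable description of the transformation


class StagedDataset:
    """Hypothetical store that keeps every stage instead of only the final output."""

    def __init__(self, raw_rows: List[Row]) -> None:
        self._stages: Dict[str, Stage] = {}
        self._stages["raw"] = Stage("raw", list(raw_rows), None, "loaded from source")

    def derive(self, parent: str, name: str, step: str,
               fn: Callable[[Row], Row]) -> None:
        """Apply fn to a copy of each parent row and store the result as a new stage."""
        parent_rows = self._stages[parent].rows
        new_rows = [fn(dict(row)) for row in parent_rows]
        self._stages[name] = Stage(name, new_rows, parent, step)

    def rows(self, name: str) -> List[Row]:
        """Return the data at any stage, not just the finished one."""
        return self._stages[name].rows

    def lineage(self, name: str) -> List[str]:
        """Walk back to the raw stage and list the steps in the order they ran."""
        steps: List[str] = []
        current: Optional[str] = name
        while current is not None:
            stage = self._stages[current]
            steps.append(f"{stage.name}: {stage.step}")
            current = stage.derived_from
        return list(reversed(steps))


# Illustrative usage with a made-up record.
ds = StagedDataset([{"customer": " Acme ", "amount": "100"}])
ds.derive("raw", "cleaned", "trimmed whitespace from customer",
          lambda r: {**r, "customer": str(r["customer"]).strip()})
ds.derive("cleaned", "ready", "converted amount to integer",
          lambda r: {**r, "amount": int(str(r["amount"]))})

for line in ds.lineage("ready"):
    print(line)
```

In this sketch, an analyst building a model could read the "raw" stage while a report author reads "ready", and either could call lineage to see which steps separate the two. A pipe-style flow would instead return only the final rows.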
A core argument in favor of data marketplaces is that when implemented correctly, they massively accelerate data preparation and speed delivery of new data to business users. For example, it can take up to six months to deliver data into a data warehouse via an ETL tool, while the same process can be done in a couple of days with a data marketplace. And with more people able to prepare their own data directly, IT and ETL specialists can direct more resources towards strategic oversight of the data marketplace platform or other high-priority projects.
Today, businesses – with their growing armies of data analysts and exploding big data volumes – are under extreme pressure to broaden their use of data to drive analytic insights and business agility. The entire self-service data preparation process depends heavily on the ability of IT teams to find faster, more scalable ways to deliver new data to business users.
Built for an earlier era, ETL tools are good at building highly structured data sets that rarely change, such as data for regulatory or financial reporting. But for most data preparation tasks, a new solution is needed.
Data marketplaces are emerging as the modern replacement for ETL tools. By combining enterprise-grade data management and preparation capabilities with a graphical interface and a robust metadata layer on top of a secure, well-governed data lake, a data marketplace delivers the flexibility, speed and scale needed to meet the data preparation and delivery needs of today’s agile, analytics-driven organization.
Paul S. Barth, PhD
CEO, Podium Data
Dr. Paul Barth has spent decades developing advanced data and analytics solutions for Fortune 100 companies, and is a recognized thought leader on business-driven data strategies and best practices. He is a founder and the CEO of Podium Data, created to help companies make practical use of the exploding amount of data – from legacy systems to unstructured social media streams – to make order out of chaos and deliver actionable business intelligence. He holds a PhD in computer science from MIT and an MS from Yale University.