We hear a lot today about streaming data, fast data, and data in motion. It’s as if until now data has been stagnant, just sitting in some dusty database and never moving. The truth is that we have always needed ways to move data.
Historically, the industry has been pretty inventive about getting this done. From the early days of data warehousing and extract, transform, and load (ETL) to today’s real-time streaming ingest systems, we have continued to adapt and create new ways to move data sensibly, even as its appearance and motion patterns have dramatically changed.
In our new data-driven world, exerting firm control over data in motion is an increasingly critical competency and is becoming core to successful business operations. Based on more than 20 years in enterprise data, here is my view of where we have been and where we are headed as the full force of data volume, velocity, and variability takes hold.
First Generation: Stocking the Warehouse via ETL
Let’s roll back a couple of decades. The first substantial data-movement problems emerged in the mid-1990s with the trend toward data warehousing. The goal was to move transaction data provided by disparate applications, or residing in databases, into the newly minted data warehouse. Organizations operated a variety of applications, such as offerings from SAP, PeopleSoft, and Siebel, and a variety of database technologies like Oracle and IBM. As a result, there was no simple way to move the data; each data-movement effort was a bespoke project requiring an understanding of vendor-specific schemas and languages. The inability to “stock the warehouse” efficiently led to data warehouse projects failing or becoming excessively expensive.
ETL tools addressed this initial data-movement problem by creating connectors for applications and databases to load the warehouse. For each source, one needed only to specify the fields and map them into the warehouse. The engine did the rest of the work. I refer to this first generation as schema-driven ETL. It was developer-centric, focused on preprocessing (aggregating and blending) data at scale from multiple sources to get it uniformly into a warehouse for business intelligence (BI) consumption. Large companies spent millions of dollars on these first-generation tools that allowed developers to move data without dealing with the myriad languages of custom applications.
This first generation became a multi-billion dollar industry.
Second Generation: Less Cloudy Skies via iPaaS
Over time, consolidation in the database and application world created a more homogeneous, standards-based landscape. Organizations began to wonder if ETL was even necessary, now that the new world order had done away with the fragmentation that had spawned its existence, and only a small number of database/application mega-vendors remained.
But a new challenge replaced the old. By the mid-2000s, the emergence of SaaS apps added another layer of complexity to the data-movement challenge. The new questions were: “How do we get cloud-based transaction data into warehouses? How do we synchronize information between different cloud applications? Should we deploy integration middleware in the cloud, on-premises, or both?”
As the SaaS delivery model proliferated, customer, product, and other domain data became fragmented across dozens of different applications with inconsistent, overlapping, or redundant data structures. Because cloud applications are API-driven rather than language-driven, organizations had to rationalize across the different flavors of APIs needed to send data between these various locations.
The cloud forced data-movement technologies to evolve from analytic data integration, the sweet spot for data warehouses, to operational integration, featuring data movement between applications that increased the pressure on the system to deliver trustworthy data quickly.
This new challenge led to the emergence of integration Platform-as-a-Service (iPaaS) as the second generation of data-movement tools. These systems, offered by both legacy ETL vendors and newcomers like Mulesoft and SnapLogic, featured myriad API-based connectors for cloud applications, a focus on data quality and master data management capabilities, and the ability to subscribe to data movement as a cloud service. It’s important to note that these systems are still schema-driven and primarily devoted to the “developer productivity problem” of simplifying the design of data flow pipelines.
This second generation continues to contribute to the rapid growth of a multi-billion dollar industry.
The Need for a Third Generation: Feeding and Digesting Big Data
Of course, these first two generations were architected before the emergence of Hadoop and the big data storage-compute revolution, which we are in the midst of today. This new world loads additional requirements onto the previous generations of ETL due to the following characteristics:
- The changing nature of data sources, including log files, sensor output, and clickstream data, which often arrive in multi-structured form. What’s important about these sources is that the source systems and the data they produce are constantly mutating, a phenomenon called data drift. Data drift leads to data loss and corrosion, which, in turn, pollutes downstream analysis and jeopardizes data-driven decisions. A white paper on the subject is available for readers who want to explore data drift more deeply.
- The emergence of interaction data – think clickstreams or social network activity – that must be classified as “events” upon arrival and processed accordingly. This is a higher level of complexity than what we are used to with transactional data and tends to be highly perishable, requiring analysis in real or near real time.
- Data processing infrastructure that has become heterogeneous and complex. The big data stack is based on myriad open-source projects and proprietary tools, chaining together numerous components from ingest to message queues to storage/compute and analytic frameworks. Each of these systems is on its own upgrade path, creating constant pressure for infrastructure upgrades.
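To make data drift concrete, here is a minimal, hypothetical sketch (plain Python; all names are invented for illustration and do not reflect any vendor’s API) of detecting structural drift by comparing each incoming record’s field set against a baseline established by the first record seen:

```python
class DriftMonitor:
    """Tracks the field set of incoming records and flags structural drift."""

    def __init__(self):
        self.baseline = None  # field names from the first record seen

    def check(self, record):
        """Return the fields added or dropped relative to the baseline."""
        fields = set(record)
        if self.baseline is None:
            self.baseline = fields
            return {"new": set(), "missing": set()}
        return {"new": fields - self.baseline,
                "missing": self.baseline - fields}


monitor = DriftMonitor()
monitor.check({"user_id": 1, "ts": "2016-01-01"})  # establishes the baseline
drift = monitor.check({"user_id": 2, "ts": "2016-01-02", "geo": "US"})
# drift["new"] == {"geo"}: an upstream change added a field without notice
```

A schema-centric pipeline would either drop the unexpected `geo` field or fail outright; a drift-aware system can surface the change and react.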
The combination of these factors breaks previous generations of data movement systems, which are too brittle and opaque to thrive in the big data world.
Brittleness comes from a couple of factors. First, schema is no longer static and, in some cases, is non-existent, so processes built with schema-centric systems must be continually and manually reworked in the face of data drift. Second, legacy systems assume stability in the data processing infrastructure and therefore tend to bake the characteristics of downstream systems into the upstream specification (i.e., tight coupling). This makes it difficult to upgrade single components and forces enterprises onto the path of a painful “galactic upgrade,” where none of the stack can be touched unless the entire stack is upgraded.
Opaqueness comes from the developer-centric approach of first- and second-generation solutions. Because the standard use case was batch movement of data from highly stable and well-governed sources, runtime visibility was not a valued concept. The real-time nature of data consumption, which requires continuous operational visibility, was still in the future.
The Third Generation: Performance Management of Data Flows
To address these new challenges, a modern data-movement system needs to think in terms of data drift, continuous data flows, and the performance management of data in motion. As in other fields, by performance management we mean the systematic monitoring and measurement of a data flow to ensure that it meets its design goals. At a minimum, these goals would encompass the continual availability and integrity of the data in motion.
Such a system should have the following qualities:
It should be intent-driven, meaning that at design time you define only the minimum set of conditions you care about in the data. In a world where schema can change without notice, minimizing specification requirements reduces the chance of data flows breaking or data being dropped.
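As an illustration of this idea, the following hypothetical sketch (field names and conditions are invented, not drawn from any product) validates only the declared conditions and lets unanticipated fields flow through untouched, rather than binding the pipeline to a full schema:

```python
# Intent-driven validation: declare only the conditions that matter.
# Extra or unknown fields never cause a record to be rejected.
REQUIRED_CONDITIONS = {
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
    "currency": lambda v: isinstance(v, str) and len(v) == 3,
}


def validate(record):
    """Return the names of violated conditions for one record."""
    return [name for name, ok in REQUIRED_CONDITIONS.items()
            if name not in record or not ok(record[name])]


validate({"amount": 10.5, "currency": "USD", "new_field": "ignored"})  # -> []
validate({"amount": -1})  # -> ["amount", "currency"]
```

Because the pipeline asserts intent rather than structure, an upstream system can add fields freely without breaking the flow; only violations of the stated conditions demand attention.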
It also should be aware of and responsive to data drift. It should be on the lookout for changes to data structure and patterns in the data that may indicate a semantic drift. Ideally, it would take automated action based on these changes, or at least provide an early warning.
Because we are dealing with real-time interaction data, the system should provide complete operational visibility and control over the data flow so that quality issues can be detected and dealt with either automatically or, at least, quickly. A set-and-forget approach is simply mismatched to today’s requirements. This type of operational visibility replaces opaqueness with transparency.
At the infrastructure level, because modern data processing environments are more heterogeneous and dynamic, third-generation solutions should employ a “containerized” architecture, in which each stage in a data flow is logically isolated and 100 percent independent of its neighbors. This allows for zero-downtime upgrades of any of the myriad components that make up your infrastructure.
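The isolation idea can be sketched in miniature: if stages communicate only through a generic record shape (plain dictionaries here), any single stage can be swapped or upgraded without touching its neighbors. This is an illustrative toy, not a real pipeline framework; the stage names are invented:

```python
def parse(raw):
    """Stage 1: turn a raw 'key=value' line into a record."""
    key, _, value = raw.partition("=")
    return {key: value}


def enrich(record):
    """Stage 2: add a derived field without mutating the input."""
    return {**record, "length": len(next(iter(record.values())))}


def run_pipeline(lines, stages):
    """Apply each stage to every record; stages know nothing of each other."""
    records = lines
    for stage in stages:
        records = [stage(r) for r in records]
    return records


run_pipeline(["host=web01"], [parse, enrich])
# -> [{"host": "web01", "length": 5}]
```

Because `run_pipeline` depends only on the record interface, replacing `parse` with a JSON or Avro reader requires no change to `enrich` or to the pipeline driver, which is the property that makes independent, zero-downtime component upgrades plausible.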
From Classical to Jazz
To employ music as a metaphor, if the move from the first to the second generation was like adding more and more innovative instruments to an orchestra, the shift from the second to the third generation is akin to moving from scripted “common practice” classical music to improvisational jazz. The first transition changes old instruments for new, but you are still playing similar compositions. The current transition throws out the sheet music and asks the band to make continual, subtle, and unexpected shifts in the composition on the fly and without losing the rhythm.
In the new world of drifting data, complex componentry, and real-time requirements, there is really no choice but to embrace a jazz-like approach, channel your inner Miles Davis, and shift to a third-generation mindset when it comes to data in motion. Just because sources and data flows are dynamic and chaotic doesn’t mean they can’t be blended into beautiful music that’s good for the soul.
Girish Pancha is the founder and CEO of StreamSets. Girish is a data industry veteran who has spent his career developing successful and innovative products that address the challenge of providing integrated information as a mission-critical, enterprise-grade solution. Before co-founding StreamSets, Girish was an early employee and chief product officer at Informatica, where he was responsible for the company’s entire product portfolio. Girish also previously co-founded Zimba, a developer of mobile applications providing real-time access to corporate information, which he led to a successful acquisition. Girish began his career at Oracle, where he led the development of Oracle’s BI platform.
Subscribe to Data Informed for the latest information and news on big data and analytics for the enterprise, plus get instant access to more than 20 eBooks.