ETL has come a long way in the nearly 20 years since it debuted as a concept in data warehousing.
ETL, which stands for extract, transform and load, refers to the process of taking data from original sources (such as financial reports, shipment records and revenue figures), putting it into the right format, and loading it into a target such as a database or, more likely, a data mart or data warehouse. There, trained professionals could submit queries and perform further analytics.
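The three stages can be sketched in a few lines of code. The following is a minimal, hypothetical example (the CSV layout, field names, and target table are illustrative assumptions, not a real product's pipeline), using Python's standard library to pull revenue figures from a file, clean them up, and load them into a warehouse-style table:

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw revenue records from a CSV source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: normalize field names and convert revenue to a number."""
    return [
        {"region": row["Region"].strip().upper(),
         "revenue": float(row["Revenue"])}
        for row in rows
    ]

def load(rows, conn):
    """Load: insert the cleaned rows into a warehouse-style table."""
    conn.execute("CREATE TABLE IF NOT EXISTS revenue (region TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO revenue (region, revenue) VALUES (:region, :revenue)", rows
    )
    conn.commit()
```

Real ETL tools add scheduling, error handling, and connectors for many source and target systems, but the shape of the work is the same: pull, reshape, write.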
It was painstaking work. Standards were few and far between, data quality was often poor, and most ETL was done through custom coding, perhaps by a team of C++ programmers. The process wasn't without problems: it tended to reinforce siloed data practices, and it still didn't address data lifecycle issues such as how many copies of data were needed and how to maintain a "single version of the truth."
There has been a lot of progress since then. For starters, ETL is no longer "alone." It is typically sold and deployed as part of a suite and is often far more automated than in the past. For example, Syncsort, a company whose data integration suite DMExpress includes ETL, has roots in mainframe software developed more than 40 years ago. The picture has broadened since: the company's software now includes an optimizer to speed processing; many different sort and join options for assembling desired data sets; a compression tool to make data handling more efficient; and a metadata interchange feature that allows the import of jobs from other platforms, such as Informatica and IBM InfoSphere DataStage, says Jorge A. Lopez, senior manager for data integration at Syncsort. ETL is now a key part of the data integration layer in business intelligence applications, Lopez says.
ETL is not just for large enterprises. Most companies are doing ETL of some sort, says Adam Jorgensen of Pragmatic Works, a company that provides training, tools, and services for customers using the Microsoft Business Intelligence platform. “The small companies may not be using SAS or Informatica but they are doing similar, if less structured things through Excel or Access” to prepare data for analysis, he says.
Indeed, the need for ETL is less about company size and more about how central data analysis is to their business, Jorgensen says. Some companies have a small headcount but they are very much in the data business so they may have a big need for ETL even though they are mom-and-pop scale. First-time users can be companies that have $1 million in revenues, he says.
ETL Evolving with Programming Styles, Data Types
It used to be that ETL was very customized and hand-driven, with some programmers coding in C++, Jorgensen says. Now there are many products that deliver ETL in a drag-and-drop style. Tasks that used to take a week to engineer, say, preparing to analyze a new kind of data stream, can now be done in a few hours, he says.
ETL Then and Now

| 20 Years Ago | Today |
| --- | --- |
| Custom, hand-coded | Almost always vendor-supplied |
| Focused narrowly on moving data | Usually part of a software suite focused on broader data handling and integration issues |
| Key aspect of the data warehouse | Used for data warehouses, data migration, database sorts and joins, and other processes |
| Lived on-site | Cloud-based versions emerging as an option |
The field continues to advance. On the horizon are ETL products and techniques to handle non-relational and unstructured data such as Twitter streams, text documents, emails, clickstream data from websites, and even the endless chatter of machines in manufacturing applications and elsewhere along the value chain.
Although ETL was originally oriented primarily to supporting data warehousing it now has broader roles, including migrating data to new applications, says Jerry Irvine, CIO of Prescient Solutions, an IT consulting firm based in Chicago.
For example, Irvine says ETL has been used to help eliminate “dual entry” —where multiple systems gathered similar information. Each system then became a kind of data silo. Instead, a company can use ETL tools to prepare the data to serve as a central data source, providing requested data to different applications.
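One way to picture that consolidation role: records for the same entity arrive from several systems, and ETL logic reconciles them into one canonical source. The sketch below is a hypothetical illustration of the idea, not any vendor's method; the record layout and the "most recently updated record wins" rule are assumptions made for the example.

```python
def consolidate(sources):
    """Merge records from multiple systems into one canonical set.

    `sources` is a list of record lists, one per source system. Each
    record is a dict with a shared 'customer_id' key and an 'updated'
    date string; the most recently updated record for each id wins
    (a hypothetical reconciliation rule).
    """
    canonical = {}
    for records in sources:
        for rec in records:
            cid = rec["customer_id"]
            current = canonical.get(cid)
            if current is None or rec["updated"] > current["updated"]:
                canonical[cid] = rec
    return list(canonical.values())
```

Applications then query the consolidated set instead of entering the same data twice, which is the silo-breaking effect Irvine describes.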
Irvine says he sees small- and mid-size companies continuing to use ETL in those roles but particularly for data warehousing. “There it is used not simply for extraction, transformation, and loading but to provide a continual snapshot from disparate applications, continually grabbing new information; or as a periodic process, updating multiple times each day. It is, in short, central to many IT operations.”
Irvine says this helps explain why many database applications have ETL capabilities built-in. “They have the ability to go and grab information from other products—even collaboration tools like SharePoint or WebSphere provide some similar functions,” he says. ETL also plays a role in helping to “accumulate information for big data,” says Irvine. Nor is ETL just about stoking analytic activities. It also has a role in migrating data in general—in effect, mediating between sources and targets.
ETL Transformed: Suites on the Rise
Those observations by Jorgensen and Irvine underscore a tremendous transition in ETL technologies. The name refers to something that “isn’t just ETL anymore,” says Phil Russom, research director for data management at The Data Warehousing Institute.
Russom notes that vendors have transitioned toward suites of tools in which ETL is a key element. The transition from standalone product to suite has occurred over a ten-year period, driven by the needs of users. Users were no longer just shoveling data into a single destination, like a data warehouse focused on marketing or financial analysis. They had developed more complex needs and maybe even a more sophisticated outlook. That's continuing today, says Russom, especially the sophisticated-outlook part of the equation.
The ETL-related applications experts cited include:
Data federation and data replication. While ETL is indispensable for data warehousing, Russom says, it is also vital for these functions. Examples of suites include Oracle Warehouse Builder, SAS Data Integration Studio, and Pentaho’s Kettle data integration project.
Data management. Russom says that many of the suites or platforms that incorporate ETL are moving into fields such as master data management, data quality, and metadata management. Master data management (MDM) is a set of processes, policies, standards and tools that defines and manages the non-transactional data of an organization. Metadata is data that describes other data, including its location, media type and other attributes.
Real-time enterprise applications. Carl Olofson, an IDC analyst, said the interest in real-time applications has resulted in an important shift from bulk data movement, the classic mainstay of ETL typically used for large batch jobs such as data warehouse loading, to dynamic data movement. In this scenario, ETL will continue to be important but it will be supplemented by other tools with a real-time focus, Olofson says.
Cloud-based offerings. Jorgensen says he sees potential for delivering ETL capabilities through the cloud, especially for small- and medium-sized businesses. Informatica is among the vendors with such an offering.
Approaching ETL with Data Integration in Mind
Russom argues that while ETL is still crucial to deliver, it should always be considered and implemented within a broader view of data and data integration. That means bringing everything from BI and data warehousing to operational databases into an ETL discussion, he says.
It also means joining together the oversight and management of previously separated data processes. Users typically have ended up with technology silos: one silo for developing and managing data integration solutions, another for data quality, and, more recently, one for master data management, Russom says. Though all of these teams and their associated processes involve data management disciplines and touch the same enterprise data, there is little or no coordination among them. What has been lacking is an enterprise-wide focus, an enterprise data strategy that takes into account all data and uses and ensures that data management disciplines work toward common goals, following common standards, he says.
“If you want enterprise data to comply with enterprise-wide standards, you should be making sure your data integration and data quality activities integrate at the right point, with mutual awareness,” Russom says. This means that data integration and data quality activities (and the people who control them) are focused on the same objectives, and that integration occurs at the point where it is most cost-effective, something that needs to be determined case by case, he says.
Russom says he expects the trend toward broader integration of ETL and other data functions will grow as users increase their sophistication and begin to coordinate their efforts through a master data management process. That trend, in turn, means that many customers should consider suites for ETL that incorporate a soup-to-nuts view of data.
Alan R. Earls is a business and technology writer based near Boston.