With all the discussion of managing enormous data volumes, and the analytics challenges and opportunities that data presents, there are questions about whether ETL—the traditional process of extract, transform and load—is still relevant. Last year, for example, researchers at Gartner encouraged IT managers to think more broadly about the issue, saying in their market report on data integration tools that these technologies have supplanted traditional ETL. And tools that were once described as ETL have morphed into technologies designed to do much more.
As the big data trend itself shows, the names of technology categories can evolve; it is the business problems organizations are working to address that matter. With growing volumes, variety and velocity of data to manage, and rising demand to glean insights from that data, consultants say that ETL-style tools work well for some organizations adapting their existing analytics systems to new trends, while others will gravitate to tools that access data from the open-source Hadoop Distributed File System.
Adam Jorgensen, president of consultancy Pragmatic Works, sees the issue as an outgrowth of the style of analytics which new data sources are inspiring. While ETL was focused on moving data from point A to point B and preparing it for consumption, the focus with big data streams is increasingly on analyzing and extracting information quickly from data at an aggregate level.
“We work with a TV channel that ingests 3 terabytes of data every day, mostly from Web logs; there is no way they could aggregate that in a traditional BI solution. And they can’t go in and use traditional ETL to get that [level of] abstraction because their ETL tools aren’t up to the task,” he says.
Instead, Jorgensen says, he sees clients relying on open-source tools such as Hive, a SQL-like query and data warehouse system for Hadoop; Pig, a data flow language and execution framework for parallel computation also used with Hadoop; and Python, a programming language often used for Web-based applications. They are using these tools to comb through raw data and return useful information like the hourly click rate of a website by page, which in turn can be injected into a traditional BI tool for further analysis.
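As a rough illustration of the kind of aggregation Jorgensen describes, the following Python sketch counts clicks per page per hour from raw Web log lines. It is not taken from any client system: the log layout (Common Log Format), the function name hourly_clicks_by_page and the sample lines are all assumptions made for the example, and a production job at terabyte scale would more likely run as a Hive or Pig job on Hadoop.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed log layout (Common Log Format); real Web logs may differ.
LOG_PATTERN = re.compile(
    r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "(?:GET|POST) (?P<page>\S+) [^"]*" \d{3} \S+'
)


def hourly_clicks_by_page(lines):
    """Count requests per (hour, page) pair from raw log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        hour = ts.strftime("%Y-%m-%d %H:00")  # truncate the timestamp to the hour
        counts[(hour, match.group("page"))] += 1
    return counts


# Hypothetical sample data, including one line that should be ignored.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2012:13:05:36 -0700] "GET /home HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2012:13:47:02 -0700] "GET /home HTTP/1.1" 200 2326',
    '203.0.113.5 - - [10/Oct/2012:14:01:11 -0700] "GET /schedule HTTP/1.1" 200 512',
    'malformed line',
]

if __name__ == "__main__":
    for (hour, page), clicks in sorted(hourly_clicks_by_page(SAMPLE_LOG).items()):
        print(hour, page, clicks)
```

The resulting table of hour, page and click count is small enough to load into a conventional BI tool, which is the handoff the consultants describe.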
Marie Goodell, senior director of product marketing at SAP Information Management Solutions, says that just because an enterprise is dealing with more data, “you don’t necessarily need to change how you managed it in the past.” Goodell says, for example, that it is unwise to assume that something like Hadoop will answer all of an organization’s data extraction, transformation and load demands.
“There are some misconceptions in the market, that if you use MapReduce or Hadoop that you have no need for traditional ETL tools, and that’s not the case at all. Actually, both tools can be quite complementary,” she says. While Hadoop has some data movement capabilities, existing data integration (DI) tools offer filters and joins already built in, she adds.
It’s also true that technology and traditional data warehouse giants including SAP, IBM, Oracle, Teradata and Microsoft have made data management connections to Hadoop-based systems available, or have announced plans to do so.
An Echo of Past Scalability Challenges
Phil Russom, research director for data management at The Data Warehousing Institute, says the issue of managing big data is one of several “scalability crises” to cast a shadow over IT over the years.
“We often think it is big firms, with big web properties, that are the ones with big data issues; but big data has existed for years from sources other than the web,” Russom says. For example, telecommunication companies accumulated vast numbers of call detail records—which amounted to tens of terabytes even back in the 1990s.
When they were first developed in the 1990s, ETL tools were, in effect, data mart generators, and some vendors even called them that, he says. They were built to do transformations, but with relatively small data sets over relatively long time frames. Then users began to want to consolidate to fewer data marts and move toward data warehouses, in part out of a desire to achieve “a single version of the truth.” Suddenly, ETL tools needed to handle much larger volumes of data in order to encompass the whole data warehouse, leading to a first “crisis” in Russom’s view. In that case, vendors ramped up their offerings and users threw more hardware at the problem.
Then, around the turn of the millennium, Russom says, many organizations began to deploy far more applications, often connected with e-business, and that meant a lot more data. At that time, he notes, a lot of the focus was on finally pushing traditional business processes into digital or online modes. “Once you do that you can’t help but create more data,” he says. That, in turn, led to a great increase in the capabilities needed from ETL. In response to that scalability crisis, Russom says, all the leading vendors rewrote their ETL software to allow it to take advantage of parallel processing, a technology that has since become the cornerstone for providing both speed and scalability. “In the early 2000s, vendors made those critical investments, and that means we are in good shape now to handle the needs of big data,” he says.
In a sense, then, big data is an echo of the earlier millennial data crisis, according to Russom. Although many of the companies moving to the Web back then didn’t initially see the value of the volumes of data that new applications and websites generated, that has changed. Now it is a new competitive arena. “They all want to look at click stream data to see what is being read, what is most popular and to help understand how to better interact with customers,” he explains.
Russom says those kinds of realizations came early for big e-commerce sites and have become a key to profitability, through ETL and analytic processes. A similar process has only recently come to “low end” websites and businesses handling smaller volumes.
Another source of big data, increasingly in need of the ETL treatment, is machine-to-machine (M2M) data generated by robotics and manufacturing activities. Robots no longer confine their work to assembly; they also are involved with QA and testing, notes Russom. So, if a sensor detects a problem with a weld, it may even direct another robot to make a repair.
The result of all this robot chatter is lots of data to be analyzed, if it can be properly prepared and handled. One of the largest sources of M2M data is the vast world of personal electronic devices: cell phones, smart phones, and iPads. These devices are complex to manufacture and complex to test, and the industry “generates billions of rows of QA data,” notes Russom. “If you can analyze all that data, it tells you a lot about suppliers and the efficiency of the assembly line,” he says.
Will this ever-expanding universe of data eventually “choke” ETL? No, says Jerry Irvine, CIO of Prescient Solutions, an IT consulting firm, because the tools will always get better. They are becoming more user-friendly and less programmatically limited, he says. “They have nowhere to go but up as common interfaces are developed to the social networking world and other environments,” he adds.
Indeed, Irvine predicts that ETL tools will be able to “crawl” across Web applications and mobile apps because they will have a standard format and can grab data more easily. And over time, ETL applications will accrue industry standards and gain power.
At the end of the day, companies managing data should focus on the business decisions they need to make, adds Trish Harman, director of product marketing at SAP Information Management Solutions. “The fundamental principles of good information management don’t change just because the volumes of data increase,” Harman says. “If anything, those fundamentals are even more important.”
Alan R. Earls is a business and technology writer based near Boston.