Back in the very early 1990s, when database applications were written in languages like FoxPro, Clipper, and Paradox, much of the data that fed those systems came from “extracts” – flat files from mainframes and minicomputers. To read, clean, and load this data, developers often wrote custom code to get the job done.
With the shift to the client-server architecture in the mid-90s, databases became more centralized and the procedures for interacting with them became more formal. In response, a new category of tooling emerged under the moniker of ETL – extract, transform, and load.
ETL tools allowed data transformations to be expressed graphically, and they included the “plumbing” necessary to run the transformations on a scheduled basis and manage exceptions that occurred during processing. They also worked very well for loading data warehouses from source databases, and this in fact became ETL’s primary use case.
Enter Big Data
ETL tools persist to this day, but their applicability to big data is limited. That’s because big data systems are fed with data coming from things as wide ranging as log files, sensor readings, and even digital images, sound, and video. Working with data like that simply has a different scope, and frameworks for such work are premised on less formal structures than is typical of ETL.
Because of this mismatch between big data and ETL, and also because of the trend toward self-service, a new category of tools has emerged, known as self-service data preparation. The category has grown big enough to merit comparative product reports from various analyst firms and has even prompted one ETL vendor (a most iconic one, in fact) to bring its own data-prep tool to market.
Is My Data “Quality”?
Most data-prep tools provide the areas of functionality previously covered by modern ETL tools as well as by data quality tools. The latter focus on removing outlier values; duplicate data; inconsistencies in formats, entity names, or identifiers; and “dirty” patterns in the data, like extraneous spaces or other characters.
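The kinds of repairs described here can be illustrated with a short, purely hypothetical sketch. The sample values and the median-based outlier rule are assumptions made for the example, not features of any particular product:

```python
import statistics

# Hypothetical repair sketch: trim extraneous spaces, drop duplicate
# values, and discard crude outliers. The threshold (10x the median)
# is an arbitrary illustrative heuristic.
raw = ["  42 ", "40", "40", "9999", " 41"]

cleaned = [v.strip() for v in raw]        # strip extraneous spaces
deduped = list(dict.fromkeys(cleaned))    # remove duplicates, keep order
values = [int(v) for v in deduped]

med = statistics.median(values)
repaired = [v for v in values if v <= 10 * med]
print(repaired)  # the 9999 outlier is gone
```

Real data quality tools wrap operations like these in a point-and-click interface; the point here is only that the underlying repairs are mechanical.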
In effect, data prep tools focus both on repairing data and on transforming it. The user interfaces in the tools address both, but the emphasis seems to be more on repair than transformation. In fact, for several tools, doing transformation work often involves dropping down to code, which brings us full circle back to the data-conversion coding of 25 years ago. This raises the question of which is more important to a majority of users: data repair or transformation?
The availability of tools on the market would suggest that data repair is the more important area. But is it? A hallmark of big data is that the data sources are varied, and that schema and semantic models are dispensed with, allowing the data to be shaped differently for different analyses. Indeed, in the MapReduce algorithm that originally typified big data, the “Map” step is mostly geared toward parsing, interpreting, and transforming the data into a form ready to be aggregated by the “Reduce” step.
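That division of labor can be shown with a minimal MapReduce-style sketch in plain Python; the log format (timestamp, user, URL) and field positions are assumptions made for the example:

```python
from collections import defaultdict

def map_step(line):
    # "Map": parse a raw "timestamp user_id url" log line and emit
    # one (url, 1) pair per hit. Malformed lines are skipped.
    parts = line.split()
    if len(parts) == 3:
        yield (parts[2], 1)

def reduce_step(pairs):
    # "Reduce": aggregate the mapped pairs into total hits per URL.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

logs = [
    "2016-01-01T10:00:00 u1 /home",
    "2016-01-01T10:00:05 u2 /home",
    "2016-01-01T10:00:09 u1 /cart",
]
pairs = (pair for line in logs for pair in map_step(line))
print(reduce_step(pairs))  # hit counts per URL
```

Note how most of the interesting work – parsing and reshaping the raw lines – happens in the map step; the reduce step is a simple aggregation.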
Transformation activities, such as taking Web logs and aligning their data to user sessions, or parsing social network data to determine brand sentiment, involve heavy work on data in one form and mapping it to something else. Transformation is the essence of big data preparation. Without it, big data would just be BI.
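To make the Web-log case concrete, here is a hedged sessionization sketch. The timestamp format, the (timestamp, user, url) record shape, and the 30-minute session gap are all illustrative assumptions:

```python
from datetime import datetime, timedelta
from itertools import groupby

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cutoff

def sessionize(hits):
    # hits: iterable of (timestamp_str, user, url) records.
    # Sort by user, then time, so each user's hits arrive in order.
    parsed = sorted(
        (user, datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S"), url)
        for ts, user, url in hits
    )
    sessions = []
    for user, group in groupby(parsed, key=lambda rec: rec[0]):
        current, last_time = [], None
        for _, ts, url in group:
            if last_time is not None and ts - last_time > SESSION_GAP:
                sessions.append((user, current))  # gap too big: close session
                current = []
            current.append(url)
            last_time = ts
        sessions.append((user, current))
    return sessions

hits = [
    ("2016-01-01T10:00:00", "u1", "/home"),
    ("2016-01-01T10:05:00", "u1", "/cart"),
    ("2016-01-01T11:00:00", "u1", "/home"),
    ("2016-01-01T10:00:00", "u2", "/home"),
]
print(sessionize(hits))
```

Even in this toy form, the work is all reshaping: one record layout in, a different one out, with the session boundaries inferred rather than present in the source.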
Right now, the scope of functionality for most standalone data-prep tools is rather narrow. It encompasses data quality and remediation, and limited transformation functionality, like splitting columns and joining tables. Such tasks are important, and user interfaces that make shorter work of them are valuable.
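Those two tasks can be sketched in a few lines of plain Python; the table contents and column names (name, dept_id, and so on) are invented for illustration:

```python
# Two common prep transformations: splitting a delimited column into
# two columns, and joining two tables on a shared key. Field names
# and data are hypothetical.
employees = [
    {"name": "Ada Lovelace", "dept_id": 1},
    {"name": "Alan Turing", "dept_id": 2},
]
departments = {1: "Math", 2: "Computing"}

def split_name(row):
    # Split the "name" column into "first" and "last" columns.
    first, last = row["name"].split(" ", 1)
    return {**row, "first": first, "last": last}

def join_dept(row):
    # Join against the departments table on dept_id (inner-join style).
    return {**row, "dept": departments[row["dept_id"]]}

prepared = [join_dept(split_name(r)) for r in employees]
print(prepared[0]["first"], prepared[0]["dept"])  # Ada Math
```

A good data-prep UI makes operations like these a couple of clicks rather than code, which is exactly why they are table stakes for the category.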
But it’s important to understand that such functionality tees users up to do more: heavy transformation work, analysis, visualization, and then further transformation work whose need emerges from that analysis.
Users need to look for tools that will help them with all of this and, preferably, to do as much of it as possible in a single product, avoiding the need to shuttle between disparate tools. Does your prep or analytics tool meet all these needs? If not, you may need to supplement your toolbox.
Andrew Brust is Senior Director, Market Strategy and Intelligence, at Datameer, liaising between the Marketing, Product, and Product Management teams, and the big data analytics community. Andrew writes a blog for ZDNet called “Big on Data”; is an adviser to NYTECH, the New York Technology Council; serves as Microsoft Regional Director and MVP; and writes the Redmond Review column for VisualStudioMagazine.com.