Data Profiling Builds Crucial Foundation for Data Projects

by   |   October 5, 2012 5:35 pm   |   0 Comments

The goal for collecting and storing data is to use it to gain business insights and build value.

But without a solid data foundation, insights from analytics won’t be as accurate, and the value could be questionable. That foundation is built on data profiling: taking stock of what is stored in the database and how complete those records are.

Gordon Hamilton, data quality consultant.

Gordon Hamilton, a data governance consultant based in the Vancouver area, said data profiling needs to be done early and often to keep any data sources fresh and accurate for applications such as dashboards and analytical reports.

“Data profiling is looking at the data to see what actually exists, compared to what the logical data model says that’s in there,” Hamilton said. “A data profile shows reality, the data model shows wishful thinking, and it might even been historical wishful thinking. Data models are seldom kept up to date.” The results from a data profiling project can guide business intelligence teams to base their decisions on what is actually in the database, instead of what plans say should be there.

Related Story

Group establishes data quality professional certification.

Read more»

Hamilton said doing a data profile should be as common as updating a company’s balance sheet, and should be done ahead of any new data project.

Because the end result of any data-based project is for business aims, getting business users involved is early is crucial, according to data governance expert David Loshin. In a paper for data management software provider DataFlux, a unit of SAS, Loshin wrote that the planned business outcomes can often dictate which data profiling methods should be used. He writes:

Data Profiling Methods

According to David Loshin, there are four methods commonly employed by data profiling tools:

  • Column profiling, which provides statistical measurements associated with the frequency distribution of data values (and patterns) within a single column (or data attribute).
  • Cross-column profiling, which analyzes dependencies among sets of data attributes within the same table.
  • Cross-table profiling, which evaluates relationships and intersections between tables.
  • Data rule validation, which verifies the conformance of data instances and data sets with predefined rule.

Source: The Practitioner’s Guide to Data Profiling, A DataFlux White Paper prepared by David Loshin.

“By clarifying specific analysis processes and techniques … that use data profiling technology, the analyst team can establish a well-defined scope with specific goals (such as documenting metadata or identifying potential data errors) that can be achieved in a timely and predictable manner.”

Data quality issues can often appear in a data migration, such as moving an important business system like CRM or ERP from one database to another. Problems with data quality can cause some serious delays. Without profiling the data to see what is actually stored instead of what is supposed to be stored, project managers can find a nasty surprise waiting for them when the data for critical system isn’t standardized for a new database.

“That can cause things like late-stage product delays, cost overruns and cancellations,” said Steve Sarsfield, a product marketing manager at Talend.

Sarsfield said data project managers usually work in five steps:

  • Gather project requirements
  • Write a project plan
  • Program and implement the plan
  • Go live
  • Maintain the project over time.

Sarsfield, who is the author of The Data Governance Imperative, said project managers ought to include a data profiling piece in each of those five steps.

“When you’re doing the project, make sure you have time to go in and handle all the data quality issues that you may have, and then when actually the project goes live, you may want to also be ready with ways to mitigate any data issues you have,” he said. “That’s what the smart project managers do, they look at the whole project plan and bake data profiling and data quality into the entire five step process.”

Talend offers data profiling tools as part of their data integration and management software package; several companies from industry giants to small software developers offer some sort of data profiling tool. Gartner has listed SAS’s DataFlux, Trillium, SAP, IBM and Informatica as market leaders in the field.

Source: Analyst Ted Friedman, for Gartner Research. Published Aug. 8, 2012.

But despite plenty of tools in the market, Hamilton said he doesn’t believe that business intelligence teams evaluate data profiles often enough, or communicate with business users to determine what is actually reflected in the data store.

“Even if data profiling is done, I don’t think it’s done with as much enthusiasm or as early or as often as it should be,” he said. “I see data profiling as part of this long manufacturing process, where you get this raw material … and then you put it through data profiling at a rough level, just to get an idea of [whether] it can be reported on and if there can be logical conclusions drawn from it, or if it’s just pure chaos. It’s a continuum, right from the first sampling of the raw ore into the agile report development.”

Email Staff Writer Ian B. Murphy at Follow him on Twitter .

Tags: ,

Post a Comment

Your email is never published nor shared. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>