How a Small Data Error Becomes a Big Data Quality Problem

By Steve Sarsfield of Talend   |   September 3, 2012

The “butterfly effect,” an idea from chaos theory in mathematics, refers to the way a minor event—like the movement of a butterfly’s wing—can have a major impact on a complex system like the weather. The movement of the butterfly wing represents a small change in the initial condition of the system, but it starts a chain of events: moving pollen through the air, which causes a gazelle to sneeze, which triggers a stampede of gazelles, which raises a cloud of dust, which partially blocks the sun, which alters the atmospheric temperature, which ultimately alters the path of a tornado on the other side of the world.

Enterprise data is equally susceptible to the butterfly effect. When poor-quality data enters the complex system of enterprise data, even a small error, such as transposed letters in a street address or part number, can lead to revenue loss, process inefficiency and failure to comply with industry and government regulations. Because organizations depend on moving and sharing data throughout the enterprise, the impact of data quality errors is costly and far-reaching. Such issues often begin with a tiny mistake in one part of the organization, but the butterfly effect can produce disastrous results as the error makes its way through CRM, ERP, billing, data warehouse and other enterprise systems.

A Cascading Spelling Mistake
For example, if two call center workers enter the same customer address as “25 Main St.” and “25 Mian St.,” the typo could have a cascading effect (a minimal sketch of the resulting duplicate follows the three impact areas below). Poor big data quality hurts a company’s revenue, operational efficiency and regulatory compliance.

Revenue: Without accurate data on customers, an organization can’t achieve its revenue goals. Poor data quality most often undermines the ability to reach customers and meet their needs: it harms efforts to maintain accurate customer records, including purchase histories, which can mean missed sales opportunities, and it leads to communication mistakes that hurt customer satisfaction.

Operational efficiency: Untrustworthy data wastes time and resources. In the Main Street example, the typo can lead to wasted money on mailing costs. Such errors force an organization to check and recheck facts and figures before making decisions, and poor data quality prevents data from being easily shared across the organization.

Compliance: Poor data affects the ability to comply with government regulations such as Sarbanes-Oxley, Basel II, the Do Not Call Registry and HIPAA. Lack of compliance can lead to unnecessary fines and levies.
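To make the cascade concrete, here is a minimal sketch, in Python with purely illustrative records, of why an exact-match view of the data treats the two entries as different customers:

```python
# Two call center entries for the same customer, one with transposed
# letters. Records and field names are illustrative, not from any real system.
records = [
    {"customer": "Acme Corp", "address": "25 Main St."},
    {"customer": "Acme Corp", "address": "25 Mian St."},  # the typo
]

# An exact comparison sees two distinct addresses, so the duplicate
# propagates untouched into CRM, billing and the data warehouse.
unique_addresses = {r["address"] for r in records}
print(len(unique_addresses))  # 2 -- the typo survives as a "new" address
```

Fuzzy matching, covered below, is one common way to catch this kind of near-duplicate before it spreads.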

Data on the Move
When data enters the corporate ecosystem, it rarely stays in one place. Consider the typo in the customer address as it travels throughout the enterprise. Marketing accesses the data in the CRM system to reach customers. A successful campaign results in orders that affect shipping, billing, supply chain, customer support and other systems. Finally, a manager will want to aggregate all campaigns and orders into a data warehouse for reporting. If data enters the ecosystem incomplete, incorrect or duplicated, many systems are affected.

There are a number of tools and techniques that can help prevent big data quality issues from leading to big consequences:

  • Data Profiling enables a solid understanding of big data quality before an organization implements a major CRM, ERP, billing or data warehouse project, preventing unwelcome surprises such as discovering serious quality issues just before a system goes live. Data profiling can also check the completeness and integrity of name, address, e-mail, web URL and other client data, and predefined profiling reports can watch for violations of data quality thresholds, such as the Main Street typo (see the profiling sketch after this list).
  • Reference Data serves as a “look-up” table for standardizing parts and descriptions. Many industries have standard ways to designate parts and descriptions; a logistics company, for example, cannot reliably deliver customer freight without accurate route, vessel and location data. Big data quality tools can enforce conformity to these standards.
  • Data Standardization is integral not only to data quality but also to the effectiveness of master data management, CRM, ERP and many other business applications. Users may also opt to standardize data shapes. In an ERP or supply chain system, for example, a company may decide to always designate part numbers as NN-AAAAA, where N is a number and A is an alphanumeric character. In this scheme, part number 12-HGAJS would be valid, while 12HGAJS_2 would not (see the part-number sketch below).
  • Matching Technology goes a step further than standardization, helping to find duplicates and to recognize households and other relationships in customer records. There are two common types: deterministic and probabilistic. In deterministic, or “rules-based,” matching, records are compared using fuzzy algorithms. These algorithms allow for a little bit of “slop” in the data, so that typos or phonetic similarities (like “ph” and “f”) do not prevent records from being linked. Probabilistic matching is smart enough to know that a common last name like “Jones” should play a smaller role in a match than a less common last name, like “Jimmerson.” How does it know? Probabilistic matching technology performs statistical analysis on the data to determine how frequently each value occurs, then uses that analysis to weight the match (both styles are sketched below).
  • Monitoring. Data quality issues are rarely one-and-done. Regular monitoring of key data quality metrics (common examples are error rate, completeness and consistency) ensures that reports are accurate and that marketing materials and invoices reach their destinations on time. Each organization can determine which data quality metrics matter most to its business processes. By reviewing those metrics regularly and following their trends, a company can track, through ongoing profiling, whether the quality of its data is improving or degrading, which helps build alignment and highlight areas for improvement (a minimal metric-tracking sketch follows the other examples below).
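As a concrete illustration of the profiling checks described above, here is a minimal sketch in Python. The sample rows, field names and the 95 percent threshold are illustrative assumptions; commercial profiling tools run comparable completeness and pattern checks at much larger scale:

```python
import re

# Purely illustrative customer rows; a real profiler would scan millions.
customers = [
    {"name": "A. Jones", "email": "ajones@example.com", "zip": "02110"},
    {"name": "B. Smith", "email": "", "zip": "2110"},  # missing email, bad zip
]

def completeness(rows, field):
    """Share of rows where the field is present and non-empty."""
    return sum(1 for r in rows if r.get(field)) / len(rows)

ZIP_PATTERN = re.compile(r"^\d{5}$")  # assumed US 5-digit format

email_score = completeness(customers, "email")
zip_violations = [r for r in customers if not ZIP_PATTERN.match(r["zip"])]

# Flag the dataset when it falls below an (illustrative) 95% threshold.
if email_score < 0.95 or zip_violations:
    print(f"email completeness {email_score:.0%}; "
          f"{len(zip_violations)} zip format violation(s)")
```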
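The NN-AAAAA shape from the standardization item maps directly to a regular expression. A minimal validation sketch, using the article’s own example part numbers (everything else is an assumption):

```python
import re

# Two digits, a hyphen, then exactly five alphanumeric characters.
PART_NUMBER = re.compile(r"^\d{2}-[A-Za-z0-9]{5}$")

for part in ("12-HGAJS", "12HGAJS_2"):
    status = "valid" if PART_NUMBER.match(part) else "invalid"
    print(f"{part}: {status}")
# 12-HGAJS: valid
# 12HGAJS_2: invalid
```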
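Both matching styles can be sketched in a few lines. The deterministic side below uses Python’s standard-library difflib to compute a fuzzy similarity score; the probabilistic side weights a match by how rare the matched value is. The 0.85 threshold and the tiny frequency table are illustrative assumptions, not real population statistics:

```python
from difflib import SequenceMatcher
import math

def fuzzy_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; tolerant of typos and transpositions."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Deterministic ("rules-based"): a fixed rule allows a little "slop."
score = fuzzy_ratio("25 Main St.", "25 Mian St.")
print(f"{score:.2f}")  # ~0.91, so a >= 0.85 rule links the two records

# Probabilistic: rarer values carry more evidence, so a match on the
# uncommon "Jimmerson" is weighted more heavily than one on "Jones."
surname_freq = {"Jones": 0.01, "Jimmerson": 0.00002}  # assumed frequencies
for name, freq in surname_freq.items():
    print(name, round(-math.log(freq), 2))  # higher weight for rarer names
```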
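Finally, a minimal sketch of metric monitoring, assuming quality scores are captured on each run (say, nightly or monthly) and compared with the previous snapshot; the metric names and values are illustrative:

```python
# Illustrative snapshots of two quality metrics across two runs.
history = [
    {"run": "2012-08-01", "completeness": 0.97, "error_rate": 0.02},
    {"run": "2012-09-01", "completeness": 0.93, "error_rate": 0.05},
]

previous, latest = history[-2], history[-1]
for metric in ("completeness", "error_rate"):
    delta = latest[metric] - previous[metric]
    print(f"{metric}: {latest[metric]:.2f} ({delta:+.2f} vs. previous run)")
# completeness: 0.93 (-0.04 vs. previous run)
# error_rate: 0.05 (+0.03 vs. previous run)
```

Here both metrics moved in the wrong direction between runs, which is exactly the kind of trend the monitoring described above is meant to surface.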


Data Quality and the Organization
When faith in the integrity of the data is lost, it’s hard to generate enthusiasm for new ideas and campaigns, because projects seem (and often are) doomed to failure.

Conversely, trustworthy data motivates people to harness the information in new ways, giving rise to fresh ideas. Data quality is not just about saving money—though there is a tangible payback when eliminating inaccuracies and duplication from information systems. It’s about creating new opportunities by harmonizing the data from disparate systems and providing executives and business teams with quality data.

Steve Sarsfield, product marketing manager for data governance and data quality at Talend, is the author of The Data Governance Imperative and blogs at Data Governance and Data Quality Insider. Follow him on Twitter at @SteveSarsfield.