Editor’s note: This article is the fifth in a series examining issues related to evaluating and implementing big data analytics in business.
It should not come as a surprise that in a big data environment, as in any environment, end-users may have concerns about the believability of analytical results when there is limited visibility into the trustworthiness of the data sources. This concern has driven the continued development and maturation of processes and tools for data quality assurance, data standardization, and data cleansing. In essence, data quality is generally seen as a mature discipline, particularly when the focus is evaluating datasets and applying remedial or corrective actions to ensure that those datasets are fit for the purposes for which they were originally intended.
In the past five years or so, there have been two realizations that have, to some extent, disrupted this perception of “data quality maturity.”
The first is the recognition that datasets created for some functional purpose within an organization (such as sales, marketing, accounts payable, or procurement to name a few) are reused in different contexts, particularly for reporting and analysis. The implication is that data quality can no longer be measured in terms of “fitness for purpose,” but instead must be evaluated in terms of “fitness for purposes,” taking all downstream uses and quality requirements into account.
The second realization, which might be considered a follow-on to the first, is that ensuring the usability of data for all purposes requires more comprehensive oversight. Such oversight should include monitored controls incorporated into the system development life cycle and across the application infrastructure.
The Discipline of Data Governance
These realizations have led to the discipline called data governance. Data governance describes the processes for defining corporate data policies, coupled with the organizational structures (such as data governance councils and data stewards) put in place to monitor, and hopefully ensure, compliance with those policies.
Stated simply, the objective of data governance is to institute the right levels of control to achieve three outcomes: identify data issues that might have negative business impact; prioritize those issues in relation to their corresponding business value drivers; and have data stewards take the proper actions when alerted to the existence of those issues.
When focused internally, data governance not only enables a degree of control for data created and shared within an organization, it empowers the data stewards to take corrective action, either through communication with the original data owners, or by direct data intervention (that is, “correcting bad data”) when necessary.
Naturally, concomitant with the desire for measurably high-quality information in a big data environment is the inclination to institute “big data governance.” It is naive, however, to assert that the approach to data quality in big data governance is the same as the traditional approach.
That is because when we examine the key characteristics of big data analytics, the analogy with the conventional approaches to data quality and data governance starts to break down. Consider the general approach to data quality, in which levels of data usability are measured based on the idea of “data quality dimensions,” such as:
- Accuracy, referring to the degree to which the data values are correct;
- Completeness, which specifies the data elements that must have values;
- Consistency of related data values across different data instances;
- Currency, which looks at the “freshness” of the data and whether the values are up to date;
- Uniqueness, which specifies that each real-world item is represented once and only once within the data set.
These types of measures are generally intended to validate data against defined rules and catch errors when the input does not conform to those rules. This approach typically targets moderately sized datasets from known sources, with structured data and a relatively small set of rules. Traditionally sized operational and analytical applications can integrate data quality controls, alerts, and corrections, and those corrections reduce the downstream negative impacts.
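As a minimal sketch of this rule-based validation approach (the record layout, field names, and cutoff date are invented for illustration), checks along three of the dimensions above might look like:

```python
from datetime import date

# Hypothetical customer records; the fields are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "updated": date(2024, 1, 5)},
    {"id": 2, "email": None,            "updated": date(2019, 6, 1)},
    {"id": 1, "email": "c@example.com", "updated": date(2024, 2, 1)},
]

def completeness(recs, field):
    """Completeness: flag records missing a value for a required field."""
    return [r["id"] for r in recs if not r.get(field)]

def currency(recs, field, cutoff):
    """Currency: flag records whose values were last refreshed before the cutoff."""
    return [r["id"] for r in recs if r[field] < cutoff]

def uniqueness(recs, key):
    """Uniqueness: flag identifiers that appear more than once in the dataset."""
    seen, dupes = set(), []
    for r in recs:
        if r[key] in seen:
            dupes.append(r[key])
        seen.add(r[key])
    return dupes

print(completeness(records, "email"))                  # record 2 lacks an email
print(currency(records, "updated", date(2023, 1, 1)))  # record 2 is stale
print(uniqueness(records, "id"))                       # id 1 appears twice
```

In a traditional environment, each flagged identifier would be routed to a data steward for correction; as the article argues, that remediation loop is exactly what becomes difficult at big data scale.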
The Difference with Big Datasets
On the other hand, big datasets don’t exhibit these characteristics, nor do they have similar types of business impacts. Big data analytics is generally centered on “mass consumption” of a combination of structured and unstructured data from both machine-generated and human sources. Much of the analysis is done without considering the business impacts of errors or inconsistencies across the different sources, where the data came from, or how frequently it is acquired. Big data applications look at many input streams originating within and outside the organization, some taken from a variety of social networking streams, syndicated data streams, news feeds, preconfigured search filters, public or open-sourced datasets, sensor networks, or other unstructured data streams. Such diverse datasets resist singular approaches to governance.
Another issue is the development and execution model for big data applications. When data analysts develop their own models in private “sandboxes,” they often bypass traditional IT and data management channels, opening greater possibilities for inconsistencies with sanctioned IT projects. This is further complicated as datasets are tapped into or downloaded directly without IT’s involvement.
A third, and probably the most difficult, issue is the question of consistency. When datasets are created internally and a downstream user recognizes a potential error, that issue can be communicated to the originating system’s owners. The owners then have the opportunity to find the root cause of the problem and correct the processes that led to the errors. With big data systems that absorb massive volumes of data originating externally, there are limited opportunities to engage process owners to influence modifications to the source. Conversely, if you opt to “correct” the potential data flaw, you are introducing an inconsistency with the original source, which at worst can lead to incorrect conclusions and flawed decision-making.
Four Key Concepts for Big Data Oversight
So to some extent what might be called the standard approach to data governance cannot be universally applied to big data applications. And yet there is definitely a need for some type of oversight that can ensure that the analytic results are believable. One way to address the need for data quality and consistency is to leverage the concept of data policies based on the information quality characteristics that are important to the big data project.
This means considering the intended uses of the results of the analyses and how the absence of control on the sourcing side of the information production flow is mitigated by the users on the consumption side. This approach requires four key concepts for data practitioners and business process owners to keep in mind:
1. Consumer data usability expectations. Since the end-users of a big data analytics project may change over time, ascertain these expectations through a combination of techniques: solicit requirements from the known end-users, and supplement them with some degree of speculation about who the pool of potential end-users is, what they might want to do with a dataset, and, correspondingly, what their levels of expectation are. Then establish how those expectations can be measured and monitored, as well as the realistic remedial actions that can be taken.
2. Consistency of metadata, in which there is some definition (and therefore control) over concept variation in source data streams. Introducing conceptual domains and hierarchies can help with semantic consistency, especially when comparing data coming from multiple source data streams.
3. Repurposing and reinterpretation of data, keeping in mind that any acquired dataset may be used for any potential purpose at any time in the future. Establishing some limits around this broad scheme may be necessary when determining what data to acquire and what to ignore, which concepts to capture and which to discard, the volume of data to be retained and for how long, and other qualitative data management and custodianship policies.
4. Critical data quality dimensions for big data. This last item must look at what is controllable within the big data environment and which of the defined dimensions are relevant to the business, such as data currency or unique identifiability.
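To make the second concept concrete, here is a hypothetical sketch (the domain table, feed names, and values are invented) of mapping variant source values onto a shared conceptual domain, so that feeds using different vocabularies can be compared without “correcting” the original sources:

```python
# Hypothetical conceptual domain: many source-specific spellings map
# to one canonical concept. Unknown values are flagged, not altered.
CANONICAL_COUNTRY = {
    "US": "United States", "USA": "United States", "U.S.": "United States",
    "UK": "United Kingdom", "GB": "United Kingdom",
}

def harmonize(value, domain, default="UNKNOWN"):
    """Map a raw source value to its canonical concept; flag values outside the domain."""
    return domain.get(value.strip(), default)

feed_a = ["US", "UK"]            # e.g. an internal CRM extract
feed_b = ["USA", "GB", "France"]  # e.g. an external syndicated feed

print([harmonize(v, CANONICAL_COUNTRY) for v in feed_a])
print([harmonize(v, CANONICAL_COUNTRY) for v in feed_b])
```

Because the raw feeds are left untouched and only the harmonized view is governed, this avoids the consistency trap described earlier, where correcting external data introduces a divergence from the original source.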
Data governance is still somewhat in its infancy, and it is challenging to attempt to adapt a collection of organizational frameworks designed for a controllable environment to a big data world in which there are limits to the amount of control that can be exercised over the data. Future articles will look more closely at big data policies and how they can be organized to help institute a level of trust in analytical results.
David Loshin is the author of several books, including Practitioner’s Guide to Data Quality Improvement and the second edition of Business Intelligence—The Savvy Manager’s Guide. As president of Knowledge Integrity Inc., he consults with organizations in the areas of data governance, data quality, master data management and business intelligence. Email him at firstname.lastname@example.org.