The success of any big data analytics initiative – indeed, success as a data-driven organization – begins with (and could end with) the quality of your data. It seems fairly obvious that if the data isn’t clean and reliable, everything that flows from it will be compromised, but many organizations continue to struggle with data quality, challenged by data performance-management issues including stopping bad data and keeping data flows operating effectively.
Data performance management company StreamSets released the results of a survey it conducted with Dimensional Research to understand how well enterprises believe they are managing their big data flows. StreamSets co-founder and CEO Girish Pancha spoke with Data Informed about the survey results and what organizations can do to improve data quality.
Data Informed: What was the impetus behind conducting the study? What indicated a need to do this research?
Girish Pancha: Our discussions with large enterprise customers indicate the volume, variety and, if I may propose another V, volatility of important new data sources make data flow management difficult. New sources like Internet of Things sensors and systems logs often change structure or semantics without notice, polluting downstream data.
We wanted to find out whether companies realize that these changes have occurred and how they deal with them. Every company aspires to be a data-driven enterprise, but that effort is difficult when the nature of these new data sources compromises data quality and reliability.
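To make the idea of a source changing structure "without notice" concrete, here is a minimal Python sketch of a drift check. It is an illustration only and is not taken from the interview or from any StreamSets product; the function name `detect_drift`, the expected-schema dictionary, and the sensor field names are all hypothetical.

```python
from typing import Any


def detect_drift(expected: dict[str, type], record: dict[str, Any]) -> dict[str, list[str]]:
    """Compare one incoming record against an expected field -> type mapping.

    Returns the fields that are missing, the fields that were not expected,
    and the fields whose value no longer has the expected type.
    """
    missing = sorted(name for name in expected if name not in record)
    unexpected = sorted(name for name in record if name not in expected)
    retyped = sorted(
        name
        for name, expected_type in expected.items()
        if name in record and not isinstance(record[name], expected_type)
    )
    return {"missing": missing, "unexpected": unexpected, "retyped": retyped}


# Hypothetical sensor feed: the source silently renamed temp_c to temp_f.
EXPECTED_SCHEMA = {"device_id": str, "temp_c": float}
report = detect_drift(EXPECTED_SCHEMA, {"device_id": "a1", "temp_f": 71.2})
print(report)
```

A check like this only catches structural drift (renamed, dropped, or retyped fields). A change in semantics, such as the same field switching units, would need value-level rules on top of it.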
How widespread is this problem?
Pancha: Bad data is a huge problem. Our survey found that 87 percent of respondents admitted to flowing bad data into their stores, while 74 percent knowingly had bad data in their stores. Separately, a recent KPMG CEO survey showed that more than 75 percent of CEOs said they lack confidence in the quality of data used to make decisions.
What are some of the business risks of having bad data in your data stores?
Pancha: The major goal of just about every big company is to become a data-driven enterprise. Data drift combined with hand coding creates the risk that your analytic insights are based on incomplete or inaccurate data, generating false positives that trigger harmful business decisions, or false negatives that lead to a failure to act.
The long-term risk is to the big data initiative itself. If the data fails you once, you are less likely to trust it the next time. And if this becomes a pattern, you may pull back on your big data investment.
How does the growing focus on and value of real-time streaming analytics exacerbate this problem?
Pancha: Streaming analytics adds a whole new level of complexity to the question of how to derive value from data. In short, it creates a world where data is more perishable. So the margin of error for detecting and dealing with bad data is vanishing. You must catch the bad data as it comes in – you no longer have the luxury of addressing it in the data store, as time is not your friend. Another way of saying this is that the risks around using big data have shifted from data at rest to data in motion. So you need to focus on the tooling and systems to manage your data flows in as robust a fashion as you manage your storage/compute environments.
How difficult is it to detect a problem with data quality? What did the survey reveal regarding companies’ ability to detect bad data?
Pancha: I think the reason the survey respondents were so negative around bad data is that they are already fighting this fire today. In general, they know they must be shipping bad data because they spend so much time addressing problems that pop up on a continuous basis. It’s the tip of the iceberg as it were; despite their efforts, they know there is bad data that gets through.
Given the prevalence of bad data, it’s not surprising that detecting changes to data while it is in motion is the place where enterprises felt weakest, and where there was the biggest gap between current capabilities and preferred state. From the survey, we learned that only 34 percent graded themselves as “good” or “excellent” at detecting data divergence; but more than twice that percentage (69 percent) considered detecting divergent data to be a valuable capability.
How effective is data cleansing at removing bad data from data stores? Is this a reliable method to ensure the quality of the stored data?
Pancha: The survey pointed to the fact that enterprises are attempting data-cleansing operations in multiple places throughout the data life cycle: upon ingest, in the store, and as part of the analysis process.
This indicates that there is a great deal of duplication of effort because it may be that the same source data needs to be flowed to multiple data stores and consuming applications, and these flows will evolve over time in unexpected ways. Of course, in a real-time world these have to collapse into a single effort because analysis is occurring so close in time to the initial ingestion.
For these reasons, we believe that data quality should be addressed as close to the point of data acquisition as possible so you can solve the majority of issues once and as early as possible. This will not relieve the need for some further cleansing downstream, as there may be application-specific normalization that needs to occur, but it should remove basic sanitization of the data from the downstream workload.
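The approach of handling basic sanitization once, at the point of acquisition, can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not a description of any particular tool: the `ingest` function, the rule names, and the record fields are hypothetical, and a real pipeline would process a stream rather than a list.

```python
from collections import Counter
from typing import Any, Callable, Iterable

Record = dict[str, Any]
Rule = Callable[[Record], bool]


def ingest(
    records: Iterable[Record], rules: dict[str, Rule]
) -> tuple[list[Record], list[Record], Counter]:
    """Apply every rule to each record as it arrives.

    Records that pass all rules go downstream; the rest are quarantined.
    The counter records how often each rule failed, a simple data-quality
    measure collected at ingest.
    """
    clean: list[Record] = []
    quarantined: list[Record] = []
    failures: Counter = Counter()
    for record in records:
        failed = [name for name, rule in rules.items() if not rule(record)]
        if failed:
            failures.update(failed)
            quarantined.append(record)
        else:
            clean.append(record)
    return clean, quarantined, failures


# Hypothetical rules for a temperature sensor feed.
EXAMPLE_RULES: dict[str, Rule] = {
    "has_device_id": lambda r: bool(r.get("device_id")),
    "temp_in_range": lambda r: isinstance(r.get("temp_c"), (int, float))
    and -50 <= r["temp_c"] <= 150,
}
```

Application-specific normalization would still happen downstream; the point of the sketch is only that the shared, basic checks run once, before the data fans out to multiple stores.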
What are some other data management challenges that the study revealed that organizations are dealing with?
Pancha: We also asked data management professionals about their challenges with big data flows on a broad level and found that the most-cited issue was ensuring the quality of the data in terms of accuracy, completeness, and consistency, getting votes from over two-thirds of respondents. Security and operational management of data flows also were listed by more than half.
In terms of operational challenges, organizations have trouble getting visibility across their entire data systems holistically. Many cite the need for a centralized control center and bemoan the fact they don’t have any sort of tool that provides that bird’s-eye view of their data flows. Finding the tools to do that, implementing them, and making sure they can derive value from today’s data volumes is difficult.
What can enterprises do to meet these widespread data management challenges? Is this a technology issue? A talent issue? Both?
Pancha: First, it’s an issue of focus. We need a mindset shift toward operational management of the entire data life cycle, not just data at rest. Currently, data flows are managed in an ad hoc manner. We advocate setting up a center of excellence for your data flows that extends the focus from data at rest to the entire life cycle. As enterprises attempt to harness bigger and faster data, it’s imperative that we build continuous data-operations capabilities that are in tune with the time-sensitive, dynamic nature of today’s data.
Before big data, most of your risk was in ensuring proper operation of the data warehouse, a storage/compute issue. The initial years of the big data era have been similar, focused on standing up Hadoop clusters. So we have had a serious focus on data at rest and have assumed that the data would just arrive reliably and with quality. But the action and the risk are moving to data in motion, because getting the data to the store is complex and becoming more so. And when you talk about real-time data and streaming analytics, you are inherently talking about a continuous operation that must be actively managed.
This shift in mindset has three legs: people, process, and technology. As for the first two legs, a center of excellence for data flows would be staffed by people who, in many cases, are already in the enterprise, perhaps with a shift in their mandate toward the data flow problem. The talent is often already in the organization; it’s more a matter of proper focus.
With regard to technology, executing an operational approach to data movement requires different tools that focus on KPIs across the path of the data flow in addition to the performance of each component in the path. So there is a technology component to this.
What is the key takeaway from the study?
Pancha: In short, enterprises are struggling to manage their big data flows, and a key consequence of this struggle is that bad data pollutes the data stores that feed consuming applications. This results in reduced efficiency, slower time to insight and, ultimately, poor business decisions. Enterprises are aware of the issues and want to improve their performance. They desire new systems and tools to help them.
What tips or advice can you offer to organizations to help them ensure that their data stores are clean, up to date, and reliable?
Pancha: First is to recognize that creating continuous clean data is an operational process, not a periodic activity. This fact has implications for how you manage the responsibility. The biggest tip we can offer is to work on shifting your mindset to a performance-management point of view and setting up a center of excellence around data flows.
Second would be to engage with business users to understand their quality requirements in terms of both accuracy and timeliness. Some applications that act on gross signals can survive some bad data, and users performing only periodic analysis have more time for cleaning. The specific requirements of each business process will inform the SLA parameters for each data flow and guide investment in technologies that let you set, detect, and enforce these rules.
Third would be to adopt tools purpose-built for big data, as that is where the data-quality risk is highest due to data drift. The right tools will collect data-quality measures upon ingest and even automatically deal with some data-drift cases.
Overall, companies need a new organizational discipline around performance management of data flows, with the goal of ensuring that next-generation applications are fed quality data continuously.