Imagine, for a moment, that the current presidential campaigns did not have access to demographic or social details of the United States: no knowledge of voters’ racial and ethnic backgrounds, no way to target campaigning by education or income level, no statistics on gun ownership, no stratification by age, not even a breakdown by gender.
The outcome would be chaos. How would you structure a campaign? How would the candidates know what’s important to the voters, how to position themselves, or what to focus on in a particular state or town? How could anyone predict election results?
Without such references, how do we apply context to any conversation?
Nouns like race, ethnicity, education, income, age, and gender are important because they provide context to data, without which the data itself would hold no value. These nouns lie at the heart of all data-based planning and big data analytics, just as they lie at the heart of the current presidential campaign. Collectively, these nouns embody a seemingly humble but extremely powerful concept called reference data. In a nutshell, reference data comprises relatively small (even tiny), discrete, controlled data sets that are used to qualify or stratify raw data.
Information analysts and scientists are well aware of the importance of reference data. Every year, the Association of American Medical Colleges holds a conference focused on information technology in medical education and research. The event brings together leading academics and practitioners from medical centers across North America, and it’s always interesting and inspiring to see their forays into the frontiers of medical IT. This year, I had the opportunity, along with a wonderful colleague, to present on reference data management and ontologies, but it turns out that we were hardly the only ones thinking about reference data. Several presenters spoke of the necessity to corral enterprise taxonomies, ontologies, and controlled vocabularies to facilitate data integration, analytics and – above all – data governance.
The conference brought back memories.
Way back in 2004, I wrote a review of a product called Razza Dimension Server. Razza offered a methodical approach to managing reference data. The product was novel and intriguing, a pioneer of sorts in the emerging world of data governance solutions. It was clear even then that I was on to something fundamentally important. “In the fairyland of data,” I wrote in that article, “reference data is the imp: hard to control and constantly creating mischief.”
Some things don’t change. Fast forward a dozen years, and the product has withstood the technological ravages of time – it is thriving in the form of Oracle/Hyperion Data Relationship Manager, albeit mostly in the field of finance (e.g., to maintain the chart of accounts).
Meanwhile, enterprise vocabularies and ontologies have become more complex and deeply entrenched in the world of data. There is a growing awareness of the role of reference data and vocabularies as the “data harmonizer” in data and systems integration, reporting and analytics, and data science. Integrating disparate data sources – for example, when building a master data hub or putting together an analytic data mart – requires not only conformed data syntax (e.g., data types) but also semantics. Machine learning and natural language processing depend on data rationalization enriched by domain-specific vocabularies. Compliance reporting requires data to be aligned to common definitions. A classic example is that of gender, which at first glance may appear simple, but is a deceptively complex entity. Different systems store gender differently – “male” can be represented as “M” in one system, “male” in another, and the number zero in a third, and integrating these sources (say, different customer repositories) requires that all gender values be rationalized. In addition, the concept of gender grows more complex as we understand, accept, and process the increasing nuances of a noun that we once thought of as basically binary in value.
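To make the gender example concrete, here is a minimal sketch in Python of how such a rationalization might look. It assumes three hypothetical source systems (`crm`, `billing`, and `legacy`) and a hypothetical canonical code set. Only the three representations of "male" come from the example above; the female codes, the system names, and the `unknown` fallback are illustrative assumptions, not a description of any particular product.

```python
from typing import Dict, Hashable

# Hypothetical canonical reference data set: small, discrete, and controlled.
CANONICAL_GENDER = {"male", "female", "unknown"}

# Hypothetical crosswalk: one mapping per source system, from that system's
# local code to the canonical value. Only the "male" codes come from the
# article; everything else is assumed for illustration.
GENDER_CROSSWALK: Dict[str, Dict[Hashable, str]] = {
    "crm": {"M": "male", "F": "female"},
    "billing": {"male": "male", "female": "female"},
    "legacy": {0: "male", 1: "female"},
}


def rationalize_gender(source_system: str, raw_value: Hashable) -> str:
    """Map a source-specific gender code to the canonical value.

    Unmapped systems or codes fall back to "unknown" rather than guessing,
    so that gaps in the crosswalk stay visible to data stewards.
    """
    mapping = GENDER_CROSSWALK.get(source_system, {})
    # Trim stray whitespace on string codes; leave numeric codes untouched.
    key = raw_value.strip() if isinstance(raw_value, str) else raw_value
    canonical = mapping.get(key, "unknown")
    # Guard against a crosswalk entry that points outside the controlled set.
    return canonical if canonical in CANONICAL_GENDER else "unknown"


if __name__ == "__main__":
    records = [("crm", "M"), ("billing", "male"), ("legacy", 0), ("crm", "X")]
    for system, value in records:
        print(system, repr(value), "->", rationalize_gender(system, value))
```

Keeping the crosswalk as data rather than logic is the point of the sketch: as the set of accepted values grows, the controlled list and the mappings can change without touching the integration code.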
It is all the more intriguing, then, that even a dozen years after products like Razza hit the market, this wide awareness of the need to govern enterprise reference data is not matched by a corresponding appetite for reference data solutions or by clearly defined approaches to governing reference data. Furthermore, reference data management skills are in short supply (and equally poorly defined), compounding the problem.
In part, we can attribute that to a lack of compelling alternatives in the marketplace. Large companies in the data management space such as IBM, Informatica, Microsoft, Oracle, and SAP focus on big-ticket areas and emerging big-market opportunities – for example, databases, data integration, in-memory data management, big data. Reference data management is generally just too much of a niche to attract their sustained attention, and any capabilities they have in this area (e.g., Oracle DRM) seem almost incidental and receive limited management focus.
That leaves the market open to niche players, new entrants, and innovators, such as Health Languages, Collibra, and Diaku. Although their market reach is limited, these innovators are making increasing inroads into business mindsets and IT budgets.
A second factor is the lack of centralization of the enterprise data management function, including reference data management. Without one central team with sufficiently empowered leadership to define the vision and lead the charge for data management across the enterprise, the function is effectively splintered across corporate divisions, locations, and teams. As a result, awareness and acceptance both suffer. At the organizational level, this results in process inefficiencies and inadequate data quality.
As the perception of data as a valuable asset spreads, we can hope for an (at least partly) commensurate awareness of the need to manage data more centrally. Organizations also would benefit from creating the position of Chief Data Officer, not just in title but with authority and accountability.
For any organization looking to derive deep and sustained value from data, it would be foolhardy to ignore the need to govern reference data.
Rajan Chandras is director of data architecture and strategy at a leading academic medical center in the northeast. He is a prolific contributor to well-known industry publications and has presented at industry and research conferences. He writes for himself and not for or on behalf of his employer. You can reach him at firstname.lastname@example.org.