There has always been an uneasy truce within large organizations between those who control access to data – the IT group, usually – and those who need that data to improve business performance. In a perfect world, the IT group would like to see a single source of truth manifested in master data management (MDM) and the enterprise data warehouse (EDW).
Let’s consider MDM. A paper by Wilbram Hazejager of DataFlux Corporation (acquired by SAS in 2000) notes that MDM’s origins go back to the early 2000s. Its proponents did – and still do – see MDM as the way to solve the problem of disparate, disjointed data spread across different lines of business.
Nevertheless, according to Gartner, the majority of MDM initiatives fail. There can be many reasons for this. But one reason is simple: To succeed, MDM demands strict adherence to data-governance policies by everyone in the enterprise, all the time. That’s not very realistic.
But the effort to implement MDM, even if only partially realized, reinforces IT’s role as the gatekeeper of enterprise data. Rapidly growing supplies of data make it all the more difficult to streamline the data supply chain that delivers raw material for analysis to business users. That leaves IT in the unenviable position of trying to deliver more data sets, faster, while the rest of the enterprise yearns for data democracy.
The Enterprise Data Warehouse
Along with MDM, the enterprise data warehouse also represents a legacy approach to handling critical business data. Large and expensive to maintain, the typical EDW fulfills a narrow, often application-specific purpose. Moreover, data architects must use extract, transform, and load (ETL) tools to add data to an EDW, which consumes substantial time and money. Simply adding a new column of data to an EDW could take months.
Shadow IT, Hadoop, and EDWs
The desire to escape IT’s grip on enterprise data has pushed many a business unit into the netherworld of shadow IT, a term often used to describe information-technology systems and solutions built and used inside organizations without explicit organizational approval. These solutions often leverage the cloud. It doesn’t take much to deploy a Hadoop cluster in the cloud and start filling it with data, more or less on the sly.
This is not to say that most corporate deployments of Hadoop are “off the books.” They are not. In fact, getting off the ETL treadmill has been one of Hadoop’s main selling points for large enterprises. Hadoop stack vendors have focused most of their marketing dollars on the notion that organizations can move some of their EDW data into Hadoop. It’s far cheaper and more flexible in terms of hardware and storage.
These vendors talk about EL – extract and load – rather than ETL. Extract the data and load it into Hadoop; transform it only when a particular use case requires it. The popularity of Hadoop as a destination for structured as well as unstructured data has spawned several SQL-on-Hadoop solutions, including Hive (running on MapReduce or Tez), Impala, Spark SQL, and Presto.
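To make the EL idea concrete, here is a minimal sketch in Python. It is an illustration only, not any vendor’s implementation: the records, the field names, and the `revenue_by_region` function are all hypothetical. The raw data is stored exactly as it arrived, and the transformation happens only when a specific question is asked.

```python
import json

# Hypothetical raw events, loaded as-is ("extract and load") with no upfront schema.
raw_store = [
    '{"user": "a17", "amount_cents": 1999, "region": "emea"}',
    '{"user": "b42", "amount_cents": 500, "region": "apac", "coupon": "SPRING"}',
    '{"user": "a17", "amount_cents": 750, "region": "emea"}',
]

def revenue_by_region(lines):
    """Transform at read time: parse and shape only the fields this use case needs."""
    totals = {}
    for line in lines:
        record = json.loads(line)
        region = record["region"].upper()
        totals[region] = totals.get(region, 0) + record["amount_cents"]
    return totals

print(revenue_by_region(raw_store))
```

Note that the extra `coupon` field in the second record causes no trouble. Nothing had to be remodeled before loading, which is the flexibility the EL pitch relies on.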
Yet there’s far too much data in EDWs for any company to consider putting all their EDW data in Hadoop. Moving a billion rows of data from an EDW takes time. It also puts a load on the primary business system that depends on the data warehouse, which can impact operations. Likewise, an EDW database can handle only so many requests before performance degrades; plus, these data migrations hog enormous network bandwidth. In other words, it’s not a trivial exercise.
So organizations have a foot in both worlds. If organizations ever move all their EDW data to Hadoop, it will be a multi-year, possibly a multi-decade process. Most knowledge about customers, transactions, and products still lives in EDWs. Right now, most enterprises use Hadoop to hold large data sets like log files or sensor data, which are massive, multi-format, and don’t conform well to the schema of an EDW. They may not be sure what value this data holds, but they want some place to put it until they figure it out. And if they want to merge that with data from a data warehouse, they’ll move a chunk of the EDW to Hadoop.
Building a Bridge with Virtualization
So on one side we have EDWs, and on the other we have Hadoop with its storage and processing layers. Clearly, organizations need a way to manage data across these systems in an intelligent way that reveals how people use data. The ideal solution would be a single access point where users can see all the data, without moving it. That way, IT admins can make smart choices about what should stay in an EDW and what should move to Hadoop. They can eliminate random expansion and duplication of data by tracking and controlling where data lives without putting up a wall between data and the users who need it.
One way to build this single point of access is through virtualization. Let data remain wherever it is, and when users make a request, federate the query across multiple data sources. In fact, the virtualization vendors believe the future of the data warehouse is simply a virtual view of all data across an organization. They talk a lot about “speeds and feeds,” that is, the speed at which a system can process and ingest data.
But we don’t believe that approach will ever be fast enough for applications that need to provide extremely fast or near-real-time responses. For those scenarios, there’s no substitute for putting the data, regardless of origin, in one place and using the data model that the application demands.
In our view, virtualization is not a way to replace the data warehouse. But it is a tool that can be used to help democratize data and make it accessible to everybody in the organization without having to move it and create more data silos.
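A toy sketch of that federation idea follows. It assumes two in-memory SQLite databases as stand-ins for an EDW and a Hadoop store; the table names, the data, and the `clicks_per_customer` function are hypothetical, and real virtualization products do far more (query planning, pushdown, security). The point is only that one request fans out to both sources and the data stays where it lives.

```python
import sqlite3

def make_source(ddl, insert_sql, rows):
    """Create an in-memory database that stands in for one data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn

# Stand-in for the EDW: curated customer records.
edw = make_source(
    "CREATE TABLE customers (id INTEGER, name TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# Stand-in for Hadoop: raw clickstream counts.
lake = make_source(
    "CREATE TABLE clicks (customer_id INTEGER, clicks INTEGER)",
    "INSERT INTO clicks VALUES (?, ?)",
    [(1, 40), (1, 2), (2, 7)],
)

def clicks_per_customer():
    """Federate one request across both sources and join the results in memory."""
    names = dict(edw.execute("SELECT id, name FROM customers"))
    totals = lake.execute(
        "SELECT customer_id, SUM(clicks) FROM clicks GROUP BY customer_id"
    )
    return {names[cid]: total for cid, total in totals}

print(clicks_per_customer())
```

Every call goes back to both sources and joins the results on the fly, which is also why this pattern struggles to serve applications that need near-real-time responses.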
Data Democratization: The Middle Ground
On its own, virtualization cannot deliver complete data democratization. Virtualization can deliver visibility, but not meaning. That requires a semantic index of all of an organization’s data sources. A semantic index operates in the middle of the data stack between where the data lives (storage) and where it’s analyzed (analytical tools), closing the gap between the top and bottom of the stack.
If you want evidence that the middle layer is where the action is in data democratization, look no further than all the analytic tool vendors that suddenly want to become stack vendors. They are adding all those “old” technologies you would get in a data warehouse – like cleansing, blending, and normalization.
The innovation wars will be won or lost in the middle layer of the data stack. The middle layer holds the key to agile BI. If you are not agile in the middle, you are not agile. You can’t have an agile data supply chain if every time an analyst or business user needs a data set, he or she has to go to IT. It just takes too long.
And, as any IT shop can tell you, the number of requests is increasing by orders of magnitude. There are simply not enough bodies in IT with the necessary skills to fill all the requests for data. And there never will be.
Getting Data to the Business
If you are responsible for managing the data and BI infrastructure in your organization, nothing said here so far comes as a big surprise. You probably can sum up your biggest concerns as follows:
- Govern an increasing volume and variety of data and data sources
- Get data to business users faster while still retaining control for compliance and information governance
- Evaluate the latest technologies for their business value while dealing with an infrastructure that includes new and legacy systems
You know that your team is perceived as a bottleneck between data sources and data consumers. The process of provisioning BI tools is slow and inefficient, but finding a solution has proven elusive.
That’s why data discovery should be at the top of every IT organization’s big data shopping list. The first step in data discovery is building a semantic metadata catalog of all enterprise data sources, including MDM data stores. The catalog remains fully within IT’s control but simplifies self-service data discovery through applications like Tableau.
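As a rough illustration of what such a catalog holds, consider the sketch below. Everything in it is hypothetical (the `CatalogEntry` structure, the data set names, the business terms), and a production catalog would be far richer. The idea is simply that IT tags each data set with business vocabulary, and users search by that vocabulary rather than by system or table name.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    source: str                               # where the data lives, e.g. "EDW" or "Hadoop"
    dataset: str                              # hypothetical data set name
    terms: set = field(default_factory=set)   # business vocabulary attached by IT

CATALOG = [
    CatalogEntry("EDW", "sales.orders", {"revenue", "customer"}),
    CatalogEntry("Hadoop", "logs.web_clicks", {"customer", "regional preferences"}),
    CatalogEntry("MDM", "master.products", {"product", "cost"}),
]

def find_datasets(term):
    """Return (source, dataset) pairs tagged with the given business term."""
    wanted = term.lower()
    return [(e.source, e.dataset) for e in CATALOG if wanted in e.terms]

print(find_datasets("Customer"))
```

A search for “customer” surfaces data sets in both the EDW and Hadoop, without the user needing to know which system holds what.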
When you shop for a sweater on Amazon, you don’t get involved in how it handles the inventory. You just pick the sweater you want. Self-service data discovery brings that experience to data users. The semantic metadata catalog helps them quickly shop and locate the most relevant data they need to answer questions about market share, revenue, cost, customer loyalty, regional preferences, and on and on. That’s how you build a data democracy.
Stephen Baker is CEO of Attivio, the Data Dexterity Company. In leading Attivio, Baker brings more than 15 years of experience as a top executive within the enterprise software industry. Baker holds an MBA from the University of Pennsylvania – The Wharton School as well as a BS in Music and Marketing from Hofstra University.