In this era of big data analytics, nearly everything we do – from the websites we visit to the places we window shop to the movies we watch in our homes – leaves a virtual trail. When it seems your every breath is being recorded (which, for some people, it is) and that it’s impossible to escape the notice of Big Brother, it’s easy to forget that most of the massive volumes of data that are being gathered are never even analyzed.
This collected but unexamined data is commonly known as “dark data.” Gartner defines dark data as “the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes.” IDC estimates that up to 90 percent of unstructured data, for example, is dark data.
But not using that data doesn’t absolve companies of the responsibilities associated with holding the data, or protect them from the risks that can lurk within dark data.
Data Informed spoke with Peter Vescuso, Chief Marketing Officer at VoltDB, about dark data, why companies hold on to it, and how to mitigate the risks that come with dark data.
Data Informed: Why does so much of the data being collected end up as dark data?
Peter Vescuso: Dark data is data that companies collect and store, often in data lakes, with the intention of analyzing it for value and insight at some future point. Often this historical analysis doesn’t take place – in some cases because the data is unclassified or lacks appropriate tags. A form of dark data that often goes untapped is the “perishable insights” contained in live, flowing data. For example, a click event or geolocation from a live mobile user is valuable while the user is on the phone in front of a particular store. If analysis and action don’t happen in the moment, the opportunity to interact with the user is lost forever.
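To make the idea of a perishable insight concrete, here is a minimal Python sketch of the store example above. It is an illustration only, not VoltDB's method or any vendor's API: the `LocationEvent` type, the `should_act` function, and the 30-second and 100-meter thresholds are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class LocationEvent:
    """Hypothetical geolocation event from a live mobile user."""
    user_id: str
    lat: float
    lon: float
    timestamp: float  # seconds since the epoch


def distance_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle (haversine) distance between two points, in meters."""
    earth_radius_m = 6_371_000
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))


def should_act(event: LocationEvent, store_lat: float, store_lon: float,
               now: float, max_age_s: float = 30.0, radius_m: float = 100.0) -> bool:
    """Return True only while the event is both fresh and near the store.

    The thresholds are assumed values for illustration; once the event is
    older than max_age_s, the insight has "perished".
    """
    fresh = 0 <= now - event.timestamp <= max_age_s
    near = distance_m(event.lat, event.lon, store_lat, store_lon) <= radius_m
    return fresh and near
```

Under these assumptions, the same event that triggers an offer ten seconds after it arrives is ignored ten minutes later, which is the sense in which stored but unanalyzed live data goes dark.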
What types/sources of data tend to end up in the dark?
Vescuso: Among the types of data called “dark data” are what EMC and IDC call “transient data” – network routing data, web traffic data, unregulated email communications, and streaming data from TV and entertainment services. “Perishable insights” from live data – location, a trigger event, etc. – go dark quickly.
Why do companies continue to hold on to data that they have not found a use for?
Vescuso: Storage is inexpensive, which has made it easy to store dark data and not think about it. There’s also concern about regulations. Some companies are waiting for better analytic and BI technology, some don’t have big data skill sets, and many organizations are culturally accustomed to saving everything, in the belief that all data has some value.
What types of costs are associated with holding on to dark data?
Vescuso: There are many types of costs associated with holding on to dark data – lost revenue, missed opportunities, IT costs, storage costs, energy costs. Data center power consumption alone is a significant cost – in 2012, the New York Times reported that as much as 90 percent of the energy used by data centers is wasted.
There’s also significant cost in migrating dark data from one storage architecture to another, or to the cloud or between cloud providers.
What are some of the factors preventing companies from putting dark data to use and realizing value from it?
Vescuso: Many companies lack skilled data analysts; this compounds the problems of missing tags and data classification, which companies need in order to explore dark data for high-value data points. More important, however, is that companies fail to adopt a fast-data approach that would prevent dark data in the first place.
How can companies shed some light on dark data and begin putting it to use?
Vescuso: Know what data is important to your business. Increasingly, this is “live” operational data about interactions with customers, or performance of your operations. For many companies, live data represents opportunity that flows past in real time but remains “dark,” or untapped.
What are the risks associated with dark data? What steps can companies take to minimize these risks?
Vescuso: Dark data is certainly risky – just think of email, and the liability stored email represents from a legal perspective. Compliance with regulations governing archiving of email is key in many industries, notably health care and financial services. Email is typically unstructured data, adding to the risk. Companies must develop policies to deal with dark data, but a first step is preventing dark data by performing analytics on incoming streams of data and taking action in real time.
According to Gartner, by 2018, regulatory disclosures related to a failure in the organizational information risk control environment will rise by 50 percent. Gartner also predicted that, through 2018, 90 percent of deployed data lakes will be useless as they are overwhelmed with information assets captured for uncertain use cases.
Does dark data have a shelf life? Is there a point after which the costs and risks of keeping this data outweigh the potential benefits?
Vescuso: Data certainly has a shelf life. Start assessing your risks by evaluating your exposure to regulations governing data retention. From a cost perspective, review energy costs, IT skills required to maintain data, and network, compute, storage, and software costs.
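As a rough illustration of that kind of assessment, the Python sketch below flags datasets that have outlived a retention period and have never been analyzed, ordered by what they cost to keep. Everything in it is an assumption made for the example: the `Dataset` fields, the `retention_days` and `monthly_cost` values, and the rule itself. Actual retention obligations vary by regulation and industry, and some require keeping data rather than discarding it, so the output is a list for review, not for deletion.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Dataset:
    """Hypothetical inventory record for one stored dataset."""
    name: str
    created: date
    last_analyzed: Optional[date]  # None if the data was never analyzed
    retention_days: int            # assumed policy value, not a legal rule
    monthly_cost: float            # assumed storage, energy, and upkeep cost


def past_shelf_life(ds: Dataset, today: date) -> bool:
    """True if the dataset is older than its assumed retention period."""
    return (today - ds.created).days > ds.retention_days


def review_candidates(datasets: List[Dataset], today: date) -> List[str]:
    """Names of never-analyzed datasets past retention, costliest first."""
    flagged = [d for d in datasets
               if d.last_analyzed is None and past_shelf_life(d, today)]
    flagged.sort(key=lambda d: d.monthly_cost, reverse=True)
    return [d.name for d in flagged]
```

Sorting by cost is one possible design choice; ranking by regulatory exposure instead would fit the same structure.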
Scott Etkin is the editor of Data Informed. Email him at Scott.Etkin@wispubs.com. Follow him on Twitter: @Scott_WIS.