How Data Profiling Improves Management of Storage Media, Risks, Costs

September 9, 2013

Jim McGann of Index Engines

Storage in the average data center grows 40 to 60 percent per year, reports leading analyst firm IDC, forcing IT departments to increase capacity and deal with clogged networks.

Now IT professionals can reclaim valuable storage capacity and control costs within their data center by leveraging an emerging technology called data profiling.

Data profiling software, reminiscent of the information lifecycle management software that made headlines last decade, processes all forms of unstructured files and document types. After pulling the data from every email and file on a server or tape set, it creates a searchable metadata index, roughly one percent the size of the original data, recording what exists, where it is located, who owns it, when it was last accessed and, optionally, what key terms it contains.


High-level summary reports of the automatically extracted metadata then allow instant insight into enterprise storage, providing knowledge of data assets at a level of detail not previously available. Through this process, mystery data can be classified and managed, including content that has outlived its business value and files abandoned on the network by ex-employees.

The process relies on an enterprise class index of metadata from user files and email databases such as last modified or accessed time, number of duplicates, size, owner, location, file type and more. Indexing occurs at unprecedented speed and efficiency, tackling environments that measure data in petabytes. Integration with Microsoft’s Active Directory allows added intelligence to make decisions about active and inactive users and more.
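As a rough illustration, the kind of per-file metadata such an index gathers can be sketched in a few lines of Python. This is a toy walk over a directory tree, not any vendor's actual implementation; the field names and the SHA-256 duplicate check are illustrative assumptions:

```python
import hashlib
import os
from collections import defaultdict

def profile_tree(root):
    """Walk a directory tree and build a lightweight metadata index.

    A toy sketch of the kind of metadata a profiler collects; real
    products index at petabyte scale and also parse email databases.
    """
    index = []
    by_hash = defaultdict(list)  # content hash -> paths, for duplicate counts
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip unreadable files
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            index.append({
                "path": path,
                "size": st.st_size,
                "owner_uid": st.st_uid,  # resolve to a user via Active Directory
                "last_accessed": st.st_atime,
                "last_modified": st.st_mtime,
                "file_type": os.path.splitext(name)[1].lower(),
                "sha256": digest,
            })
            by_hash[digest].append(path)
    for record in index:  # annotate each file with its duplicate count
        record["duplicates"] = len(by_hash[record["sha256"]]) - 1
    return index
```

Summary reports of the kind described above are then simple aggregations over this index (totals by owner, by age bucket, by file type).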

The profile, when put to use, delivers comprehensive knowledge of unstructured files and email so decisions can be made on the data’s disposition, mystery data can be exposed and retention policies can be formed.

This technique also has a risk management benefit. Data profiling also can look beyond metadata for compliance assurance or breach mitigation by going deep within documents and email for keyword searches or even personally identifiable information audits for sensitive content such as Social Security or credit card numbers.
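A toy version of such a personally identifiable information audit might scan document text with regular expressions; the patterns below are deliberately simplified assumptions (real products recognize many more formats), with a Luhn checksum to cut false positives on card-like digit runs:

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_ok(number):
    """Luhn checksum: filters digit runs that cannot be valid card numbers."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = sum(digits[0::2]) + sum(sum(divmod(2 * d, 10)) for d in digits[1::2])
    return total % 10 == 0

def scan_for_pii(text):
    """Return potential Social Security and credit card numbers in a document."""
    return {
        "ssn": SSN_RE.findall(text),
        "card": [c for c in CARD_RE.findall(text) if luhn_ok(c)],
    }
```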

Using this technology, the three biggest consumers of wasted storage capacity and IT budget (user-shared content, network storage and backup tapes) can be brought under control. In all three cases, data can be cleaned up and moved to proper archives or less expensive storage, or purged from the system entirely.

User Shares and Network Storage
Unstructured user data has reached massive volumes in the data center. Terabytes, even petabytes, of end-user files are stored on primary storage without any intelligence or knowledge of the content.

Large portions of this data typically consist of files such as ex-employee data that has not been accessed in years and has no business value, or large personal multimedia files that users have shared on these servers and that are taking up valuable space.

Aged files and those owned by ex-employees that have long outlived their usefulness can easily be wiped from the network. Continuing to add storage capacity to accommodate this out-of-control growth is a massive and needless waste of IT budget.

As most data ages, it naturally loses its business value. The average organization can purge duplicates, personal media files like iTunes and aged data with no impact on the business.

Data profiling also finds all of the movie and iTunes files downloaded to and backed up on the network, identifies who owns them and enables this unnecessary content to be removed. Done at enterprise scale, the elimination of unneeded content reduces the growth of storage, and the costs associated with that growth.
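The purge policy described above (duplicates, personal media, aged files, ex-employee data) can be sketched as a filter over a profiler's metadata index. The field names and thresholds here are illustrative assumptions, not a product API:

```python
import time

MEDIA_TYPES = {".mp3", ".mp4", ".m4a", ".mov", ".avi"}  # personal media
FIVE_YEARS = 5 * 365 * 24 * 3600                        # aging threshold, seconds

def purge_candidates(index, ex_employees, now=None):
    """Yield index records matching the purge policy: duplicate content,
    personal media files, data not accessed in five years, and files
    owned by ex-employees. `index` is a list of metadata dicts."""
    now = now or time.time()
    for record in index:
        if (record.get("duplicates", 0) > 0
                or record.get("file_type") in MEDIA_TYPES
                or now - record.get("last_accessed", now) > FIVE_YEARS
                or record.get("owner") in ex_employees):
            yield record
```

In practice each selected record would be routed to archive, cheaper storage or deletion rather than simply listed.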

Backup Tapes
Many organizations have accumulated thousands, even hundreds of thousands of backup tapes through the normal course of business.

Mergers, acquisitions and consolidations have accelerated the growth of these repositories, causing offsite storage budgets to reach outrageous levels. These offsite storage fees are paid despite the fact that the contents of the tapes are unknown due to lost records, or are inaccessible because the platforms they were created with are no longer in use.

Backup tapes can be safely remediated once relevant content, including anything required for compliance or eDiscovery purposes, is extracted and preserved.

Using data profiling, unstructured legacy backup tapes can be efficiently processed and indexed, and individual files and emails can be extracted based on policy. Data profiling technology eliminates the need for the original backup environment, extracting a metadata index of the contents that can be searched and reported on. And the reduced size of these tape libraries can put a significant dent in a company's offsite storage budget.

An Illustrative Example
Through this data profiling process, mystery and legacy data that was once difficult to understand can be decoded and managed. Unstructured files and emails that once presented a compliance risk and a storage nightmare can be reviewed, and a company can decide their proper disposition. Some will move to archive, others to less expensive storage, and a shocking volume of duplicates, personal files and aged data can be purged.

One data center manager recently reported to my company that he used data profiling to analyze a 100-terabyte user share and found that only 45 terabytes were unique data; the other 55 terabytes were duplicate content that had built up over time. Of those 45 terabytes, only 20 terabytes had current business value, and the bulk of the data had not been accessed in more than five years.

Data profiling software indexed the environment, delivered file-level insight into his data and enabled the data center manager to perform a quick analysis of a server due for an upgrade, showing that only 20 percent of the data was actually of value and the bulk could be purged. Without profiling, the purged information would have been stored indefinitely, wasting storage capacity and budget.
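The arithmetic behind this example is worth making explicit, using the figures the manager reported:

```python
# Figures reported for the 100 TB user share
total_tb = 100
unique_tb = 45     # data remaining after deduplication
valuable_tb = 20   # data with current business value

duplicate_tb = total_tb - unique_tb      # 55 TB of duplicate content
reclaimable_tb = total_tb - valuable_tb  # 80 TB can be purged or tiered off
print(f"{reclaimable_tb / total_tb:.0%} of the share held no current value")
# prints "80% of the share held no current value"
```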

Jim McGann is vice president of Index Engines, a provider of enterprise information management and archiving products. Prior to joining Index Engines in 2004, McGann worked for software firms including Information Builders and the French-based engineering software provider Dassault Systemes. He can be reached at

