How to Convert Piles of Legacy Content into Valuable Digital Assets

January 10, 2014 6:44 am

Mark Gross of Data Conversion Laboratory

Organizations collect enormous volumes of content. The stream of forms and documents can be endless: publication archives, training materials, proposals, repair manuals, photographs and other images. The problem is that it’s difficult to manage these papers, scanned images, word processing documents and other content, and challenging to create value from them. This is the epitome of unstructured data, with more coming in every day.

Making this content valuable is a process of first making it findable, both to you and to the world, and then making it modular, with components that you can reorganize and reuse to meet new purposes.

There are three steps to developing a strategy to make content manageable and reusable. First, identify what’s special about your content and assess how to make it useful. Second, determine which pieces of content warrant converting into digital formats for reuse. Third, consider a forward-looking content management strategy, so that you keep your flexibility as technologies for using content and media evolve. After all, many technologies we take for granted today, like smartphones and tablets, weren’t even on the horizon just five to ten years ago, and we don’t know what’s next.

This article reviews the steps you can take to consider data formats and digitization techniques to match your content management needs. Content management shares important themes whether you are a professional society with decades of research documents or a government agency trying to sift through incoming waves of data.

Choosing Formats That Will Work for the Long Term
When digitizing content there is a continuum of possibilities ranging from simple images, to metadata, to text, PDF, and on to more sophisticated formats like EPUB and XML; each step provides incremental opportunities to add structure and value to your content. For example:

• Digitizing images. If your content is not digital, the first step is simply to scan all of it. This approach gives you a digital record of the collection but will not support search. You need to maintain indexes and other records in order to find what you need without having to “thumb” through it, but it’s a valuable first step towards preserving content. It is also inexpensive, and if done well the process will support future steps to make the content searchable.

• Extracting searchable text associated with content. The next step is to obtain a text version of the contents, which allows the materials to be searched. For scanned images, this is usually done with Optical Character Recognition (OCR) software or, when an application requires greater accuracy, with a combination of automated enhancement techniques and some level of human review.

Searchable text can be extracted directly from most electronic files; some electronic files, however, are really image formats, and for those the OCR process would be used.

• Metadata (searchable information about the content) can be either an alternative or a supplement to full text, depending on the nature of your content, and it helps make content findable. Examples of metadata include abstracts, descriptions, keywords, sources, captions, and other information that is extracted either from the page itself or from catalogs and other sources.

• Content in the Portable Document Format (PDF), a format developed by Adobe and published as an open ISO standard (ISO 32000) since 2008, is almost universally readable, and many publishing systems can produce it easily. The format replicates the look of a printed page, and when it comes out of a publishing system it is normally full-text and searchable. However, it is not easily reflowable, which means it retains its original page layout when displayed on e-readers, smartphones, and other devices. This makes the format harder to use on small screens. More importantly, it is not easy to take PDF content apart to make it modular, which limits the content’s reuse potential.

• Moving content into XML, HTML, EPUB and other reflowable formats gives you maximum flexibility. There are a number of choices here. Extensible Markup Language (XML) is a standardized structured format that lets you incorporate text, images, video and all kinds of other data. It also lets you add structure and “non-content” information that might be needed to reformat content for e-books, the web, and the multitude of devices people use today (and in the future). XML also lets you modularize your content so that you can reassemble the components for new products and new purposes. XML is a family rather than a single format: numerous vocabularies have been developed for niche purposes and industries. For example, NLM is a journal publishing format used for medical and other scientific research. HTML is a related markup language for displaying web pages (its XHTML variant is XML-based), and the EPUB format, which packages XHTML content, is designed to support e-books.
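To make the idea of modular, reusable content concrete, here is a minimal Python sketch using only the standard library. The element names (manual, module, heading, para), the keywords attribute, and the sample text are all hypothetical, invented for illustration; they do not come from NLM, EPUB, or any other real schema. The sketch finds modules by a metadata keyword and reassembles copies of them into a new document.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical source document. Element and attribute names are
# illustrative only and do not follow any real publishing schema.
SOURCE = """
<manual title="Pump Repair Manual">
  <module id="m1" keywords="safety">
    <heading>Safety precautions</heading>
    <para>Disconnect power before servicing.</para>
  </module>
  <module id="m2" keywords="maintenance seals">
    <heading>Replacing seals</heading>
    <para>Remove the housing and inspect the seal.</para>
  </module>
  <module id="m3" keywords="safety maintenance">
    <heading>Lockout checklist</heading>
    <para>Tag the breaker and verify zero energy.</para>
  </module>
</manual>
"""


def select_modules(root, keyword):
    """Return the modules whose keywords attribute lists `keyword`."""
    return [m for m in root.iter("module")
            if keyword in m.get("keywords", "").split()]


def assemble(title, modules):
    """Build a new document from copies of existing modules."""
    doc = ET.Element("collection", {"title": title})
    for module in modules:
        # Copy so the original manual is left untouched.
        doc.append(copy.deepcopy(module))
    return doc


root = ET.fromstring(SOURCE)
safety_guide = assemble("Safety Guide", select_modules(root, "safety"))
print(ET.tostring(safety_guide, encoding="unicode"))
```

The same modules could be selected again with a different keyword to build a different product, which is the kind of reuse a page-oriented format makes difficult.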

Three Use Case Examples
To put this in context, I’m sharing the following examples from my experience with three very different organizations, each with a different understanding of what’s special about their content, and what they chose to do in practice:

Creating a new resource from archival materials.
Leaders at a technical society, which has published high-quality research papers since 1917, felt that these materials were still being used for reference, and that the older papers would be used more if the society improved access to them. In particular, they thought the formulas and images in the archives would be of special importance, and they wanted maximum flexibility. While not sure of all the potential future uses of their content, they chose to go back to the beginning of their publications and converted 750,000 pages to XML. They used the NLM version of XML, which is designed for scientific publications, and used MathML for their formulas, a specialized XML format that would allow them to automatically recompose formulas for different kinds of devices. Further, by tagging the images, they were able to build an enormous image databank, a totally new resource that is expected to be of great research value within that community. Future derivative products will allow them to recombine articles into special topical collections, as well as collections based on historical periods and the work of notable scientists.

Turning piles of charts and photos into assets.
Decision-makers at a mountaineering society, which maintains one of the world’s largest collections of mountaineering charts and photos, felt that what was special about their data was the very fine quality of the photographs and maps, the extensive metadata about geographic locations, and a collection of reports documenting climbing accidents going back a century. To keep within their budget they chose to focus the first phase on very high-quality imaging to bring out the details of their photos and maps, and to use OCR to make the text searchable, but to defer any further conversion to a later phase. A possible next phase could build up the metadata of their collection by geo-tagging, a move which would allow them to connect all these pictures and maps to their locations, making them valuable to mountaineers, travelers, geographers, and researchers.
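As a rough illustration of what geo-tagged metadata makes possible, the following Python sketch finds all photos within a given distance of a point. The Photo record, its field names, and the sample entries are hypothetical and are not taken from the society’s collection; the coordinates are approximate. Distance is computed with the standard haversine formula.

```python
import math
from dataclasses import dataclass


@dataclass
class Photo:
    # Hypothetical metadata record; field names are illustrative.
    filename: str
    caption: str
    lat: float  # latitude in decimal degrees
    lon: float  # longitude in decimal degrees


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def photos_near(photos, lat, lon, radius_km):
    """Return the photos tagged within `radius_km` of the given point."""
    return [p for p in photos
            if haversine_km(p.lat, p.lon, lat, lon) <= radius_km]


# Made-up sample records with approximate summit coordinates.
photos = [
    Photo("a.tif", "Mont Blanc summit ridge", 45.8326, 6.8652),
    Photo("b.tif", "Matterhorn from the east", 45.9763, 7.6586),
    Photo("c.tif", "Everest from base camp", 27.9881, 86.9250),
]

nearby = photos_near(photos, 45.8326, 6.8652, radius_km=100)
print([p.filename for p in nearby])
```

A production system would more likely rely on a spatial index or a geographic database, but the principle is the same: once each item carries coordinates, location becomes a search key.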

Making a document deluge searchable.
A government agency was faced with mountains of documents coming in from individuals and law firms at the rate of 100,000 pages every day. They were all being converted to images on the way in and stored in a content management system. This allowed the agency to maintain good-quality paperless records, but with no text behind the images there was no way to search through these millions of pages other than by hand. The manual effort consumed tens of thousands of hours and created major workflow problems. Because the documents were complex, with images, tables, and mathematical formulas breaking up the page, regular OCR did a poor job. An enhanced OCR process was created to automatically redact the non-textual components, run OCR on the remaining text, and recombine the text and the redacted components into XML documents. This automated process lets the agency search through the corpus automatically, which was the original purpose, and also develop quality assurance tools to verify the accuracy of documents and recompose portions of the documents for enhanced research and display.
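The general shape of such a pipeline can be sketched in Python. This is not the agency’s actual process: the Region record, the XML element names, and the fake_ocr stand-in are all assumptions made for illustration, and a real system would plug in a layout-analysis step and an OCR engine where the stubs are. The sketch sets non-textual regions aside as references, runs OCR on the text regions only, and recombines both into one XML page that can then be searched.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    # Hypothetical output of a layout-analysis step (not shown here).
    kind: str     # "text", "image", "table", or "formula"
    payload: str  # text regions: crop identifier; others: stored file name


def page_to_xml(number: int, regions: List[Region],
                ocr: Callable[[str], str]) -> ET.Element:
    """OCR the text regions and keep the others as references, in order."""
    page = ET.Element("page", {"number": str(number)})
    for region in regions:
        if region.kind == "text":
            ET.SubElement(page, "text").text = ocr(region.payload)
        else:
            # Non-textual component: store a pointer, skip OCR.
            ET.SubElement(page, region.kind, {"src": region.payload})
    return page


def search(pages, term):
    """Return the numbers of the pages whose OCR text contains `term`."""
    needle = term.lower()
    return [p.get("number") for p in pages
            if any(needle in (t.text or "").lower() for t in p.iter("text"))]


def fake_ocr(crop_id: str) -> str:
    # Stand-in for a real OCR engine, used only to make the sketch runnable.
    return f"recognized text for {crop_id}"


regions = [
    Region("text", "crop-1"),
    Region("table", "table-1.png"),
    Region("text", "crop-2"),
]
page = page_to_xml(1, regions, fake_ocr)
print(ET.tostring(page, encoding="unicode"))
print(search([page], "crop-2"))
```

Passing the OCR function in as a parameter keeps the recombination logic independent of whichever engine is used.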

Three Essential Content Strategy Steps
The three examples above illustrate the different approaches one can take for different types of content and different kinds of problems. The keys are to understand what is important and unique about your content, and to understand what will make a difference in your organization and for your customers.

Three essential steps in developing your legacy content strategy include:

1. Assess the inventory. Until they start looking, most organizations don’t really know what they have, both in quantity and in relative importance. Assessing these factors before converting content to digital formats reveals how big the effort will be. Count pages, books, or files, whichever measure fits your content.

2. Establish priorities for the content. Some materials are needed just to maintain a record, for which images are fine, while other materials will be republished or used to generate e-books, and there may be opportunities to develop other derivative products, some of which haven’t even been imagined yet. Some materials will not be needed at all. Since there are different costs associated with the various output formats described above, it is useful to review your inventory with an eye towards both which items should be prioritized over time, and what conversion formats are appropriate.

3. Determine a path for growth. You don’t have to do a digital conversion of legacy content all at once. If you capture information thoughtfully you can upgrade over time as your needs develop and become more apparent. For example, if you capture images carefully, you can perform OCR at a later time to build a searchable database, and you can turn the content into XML for publishing at an even later stage. If you convert to XML, you can later use it to automatically create e-book and Web content. On the other hand, some formats are a dead end and will not give you the flexibility to move forward.
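For content that already exists as files, the inventory in step 1 can start with a simple count. The Python sketch below walks a directory tree and tallies files and bytes by content class. The mapping from file extension to class is an assumption made for illustration and would need adjusting to your own holdings; paper collections would still have to be counted by hand.

```python
from collections import Counter
from pathlib import Path

# Hypothetical mapping from file extension to a rough content class.
CATEGORIES = {
    ".tif": "page image", ".tiff": "page image", ".jpg": "page image",
    ".pdf": "pdf",
    ".doc": "word processing", ".docx": "word processing",
    ".xml": "structured", ".html": "structured",
}


def inventory(root):
    """Count files and total bytes under `root`, grouped by content class."""
    counts, sizes = Counter(), Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            category = CATEGORIES.get(path.suffix.lower(), "other")
            counts[category] += 1
            sizes[category] += path.stat().st_size
    return counts, sizes


def report(counts, sizes):
    """Format the inventory as one line per content class, largest first."""
    return [f"{category}: {n} files, {sizes[category] / 1e6:.1f} MB"
            for category, n in counts.most_common()]
```

Calling inventory on the top-level folder of an archive and printing the lines from report gives a first estimate of how much material falls into each class, which feeds directly into the prioritization in step 2.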

It all starts with recognizing what you have and how valuable it is. We are often too busy with our day-to-day activities to take the time to evaluate the hidden assets that most of our organizations have. New technology allows us to dig through our content assets more effectively and to make effective use of them. Technology properly applied allows you to find new value in your legacy content, which in turn lets you create new revenue streams, reduce risk, and promulgate information in ways never before possible.

Mark Gross, founder, president and CEO of Data Conversion Laboratory, is a recognized authority on XML implementation and document conversion. Mark has a B.S. in Engineering from Columbia University and an MBA from New York University. He has taught at the New York University Graduate School of Business, the New School, and Pace University.


