Why Media Companies Organize Data with Semantic Search

April 28, 2015 5:30 am   |   1 Comment

Tony Agresta, Managing Director, Ontotext USA

How many times a day do we utter, or hear someone else utter, the phrase, “Google it”? It’s hard to imagine that a phrase so ubiquitous and universally understood has been around for less than two decades. The word “Google” has become synonymous with online search, and that’s because Google yields the most relevant, comprehensive results, quickly. Essentially, it has changed the way we find and interact with content and information.

We have seen the cultural effect Google has had on search and discovery on a broad level, but consider the implications for online media and publishing organizations interested in leveraging these same powerful search and discovery capabilities, a process referred to as dynamic semantic publishing. The results can be transformative, but many companies in the space still struggle with harnessing the technology.

Semantic publishing is dramatically changing the way we consume information. It automates the process of organizing content and deciding what goes where on the web, so that news and media publishers can manage content quickly and accurately, create more of it, and deliver a personalized user experience.

The idea of dynamic semantic publishing can be a difficult concept to grasp because its use is not readily apparent to viewers. Rather, the process is centered on the curation, enrichment, and analysis of text and how it’s organized even before users interact with it.

Semantic publishing includes a number of techniques and tactics, including semantic markup and recommendations. Through these techniques, computers are able to understand the structure, meaning, and context of massive amounts of information in the form of words and phrases.

This can sound a lot like “tagging.” As news organizations and publishers began taking their content online, tagging was the basic process used to categorize information. Basically, when you type a term into a site’s search box, the results returned will contain that word. However, dynamic semantic publishing goes well beyond simple content tagging.

At the heart of this solution are three core semantic technologies: text mining, a semantic database, and a recommendation engine. Text mining is used to analyze content, extract new facts, and generate metadata that enriches the text with links to the knowledge base. The semantic database stores pre-existing knowledge such as thesauri and lists of people, organizations, and geographic information. It also stores the newfound knowledge and the metadata delivered from the text mining process. The recommendation engine delivers personalized contextual results based on behavior, search history, and text that has been interlinked with related information.

The text mining process operates behind the scenes and runs continuously. Sometimes this process is referred to as “semantic annotation.” In essence, it’s a text-processing pipeline. Articles are analyzed, sentences are split up, and entities are identified and classified in real time. The pipeline often uses related facts from other sources that have already been loaded into the semantic database. These linked open data sources help resolve identities that are the same but referred to differently. During the annotation process, relationships between entities are discovered and stored, such as those between people and the places where they work, live, and travel. All of the results, known as semantic triples or RDF statements, are indexed in a high-performance triplestore (graph database engine) for search, analysis, and discovery purposes.
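As a rough illustration of the annotation steps described above, the Python sketch below splits an article into sentences, matches entities against a small in-memory lookup, and emits triples. The gazetteer, the `ex:` predicates, and the “works at” rule are assumptions made up for this example; a production pipeline would rely on trained models and a real triplestore rather than string matching.

```python
import re

# Hypothetical gazetteer standing in for knowledge already loaded
# into the semantic database (names and types are illustrative only).
GAZETTEER = {
    "Tony Agresta": "Person",
    "Ontotext": "Organization",
    "BBC": "Organization",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def annotate(doc_id, text):
    """Return a de-duplicated list of (subject, predicate, object) triples."""
    triples = []

    def add(triple):
        if triple not in triples:
            triples.append(triple)

    for sentence in split_sentences(text):
        # Entity identification: plain substring lookup in the gazetteer.
        found = [name for name in GAZETTEER if name in sentence]
        for name in found:
            add((name, "rdf:type", "ex:" + GAZETTEER[name]))  # classification
            add((doc_id, "ex:mentions", name))                # link to the article
        # Toy relationship rule: a person and an organization in a
        # sentence containing "works at" yields an employment triple.
        if " works at " in sentence:
            people = [n for n in found if GAZETTEER[n] == "Person"]
            orgs = [n for n in found if GAZETTEER[n] == "Organization"]
            for person in people:
                for org in orgs:
                    add((person, "ex:worksAt", org))
    return triples


if __name__ == "__main__":
    text = "Tony Agresta works at Ontotext. The BBC covered the World Cup."
    for triple in annotate("article-1", text):
        print(triple)
```

Each triple links back to the article identifier, which is what lets later search and recommendation steps move from an entity to the documents that mention it.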

This knowledge base is extended with key terms and related concepts, all of which are linked to the original articles or documents. Oftentimes the pipelines encounter entities that require disambiguation. This is crucial to avoid confusion between, for example, Athens, Georgia, and Athens, Greece, or between the Federal Bureau of Investigation and the Federation of British Industries.
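One simple way to picture disambiguation is to compare the words around an ambiguous mention with words associated with each candidate entity. The sketch below does this for the Athens example; the candidate identifiers and context word lists are hypothetical, and real systems typically use richer signals from the knowledge base than a plain word overlap.

```python
import re

# Hypothetical candidates for an ambiguous mention, each with a few
# context words that hint at the intended entity.
CANDIDATES = {
    "Athens": {
        "ex:Athens_Georgia": {"georgia", "university", "bulldogs", "usa"},
        "ex:Athens_Greece": {"greece", "greek", "acropolis", "parthenon"},
    },
}


def disambiguate(mention, context):
    """Pick the candidate whose context words overlap most with the text.

    Returns None when the mention is unknown or no candidate has any
    overlap. Ties go to the candidate listed first.
    """
    tokens = set(re.findall(r"[a-z]+", context.lower()))
    scores = {
        uri: len(words & tokens)
        for uri, words in CANDIDATES.get(mention, {}).items()
    }
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(disambiguate("Athens", "Tourists in Athens climbed to the Acropolis."))
    print(disambiguate("Athens", "The University of Georgia is in Athens."))
```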

The final result is richly described content, all of which is interlinked and stored in the semantic repository.

To ensure the text mining algorithms are kept up to date, editorial feedback is collected, allowing machine learning to automatically adapt and retrain the algorithms. This way, the knowledge base becomes smarter as publishers prepare to deliver on the promise of personally targeted content.

When this data is combined with web visitor profiles and user search history, the recommendation engine takes over. Massive amounts of data are analyzed to determine and suggest the content most likely to be of interest. This blend of profiles, history, structured entity descriptions, classified facts, relationships, and enhanced knowledge means visitors are automatically delivered highly relevant content.
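To make the idea concrete, here is a minimal sketch of entity-based recommendation, assuming each article has already been annotated with a set of entities. The profile is just a count of entities in articles the visitor has read, and unread articles are ranked by how strongly they overlap with it. The article identifiers, entity names, and scoring rule are illustrative assumptions, not a description of any particular product.

```python
from collections import Counter


def build_profile(read_articles, article_entities):
    """Count how often each entity appears in the articles a visitor read."""
    profile = Counter()
    for article in read_articles:
        profile.update(article_entities.get(article, ()))
    return profile


def recommend(profile, article_entities, already_read, top_n=3):
    """Rank unread articles by the summed profile weight of their entities."""
    scored = []
    for article, entities in article_entities.items():
        if article in already_read:
            continue
        score = sum(profile[entity] for entity in entities)
        if score > 0:
            scored.append((score, article))
    # Highest score first; break ties alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [article for _, article in scored[:top_n]]


if __name__ == "__main__":
    # Hypothetical annotated articles.
    articles = {
        "a1": {"BBC", "World Cup"},
        "a2": {"World Cup", "Spain"},
        "a3": {"Athens", "Greece"},
        "a4": {"Spain", "BBC", "World Cup"},
    }
    read = ["a1"]
    profile = build_profile(read, articles)
    print(recommend(profile, articles, already_read=set(read)))
```

The same ranking could be applied to advertisements tagged with entities instead of articles.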

The same approach can be applied to matching readers with highly relevant advertisements. Semantic technology can deliver extremely targeted ads to the right people and profiles, allowing publishers to charge more for ads. This mix of customized content and advertising also drives more clicks and, in turn, more time spent on a site.

Clearly, solutions like this must be able to scale. At the same time that hundreds of queries per second are taking place to serve requests on your website, authors are also enriching new content, which is then committed to the database and available for the next search. As writers create content, they are prompted with related news and facts they can use in authoring. This instant feedback correlates directly with author productivity. If anyone sees something misclassified, they can correct it and commit the change to the underlying triplestore, leading to a smarter knowledge base that can drive recommendations.

As an example, the BBC was looking for a new way to manage its web content during the 2010 World Cup. Content included text, video, images, and data that encompassed 32 teams, eight groups, and 776 individual players. As with many publishing organizations, there was simply too much content and too few journalists to create and manage the site’s content.

The BBC implemented a dynamic semantic publishing framework to accurately deliver large volumes of timely content about the matches, groups, teams, and players without adding to the costly manual work of its editorial staff. The BBC’s experience shows how far-reaching the implications could be if other online media and publishing companies embraced the same approach.

Semantic publishing can drive real business results. News aggregators can categorize and assemble related content faster. Researchers are able to pinpoint exactly what they are searching for at record speeds. Decision makers can accurately assess performance with visibility into the timing and volume of content read. Writers are informed in real time and produce more content. Web site visitors get highly customized and relevant recommendations, and advertisers benefit from increased response.

As digital publishing companies face constant pressure to produce fresh content with fewer editorial resources in order to remain competitive, those that embrace semantic technology will win on many levels.

Tony Agresta is the Managing Director of Ontotext USA. Ontotext was established in 2000 to address challenges in semantic technology using text mining and graph databases.



One Comment

  1. Posted April 30, 2015 at 10:14 am

    thx for this overview!

    The irony is that the semantic web, which was envisioned as being a tool to open up and interconnect the web, is beginning to get traction mostly ‘under the hood’

    We’ve found that it’s the internal business case (data integration) that gets organizations with lots of content into semantic technologies (like in the BBC example you gave).

    What are your thoughts on this?
