When a pharmaceutical company moves a new drug from phase 2 testing to phase 3, the development can have big implications for a rival company’s plans. Spotting this valuable nugget of information in an obscure journal or news article amid the mighty flow of online information is a major data analytics challenge.
Data management software vendor Cambridge Semantics is taking an unusual approach to this text analytics problem. The company is tapping a set of standards-based open source tools originally designed to make the Web as a whole smarter. Its Anzo platform is based on technology from the open source Anzo Project, which in turn is an implementation of the W3C's Semantic Web standards.
According to Cambridge Semantics, five of the top 10 pharmaceutical companies are using its Anzo platform for competitive intelligence tasks. Company officials declined to identify the pharmaceutical companies, but Cambridge Semantics lists giants Biogen Idec and Merck as customers.
The Anzo software combines data from private databases and spreadsheets with data from curated information sources like Thomson Reuters Life Sciences and unstructured data from news stories, press releases, blogs, and scientific journal articles. The software unifies these disparate forms of data using the Semantic Web technologies. This makes it possible to automatically connect, for example, an item from a news article with an item in an internal company database.
The main pieces of the Semantic Web are the Resource Description Framework (RDF), the SPARQL Protocol and RDF Query Language (SPARQL), and the Web Ontology Language (OWL). RDF is a model for metadata, the data about data, including classifications of terms that make information easier to organize and access. SPARQL is a query language for retrieving information stored in RDF format. OWL is a language for creating ontologies, which are sets of related concepts.
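The core idea behind these standards is simple: every fact is a subject-predicate-object triple, and queries are patterns matched against those triples. The following sketch illustrates that model in plain Python, with made-up drug-trial data; a real system would use an RDF store and actual SPARQL, not this toy matcher.

```python
# Illustrative sketch of the RDF triple model and a SPARQL-style pattern
# query, in plain Python. The data and "ex:" names are hypothetical.

# Every RDF fact is a (subject, predicate, object) triple.
triples = [
    ("ex:DrugX", "rdf:type",       "ex:DrugCandidate"),
    ("ex:DrugX", "ex:trialPhase",  "3"),
    ("ex:DrugX", "ex:developedBy", "ex:RivalPharma"),
    ("ex:DrugY", "rdf:type",       "ex:DrugCandidate"),
    ("ex:DrugY", "ex:trialPhase",  "2"),
]

def query(triples, subject=None, predicate=None, obj=None):
    """Match triples against a pattern; None plays the role of a SPARQL variable."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# "Which drug candidates are in phase 3?" -- roughly the analog of:
#   SELECT ?drug WHERE { ?drug ex:trialPhase "3" }
phase3 = [s for s, p, o in query(triples, predicate="ex:trialPhase", obj="3")]
print(phase3)  # ['ex:DrugX']
```

Because nothing about the triple shape is specific to one data source, facts extracted from a journal article and rows mapped from a corporate database can land in the same list and be queried together.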
“Enterprises continue to struggle to know what data they actually have and what it really means,” said Sean Martin, Cambridge Semantics’ chief technology officer. “This is increasingly difficult given the myriad of distinct—and growing number of—data sources, the endless diversity of data models and formats, and the relative inflexibility of the software tools,” he said.
Semantic Web technologies can help solve these problems, Martin said.
The W3C’s Semantic Web standards facilitate describing, integrating, querying and reasoning about data at a conceptual level, said Martin. “These standards were designed from the very start to play nice together and at Web scale,” he said.
The standards can be applied to structured data by direct mapping techniques and to unstructured data using natural language processing techniques, said Martin. “You never know what type of fact is going to leap out of a sentence in a document,” he said. “You don’t want to have to ignore it because you didn’t expect it up front.”
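To make the two paths Martin describes concrete, here is a deliberately simplified sketch: a regex stands in for real natural language processing, pulling one fact out of a sentence into triple form, while a record from an internal database is mapped directly into the same form. All names and data are invented for illustration.

```python
import re

# Toy illustration (a regex standing in for real NLP): extract a trial-phase
# fact from a sentence and express it as a subject-predicate-object triple.
sentence = "RivalPharma moved DrugX into phase 3 trials this week."

extracted = []
match = re.search(r"(\w+) moved (\w+) into phase (\d+)", sentence)
if match:
    company, drug, phase = match.groups()
    extracted.append((drug, "trialPhase", phase))
    extracted.append((drug, "developedBy", company))

# A record from a hypothetical internal database, mapped directly
# (no extraction needed) into the same triple form.
internal = [("DrugX", "competesWith", "OurDrugA")]

# Because both sources now share one data model, integrating them
# is just combining the triples.
graph = extracted + internal
print(graph)
```

The point of the exercise is the last line: once structured and unstructured sources are both reduced to triples, the unexpected fact from the news article sits alongside the internal record and can be queried as one graph.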
Challenges for the Semantic Web Approach
Not everyone is sold on using Semantic Web technologies for text analytics.
For text analytics, scalability is a hurdle, said Seth Grimes, an industry analyst and consultant and organizer of the Sentiment Analysis Symposium. So is getting at all the information to be found in sources, and accounting for the context in which it appeared and for the user’s needs. Another hurdle for text analytics is that no single source or type of data gives a complete picture, he said. “The Semantic Web is wholly incapable of addressing these hurdles,” said Grimes. “The Semantic Web just wasn’t designed for comprehensive analytics.”
The issue boils down to using a single platform—Anzo, in the case of Cambridge Semantics—versus tailoring a custom system from a choice of tools. “The biggest challenge is to find technology that fits your business problem and to apply it right,” said Grimes. “The key is semantic technologies, not adherence to limited-scope Semantic Web technologies nor SQL database orthodoxy nor any other data-management dogma,” he added.
Text analytics uses natural language processing to extract entities and metadata such as title, author, and publication date, as well as topics, facts, events, relationships, sentiment, and opinions, said Grimes. All this information is an important part of business intelligence, data mining, and automated text processing initiatives, he said. There are dozens of text analytics products on the market, including offerings from IBM, HP, and SAP, said Grimes.
Irrespective of the particular technology, text analytics is a difficult problem, Grimes said. “It’s hard to get machines to understand human communications, and even harder to join data that originate in disparate sources and to detect signals that aren’t even apparent until you fuse those sources.”
Expecting the Unexpected
Lee Feigenbaum, vice president of technology and client services at Cambridge Semantics, said Semantic Web technology’s flexibility makes it appropriate for the text analytics challenge because it can accommodate unpredictable entities, relationships, and events. “You’re going to be dealing with extremely ragged and unpredictable information coming out of the text analytics processes, and you’re going to need something with the flexibility of RDF and SPARQL to deal with that heterogeneity,” he said.
The idea that Semantic Web technologies are limited is based on the common misconception that Semantic Web ontologies need to be developed in a top-down fashion—that they need to be created first and then data mapped to them, said Feigenbaum. “While that’s the case sometimes, it’s also the case that ontologies can evolve in a bottom-up fashion, driven by the data itself.”
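Feigenbaum's bottom-up point can be sketched in a few lines: a triple store has no fixed table schema, so a kind of fact nobody anticipated can be added without migration, and the ontology in use can be read back out of the data. This is an illustrative sketch with invented data, not a depiction of Anzo's implementation.

```python
# Sketch of bottom-up ontology growth: no fixed schema, so an
# unanticipated kind of fact needs no migration step. Data is invented.
graph = [
    ("DrugX", "trialPhase", "3"),
    ("DrugX", "developedBy", "RivalPharma"),
]

# A new, never-before-seen kind of fact arrives from text analytics;
# appending it requires no schema change.
graph.append(("DrugX", "adverseEvent", "headache"))

# The emergent "ontology" is simply the set of properties the data uses,
# derived from the data itself rather than declared up front.
properties = sorted({p for _, p, _ in graph})
print(properties)  # ['adverseEvent', 'developedBy', 'trialPhase']
```

Contrast this with a relational table, where accommodating the new `adverseEvent` fact would typically mean altering the schema before any such row could be stored.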
Eric Smalley is a freelance writer in Boston. He is a regular contributor to Wired.com. Follow him on Twitter at @ericsmalley.