Analytics Lessons from Spy Work: Machine Learning Applied to Unstructured Data

by Eric Smalley   |   December 4, 2012 4:26 pm

When it comes to extracting valuable information from vast seas of unstructured data, the U.S. government’s intelligence services have been there, done that.

Now the vendors that supply data analytics tools to the government are bringing their expertise to the commercial sector as big data becomes established in the business world. The analytics used by the government can provide a big competitive advantage for many businesses in applications that range from sentiment analysis and market research to risk management and data security. There are also important caveats, and companies need to be careful when drawing lessons from the public sector’s data analytics experience.

Much of what the intelligence community does with its copious hardware resources is semantic analysis of text: sifting billions of documents, news reports and social media posts for records of relevant people, places and events, and the relationships among them. Twitter hit 400 million tweets per day this year and Facebook surpassed one billion monthly active users. Making sense of this torrent of human-generated text requires clever algorithms as well as the brute force of data centers and cloud computing.

Text Mining for Meaning
Text mining algorithms are at the heart of the commercial data analytics software used by government agencies. The algorithms extract entities: people, places and events. The more sophisticated the algorithms are, the deeper their understanding of human language. Many of them tap natural language processing to extract relationships and context.
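As a rough illustration of entity extraction, the sketch below tags people, places and organizations by looking names up in a small hand-built list. This is only a toy: the gazetteer, the function name extract_entities and the sample sentence are assumptions made for the example, and the commercial tools discussed here rely on far more sophisticated language processing than a dictionary lookup.

```python
import re

# Hypothetical gazetteer: a tiny hand-built lookup of known names by entity type.
GAZETTEER = {
    "person": ["Christopher Ahlberg", "Tim Estes"],
    "place": ["Greece", "South America", "Boston"],
    "organization": ["Recorded Future", "Digital Reasoning"],
}


def extract_entities(text):
    """Return sorted (start_offset, entity_type, name) tuples for every gazetteer hit."""
    hits = []
    for entity_type, names in GAZETTEER.items():
        for name in names:
            # Word boundaries keep a name from matching inside a longer word.
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text):
                hits.append((match.start(), entity_type, name))
    return sorted(hits)


sample = "Christopher Ahlberg of Recorded Future asked who is likely to travel to Greece."
entities = extract_entities(sample)
for offset, entity_type, name in entities:
    print(offset, entity_type, name)
```

A lookup like this finds only names it already knows, which is one reason the more sophisticated systems lean on natural language processing to recognize entities from context.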

Some of the algorithms are designed for specific tasks. For example, Recorded Future has a system that extracts events in time from human-created text on the Web. “We basically extract people, places, organizations and the events and time points that they are involved in, and then make that available for analysis,” said Christopher Ahlberg, Recorded Future’s CEO. “You can ask questions like ‘who is likely to travel to Greece over the next month’ or ‘what events are going to unfold in South America over the next two weeks’,” he said.
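Once events have been extracted with their time points, a question like "who is likely to travel to Greece over the next month" becomes a filter over structured records. The following is a minimal sketch under that assumption; the Event fields, the made-up records and the who_travels_to helper are hypothetical and do not represent Recorded Future's actual data model or API.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Event:
    who: str
    action: str
    where: str
    when: date


# Made-up records standing in for events extracted from Web text.
EVENTS = [
    Event("Person A", "travel", "Greece", date(2012, 12, 20)),
    Event("Person B", "travel", "Greece", date(2013, 3, 1)),
    Event("Person C", "travel", "Brazil", date(2012, 12, 10)),
]


def who_travels_to(place, start, days):
    """Names of people with a travel event to `place` within `days` of `start`."""
    end = start + timedelta(days=days)
    return [
        event.who
        for event in EVENTS
        if event.action == "travel"
        and event.where == place
        and start <= event.when <= end
    ]


travelers = who_travels_to("Greece", date(2012, 12, 4), 30)
print(travelers)
```

The hard part in practice is the extraction step that produces such records from free text; the query itself is simple once the people, places and time points are structured.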

Some of the algorithms are derived from years of natural language processing research. “The computer has to understand words. It has to understand nouns. It has to understand verbs. It has to understand tense. It has to pull those things out and connect them together into concepts or events,” said Peter Coddington, CEO of inTTENSITY, a company that makes semantics-based language analysis tools.

Other algorithms are based on years of research in machine learning. The idea is for the system to automatically model data and build ontologies and rules, and learn from experience. The more data the system is exposed to, the smarter it gets, “and therefore, the more productive that you would get with it,” said Tim Estes, CEO of Digital Reasoning. “You have to choose between a strategy where you map what the world means or [one where you] learn what the world means,” said Estes.
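The contrast between mapping and learning can be sketched with a toy model that starts with no rules and simply counts which entity type tends to follow a given context word. The ContextLearner class and its training pairs are invented for illustration; this is a bare frequency count, not a description of Digital Reasoning's technology.

```python
from collections import Counter, defaultdict


class ContextLearner:
    """Learns which entity type tends to follow a given context word."""

    def __init__(self):
        # context word -> Counter of entity types observed after it
        self.counts = defaultdict(Counter)

    def learn(self, context_word, entity_type):
        self.counts[context_word.lower()][entity_type] += 1

    def predict(self, context_word):
        seen = self.counts.get(context_word.lower())
        if not seen:
            return None  # never exposed to this context, so no guess
        return seen.most_common(1)[0][0]


learner = ContextLearner()
before = learner.predict("to")  # nothing learned yet

# Made-up observations: (word preceding the entity, type of the entity).
TRAINING = [
    ("to", "place"),
    ("to", "place"),
    ("to", "person"),
    ("of", "organization"),
    ("of", "organization"),
]
for context_word, entity_type in TRAINING:
    learner.learn(context_word, entity_type)

after = learner.predict("to")
print(before, after)
```

No one wrote a rule saying that "to" often precedes a place; the model picked that up from examples, and more examples would sharpen or revise its guesses.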

“The intelligence community understood long ago that language is subtle and must be understood in context,” said Estes. “This means that any solution that requires a fixed ontology or rules programming is too inflexible and expensive to be effective. Machine learning and automation is a key requirement of an enterprise solution,” he said.

“Big data essentially means we’ve gotten so good at communicating and transferring information and replicating it digitally that it’s substantially outstripped our ability to organize it, analyze it, make sense of it, and take action on it,” said Estes. “Understanding the data is the answer, and that can be done with better algorithms,” he said.

Relationships, Correlated
Intelligence services traditionally use data analytics for geopolitical analysis — understanding the capabilities and intent of countries — and network analysis — understanding networks of bad guys: who’s connected to whom, what they do and where they operate, said Ahlberg. At first glance, this doesn’t seem relevant to the commercial sector, but businesses are beginning to adapt the technology for competitive intelligence, he said.
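A simplified picture of network analysis: treat two entities as connected when they are mentioned in the same document, then search the resulting graph for the shortest chain linking two of them. The documents, names and helper functions below are all hypothetical, and co-mention is only one assumed way of defining a connection.

```python
from collections import defaultdict, deque
from itertools import combinations

# Made-up documents, each reduced to the set of entities it mentions.
DOCUMENTS = [
    {"Alice", "Bob"},
    {"Bob", "Carol"},
    {"Carol", "Dave"},
    {"Erin"},
]


def build_graph(documents):
    """Link every pair of entities mentioned in the same document."""
    graph = defaultdict(set)
    for entities in documents:
        for entity in entities:
            graph[entity]  # register entities that appear alone
        for a, b in combinations(sorted(entities), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def shortest_chain(graph, source, target):
    """Breadth-first search for the shortest chain of connections, or None."""
    if source not in graph or target not in graph:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbor in sorted(graph[path[-1]]):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


graph = build_graph(DOCUMENTS)
chain = shortest_chain(graph, "Alice", "Dave")
print(chain)
```

The same search answers a commercial question such as who in a company's network is best placed to make an introduction to a given contact.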

Intelligence Players

IT heavyweights IBM, Oracle and SAS sell data analytics tools to all manner of government clients. Here are some smaller software companies that specialize in high-end analytics for the intelligence services.

  • Basis Technology: text analytics for social media and electronic discovery.
  • Datameer: data analytics and visualizations designed for large datasets and Hadoop.
  • Digital Reasoning: entity-oriented analytics engine for automated text analytics.
  • Ikanow: open source document-centric analytics platform.
  • MarkLogic: searchable databases for unstructured data.
  • Orbis Technologies: enterprise semantic search, text and data analytics.
  • Qbase: geographic search and business intelligence tools.
  • Recorded Future: intelligence analysis for Web-based content.
  • Thetus: multi-source analysis tools for unstructured data.

The capabilities of the technology are fairly broad, and businesses are beginning to use it in different ways, said Estes. “We’ve seen more convergence as we’ve gotten into more markets,” he said. “It’s fundamentally about human communication and the questions that you want to answer from it. Who talked to who, where and about what? Who knows who? Who is the best person to get in touch with someone? Where are they going to be?”

A key use of the technology is harvesting information for business intelligence applications. The speed of business has increased and agility is extremely important, said Stefan Groschupf, CEO of Datameer. “Time is the only thing you can’t buy with money. It’s really about very, very crisp targeting and messaging, and being very agile in reacting to changes in the market,” he said.

Intelligence agencies always seek to be more proactive, said Gary Bloom, CEO of MarkLogic. “They look to be able to predict events in order to be able to either prevent them or manage them to best effect.” In business, competitive advantage often lies in the ability to spot trends that can be capitalized on, he said.

Businesses may be less likely than intelligence services to use the systems to learn as much as possible about particular individuals, but the sets of relationships and facts the systems contain can also be used to find people or organizations that match defined behaviors, said Ahlberg. For example, a manufacturer could search for retailers in specific regions that carry competitors’ products and have high favorability ratings from customers.

Sentiment Analysis and Risk Management Use Cases
Businesses are also adopting data analytics applications long used by the intelligence services, most notably sentiment analysis. For example, the State Department monitors social media to understand what the citizens of a country think of its government. “It’s very important to understand the sentiment of populations that are openly talking on Twitter or Facebook,” said Coddington. “It’s like a constant polling tool.”

Businesses are embracing sentiment analysis to monitor their brands. “The United States of America is, in essence, a brand,” said Coddington. “A lot of people want to know what people are thinking or doing or saying about us worldwide,” he said.
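The "constant polling" idea can be illustrated with a very small lexicon-based scorer that counts positive and negative words in each post and tallies the results. The word lists and sample posts are invented for this sketch; the semantics-based tools described in this article go well beyond word counting and can handle negation, sarcasm and context that this approach misses.

```python
import re
from collections import Counter

# Hypothetical sentiment lexicons, kept tiny for illustration.
POSITIVE = {"good", "great", "love", "trust"}
NEGATIVE = {"bad", "corrupt", "hate", "broken"}


def score(post):
    """Positive-word count minus negative-word count."""
    words = re.findall(r"[a-z']+", post.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label(post):
    value = score(post)
    if value > 0:
        return "positive"
    if value < 0:
        return "negative"
    return "neutral"


# Made-up posts standing in for a social media feed.
posts = [
    "I love the new policy, great news",
    "The system is corrupt and broken",
    "The vote is on Tuesday",
]
tally = Counter(label(post) for post in posts)
print(tally)
```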

Managing risk is a large part of what the government uses data analytics for, and that’s a natural fit for business. Recorded Future has a series of new customers who use the technology for corporate security, including monitoring for cyber threats and threats to their supply chains, said Ahlberg.

A significant amount of work in the intelligence community is focused on developing a complete understanding of a person in place and time, which is similar to fraud, risk and compliance issues in the commercial sector, Estes said. “We are trying to understand an individual as completely as possible using all sources of information available – including unstructured sources.”

The reliability and security of data analytics systems are other areas where the needs of intelligence services and the private sector align. “In the intelligence community, if you get the answer wrong, lives might be at stake,” said Bloom of MarkLogic. These agencies can’t afford to lose data or experience transactional inconsistency that would lead to inaccurate data being used, he said. While businesses don’t typically deal with life-and-death situations, system and data reliability are essential as integrating data sources becomes a mission-critical part of a company’s core applications and systems, he said.

Likewise, intelligence services’ requirements for data security are stringent. “You can imagine the security restrictions that go with these cloud and Hadoop infrastructures,” said Brian Ippolito, CEO of Orbis Technologies. “Well, it turns out our commercial clients have the same problem. They need a similar ecosystem,” he said. For example, health care insurers and providers are required by the Health Insurance Portability and Accountability Act (HIPAA) to protect patients’ personal information.

A Different Signal-to-Noise Ratio in Business
While there are many similarities, there also are important differences between how intelligence services and businesses use analytics. For example, intelligence services generally have to deal with worse signal-to-noise ratios. There is less problematic and erroneous data in the enterprise, said Estes.

Intelligence services are also more often focused on finding low-probability information, dubbed needles among needles, said Ippolito of Orbis. Vulnerability assessments, for example, typically look for one-in-a-million chances of failure because the higher probability failure modes are usually already accounted for, he said.

Needles among needles are often found by tracking implicit relationships in addition to the direct connections between entities. The key to discovering relationships when one entity is unidentified is drawing inferences, which is very computationally intensive. IBM’s Watson is one of the few commercial systems that can do this.
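One simple form of implicit relationship is a pair of entities that are never linked directly but share connections to the same third parties. The sketch below scores such pairs by their number of shared neighbors. The data and the implicit_links function are hypothetical, and this is not a description of how Watson or any vendor's product draws inferences.

```python
from collections import defaultdict

# Made-up direct connections, for example supplier-to-factory relationships.
DIRECT_LINKS = [
    ("Supplier X", "Factory 1"),
    ("Supplier X", "Factory 2"),
    ("Supplier Y", "Factory 1"),
    ("Supplier Y", "Factory 2"),
    ("Supplier Z", "Factory 2"),
]


def implicit_links(direct_links):
    """Pairs with no direct link, ranked by how many neighbors they share."""
    neighbors = defaultdict(set)
    for a, b in direct_links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    inferred = {}
    entities = sorted(neighbors)
    # Every pair of entities is compared, so the work grows quadratically.
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            if b in neighbors[a]:
                continue  # already directly connected
            shared = neighbors[a] & neighbors[b]
            if shared:
                inferred[(a, b)] = len(shared)
    # Strongest implied relationships first, ties broken alphabetically.
    return sorted(inferred.items(), key=lambda item: (-item[1], item[0]))


ranked = implicit_links(DIRECT_LINKS)
for pair, shared_count in ranked:
    print(pair, shared_count)
```

Even this naive version compares every pair of entities, which hints at why inference over billions of documents demands so much computing power.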

Even when they stick to direct connections, businesses gain a lot. “We’re generating regulatory reports in seconds versus hundreds of man-hours,” said Ippolito. “There’s too much written, there’s too much information, there’s too much at stake to not use computers,” said Coddington.

Eric Smalley is a freelance writer in Boston. He is a regular contributor to  Follow him on Twitter at @ericsmalley.

