How’s this for a data analytics challenge? Consider the thousand or so variables that an operating jetliner records every second. Add reports written by pilots and others in the air traffic system. Multiply that by nearly 10 million flights a year in the U.S.
Your task: extract information from this mountain of data—much of which is in unstructured text files—so you can predict and prevent safety problems.
That mission is the responsibility of Ashok Srivastava, Principal Scientist for Data Mining and Systems Health Management at NASA. It’s a daunting challenge, but Srivastava’s team has made enough progress that Southwest Airlines uses the NASA technology in the company’s operational safety program. The airline and NASA have been working together since 2008 on the data mining project.
Predictive analytics is an emerging focus in the data analytics field. It’s all about taking massive data sets and looking for precursors to interesting events, said Srivastava. “In our case it’s an aviation safety event, but in other applications it could be looking for predictors for a medical event or it could be looking for predictors of a change in the stock market,” he said.
The information is there. The question is developing the right tools to uncover it. In 2010, Srivastava conducted a demonstration of the potential for data mining for safety applications, publishing a NASA analysis of flight data which uncovered instances of a type of mechanical problem—excessive wear of the threads on a critical nut—that caused the fatal crash of an Alaska Airlines flight in 2000.
The NASA project uses text analytics—algorithms that automatically identify useful information in text documents. Text analytics is a big part of the big data trend because text data falls outside the bounds of traditional information management tools like relational databases, and because there’s a lot of it.
The NASA data miners analyze very large text data sets in the hunt for factors that might contribute to aviation safety incidents. “Let’s say you have 100,000 reports that talk about different things going on in the aviation system,” said Srivastava. “People might be talking about engine problems, they might be talking about problems understanding signage in an airport, or they might be talking about confusing runways,” he said.
Srivastava’s team is developing machine learning algorithms that can identify patterns and spot anomalies in large text-based data sets. One of the team’s key algorithms—a multiple kernel learning algorithm—combines information from multiple data sources, such as numerical and text data.
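To make the idea of combining data sources concrete, here is a minimal sketch in Python. It is not NASA's algorithm: a real multiple kernel learning method learns the kernel weights from data, while this toy fixes them by hand. The record layout, the `weight_text` parameter, and the sample reports are all invented for illustration. Each record pairs a short text report with a few numeric readings; one kernel measures text similarity, another measures numeric similarity, and a weighted sum of the two is used to flag the record least similar to the rest.

```python
import math
from collections import Counter


def text_kernel(a, b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def numeric_kernel(x, y, gamma=0.5):
    """RBF (Gaussian) kernel on two numeric feature vectors."""
    squared_distance = sum((p - q) ** 2 for p, q in zip(x, y))
    return math.exp(-gamma * squared_distance)


def combined_kernel(rec_a, rec_b, weight_text=0.5):
    """Convex combination of the two kernels.

    Assumption: the weight is fixed by hand here; a multiple kernel
    learning algorithm would learn it from the data.
    """
    return weight_text * text_kernel(rec_a["text"], rec_b["text"]) + (
        1.0 - weight_text
    ) * numeric_kernel(rec_a["numeric"], rec_b["numeric"])


def anomaly_scores(records, weight_text=0.5):
    """Score each record as 1 minus its mean similarity to all other records."""
    scores = []
    for i, rec in enumerate(records):
        sims = [
            combined_kernel(rec, other, weight_text)
            for j, other in enumerate(records)
            if j != i
        ]
        scores.append(1.0 - sum(sims) / len(sims))
    return scores


# Hypothetical sample data: a text report plus two numeric sensor readings.
records = [
    {"text": "normal approach and landing", "numeric": [0.1, 0.2]},
    {"text": "normal approach smooth landing", "numeric": [0.2, 0.1]},
    {"text": "normal climb and landing", "numeric": [0.0, 0.2]},
    {"text": "engine vibration warning during climb", "numeric": [3.0, 2.5]},
]

scores = anomaly_scores(records)
most_anomalous = max(range(len(records)), key=scores.__getitem__)
print(most_anomalous, [round(s, 2) for s in scores])
```

In this toy data the last record stands out on both the text and the numeric side, so it receives the highest anomaly score. Neither kernel alone would be as robust if a record were unusual in only one source, which is the motivation for combining them.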
The NASA project focuses on data sets on the order of 10 terabytes. “We picked that number based on the number of flights that are occurring within the United States and current computing power,” said Srivastava. The team will scale up to larger data sets as the size and complexity of real-world data sets increase, he said.
NASA regularly transfers its technologies to the Federal Aviation Administration, which uses the algorithms on much larger, much more complex data sets, said Srivastava. NASA is also sharing the technology with the aviation industry, including Southwest Airlines, he said. Many of the algorithms are open source and available on NASA’s DASHlink site.
“The algorithms we’re developing can discover precursors to aviation safety incidents. We’ve already seen that happen,” said Srivastava. As NASA refines the algorithms and deploys them on real systems, and as it shares them with air carriers, new trends will be discovered, said Srivastava. “And some of those might have safety consequences,” he said.
Parsing Text for Sentiment Analysis
Text analytics isn’t just for big government agencies or life-and-death issues. It’s rapidly emerging as a valuable marketing tool. “Text has never been more interesting than it is now with the huge volume of text-based data that exists in social networking sites,” said Jamie Popkin, managing vice president and Gartner Fellow Emeritus at market research firm Gartner Inc.
The overwhelming amount of online content in general—social network sites, wikis, blogs, user forums, e-commerce sites—is text-based. A major thrust in commercial data analytics is correlating information derived from text analytics with data from transaction systems, said Popkin. For example, a company might link business intelligence output from a data warehouse with text data from a customer service center or with things that people are saying on social networks, he said.
There are two approaches to text analytics: linguistic models and machine learning, said Popkin. The linguistic approach uses natural language processing to attempt to understand the meaning of the text data. Machine learning algorithms like those NASA is developing identify patterns in text-based data. “Most people are finding that there is a hybrid approach: the combination of the machine learning and the linguistic models,” he said.
A hot topic in text analytics is sentiment analysis—figuring out from people’s written words what they like and dislike. “I want to know whether you like a particular feature and whether that like of that feature is something that might drive your intent,” said Popkin.
One use of sentiment analysis: understanding which preferences push individuals to make purchasing decisions. Sentiment analysis also measures strength of sentiment. “How much do you like this, how much do you hate this, how much would this affect your opinion on something,” said Popkin.
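As a rough illustration of measuring both the direction and the strength of sentiment, here is a small lexicon-based scorer in Python. It is a sketch under stated assumptions, not a description of any commercial product: the word lists, their weights, and the `sentiment_score` function are invented for this example, and production systems typically combine far richer linguistic models with machine learning.

```python
# Hypothetical toy lexicon: positive weights mean "like", negative mean "dislike".
LEXICON = {
    "love": 2.0,
    "like": 1.0,
    "great": 1.5,
    "dislike": -1.0,
    "hate": -2.0,
    "terrible": -1.5,
}
# Words that scale the strength of the next sentiment word.
INTENSIFIERS = {"really": 1.5, "very": 1.5, "slightly": 0.5}
# Words that flip the sign of the next sentiment word.
NEGATIONS = {"not", "never"}


def sentiment_score(text):
    """Return a signed score: the sign is the direction, the size is the strength.

    Simplification: intensifiers and negations apply to the next sentiment
    word found, however far away it is.
    """
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    score = 0.0
    multiplier = 1.0
    for tok in tokens:
        if tok in INTENSIFIERS:
            multiplier *= INTENSIFIERS[tok]
        elif tok in NEGATIONS:
            multiplier *= -1.0
        elif tok in LEXICON:
            score += multiplier * LEXICON[tok]
            multiplier = 1.0  # modifiers are consumed by this sentiment word
    return score


for review in [
    "I really love the camera",
    "I hate the battery",
    "I do not like the screen",
    "The phone arrived on Tuesday",
]:
    print(f"{sentiment_score(review):+.1f}  {review}")
```

The signed score separates likes from dislikes, and its magnitude gives a crude measure of how strongly the writer feels. Tying those scores to specific product features, and then to purchase intent, is the harder problem the article describes.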
Sentiment analysis is also emerging as an important political tool. “Which candidate do you like, what aspects of the candidate do you like, how strongly do you feel about certain positions being taken by one candidate versus another,” said Popkin. And the all-important question, “will this affect your voting position,” he said.
Eric Smalley is a freelance writer in Boston. He is a regular contributor to Wired.com. Follow him on Twitter at @ericsmalley.