Doctors, judges, business executives, and many other people are faced with making critical decisions that can have profound consequences. For example, doctors decide which treatment to administer to patients, judges decide on prison sentences for convicts, and business executives decide to enter new markets and acquire other companies. In the current age of big data, such decisions are increasingly supported, across all domains and sectors, by predictions from analytics algorithms that are learned from historical data.
However, with this growth in the use of analytics, predictions and other outputs of machine-learning algorithms are being presented to users who have limited analytics, data, and modeling literacy. The latest trend in artificial intelligence (AI) and machine learning is using very sophisticated systems involving deep neural networks with many complex layers, kernel methods, and large ensembles of diverse classifiers. While such approaches produce impressive, state-of-the-art prediction accuracies, they give little comfort to decision makers, who must trust their output blindly because very little insight is available into their inner workings or the provenance of how a decision was made.
Therefore, it is imperative to develop consumable analytics and interpretable machine-learning methods in order for predictions to be adopted and trusted by decision makers and for the analytics to have real-world impact. It has been frequently noted, and we also find in our dealings with domain experts who have limited machine-learning literacy, that rule sets composed of Boolean expressions with a small number of terms are the most well received and trusted outputs.
For example, IBM’s SlamTracker reports keys to winning a tennis match as a simple and-rule. The predictive decision rule for Roger Federer to defeat Andy Murray in the 2013 Australian Open was as follows:
- Win more than 59 percent of four- to nine-shot rallies
- Win more than 78 percent of points when serving at 30-30 or Deuce
- Serve less than 20 percent of serves into the body
The bullet points summarize the and-rule, and this is an interpretable way to automatically summarize the data to a non-technical audience.
(Incidentally, Federer did not satisfy any of these three conditions and lost the match.)
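An and-rule like this maps directly to a one-line Boolean expression. The sketch below encodes the three SlamTracker conditions as a Python function; the match statistics passed in are hypothetical inputs, not data from the actual match.

```python
# A minimal sketch of how SlamTracker-style "keys to the match" reduce to a
# Boolean and-rule. The thresholds are the three reported conditions; the
# match statistics passed in below are hypothetical.

def federer_wins(pct_short_rallies, pct_pressure_points, pct_body_serves):
    """Predict a win only if every condition of the and-rule is satisfied."""
    return (pct_short_rallies > 0.59 and
            pct_pressure_points > 0.78 and
            pct_body_serves < 0.20)

print(federer_wins(0.62, 0.80, 0.15))  # all three conditions hold -> True
print(federer_wins(0.62, 0.80, 0.25))  # body-serve condition fails -> False
```

Because the prediction is just a conjunction of threshold tests on familiar statistics, a coach or commentator can read the rule itself, not merely its output.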
Another, and far less interpretable, way to arrive at the same insights would be to compute kernel evaluations between the new match we want to predict and each of 10,000 stored sample tennis matches. Then take a linear combination of these kernel evaluations and compare it to a threshold. If we have 100 possible features (some of which were automatically selected for our rule), then typically all 100 features would be needed for every kernel evaluation.
Needless to say, it is all but impossible for a human to get insight into how a kernel-based support vector machine (SVM) classifier makes its decision. The accuracy might be higher, but the interpretability is lost, as is the likelihood that the analysis will have any real-world impact.
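To make the contrast concrete, here is a sketch of the kernel-SVM decision just described: a weighted sum of Gaussian-kernel evaluations against stored training matches, compared to a threshold. All feature values and weights are synthetic placeholders, not a trained model.

```python
# Sketch of an RBF-kernel SVM decision function. The stored matches, weights,
# and the new match are all randomly generated stand-ins for illustration.
import math
import random

random.seed(0)
NUM_FEATURES = 100   # e.g., 100 statistics describing each match
NUM_STORED = 1000    # stored training matches (support vectors)

support_vectors = [[random.random() for _ in range(NUM_FEATURES)]
                   for _ in range(NUM_STORED)]
alphas = [random.uniform(-1, 1) for _ in range(NUM_STORED)]  # learned weights
bias = 0.1

def rbf_kernel(x, z, gamma=0.01):
    """Gaussian kernel: every one of the 100 features enters each evaluation."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

def svm_predict(x):
    """Weighted sum of 1,000 kernel evaluations, compared to a threshold."""
    score = sum(a * rbf_kernel(x, sv) for a, sv in zip(alphas, support_vectors))
    return score + bias > 0

new_match = [random.random() for _ in range(NUM_FEATURES)]
print(svm_predict(new_match))  # a bare True/False, with no readable rationale
```

The output is a single yes/no with no clause a person can inspect: the "explanation" is a 1,000-term sum in which every feature of every stored match participates.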
In contrast with black-box classification paradigms such as neural networks, kernel-based support vector machines, and ensemble methods, Boolean rules can be interpreted easily by the practitioner or consumer and provide readily recognizable insight into the phenomenon of interest.
However, learning such rules from training data is a hard problem from the optimization perspective. Most existing rule-learning algorithms are more than two decades old and are heuristic and greedy in nature. By taking a principled optimization viewpoint and formulating rule learning as an extension of an old statistics problem, group testing, and of compressed sensing, an exciting development in statistical signal processing over the last 10 years, we can obtain excellent predictive accuracies with models that are easily interpreted by decision makers and nontechnical users.
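The group-testing view can be illustrated on a toy problem: each candidate Boolean clause acts as a "test," and the goal is a small subset of clauses whose combination reproduces the labels. The authors' actual method solves a principled relaxed optimization; the greedy cover below is only a simplified, hypothetical sketch of the formulation, not their algorithm.

```python
# Simplified sketch of the group-testing view of rule learning: pick a small
# set of candidate clauses whose OR matches the labels. This greedy cover is
# illustrative only; the principled approach relaxes this into an optimization.

def learn_or_rule(clause_outputs, labels, max_clauses=3):
    """clause_outputs[j][i] is True if candidate clause j fires on sample i.
    Greedily select clauses that cover positive samples while never firing
    on any negative sample; return the indices of the chosen clauses."""
    positives = {i for i, y in enumerate(labels) if y}
    negatives = {i for i, y in enumerate(labels) if not y}
    chosen = []
    while positives and len(chosen) < max_clauses:
        # Only clauses consistent with every negative sample are candidates.
        best = max(
            (j for j, col in enumerate(clause_outputs)
             if not any(col[i] for i in negatives)),
            key=lambda j: sum(clause_outputs[j][i] for i in positives),
            default=None)
        if best is None or sum(clause_outputs[best][i] for i in positives) == 0:
            break
        chosen.append(best)
        positives -= {i for i in positives if clause_outputs[best][i]}
    return chosen

# Toy data: 4 candidate clauses evaluated on 6 samples (last two negative).
clauses = [
    [True,  True,  False, False, False, False],  # covers positives 0,1
    [False, False, True,  False, False, True ],  # fires on a negative: excluded
    [False, False, True,  True,  False, False],  # covers positives 2,3
    [True,  False, False, True,  False, False],  # covers positives 0,3
]
labels = [True, True, True, True, False, False]
print(learn_or_rule(clauses, labels))  # -> [0, 2]: two clauses cover all positives
```

The result is a two-clause rule a person can read off directly, which is the sparsity that the group-testing and compressed-sensing formulations pursue in a principled way.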
The key to making analytics simple and transparent is to use variables that the decision maker understands, and to use simple formulas involving only a small number of the most important variables. Many of the more complex classification algorithms create aggregate variables (for example, these may come from principal component analysis) and nonlinear transformations, after which it is very hard for humans to relate to the decision process.
Tyler McCormick is assistant professor of statistics in the Statistics and Sociology departments at the University of Washington. Cynthia Rudin is associate professor of statistics at the Massachusetts Institute of Technology and associated with the school’s Computer Science and Artificial Intelligence Laboratory and the Sloan School of Management. Dmitry Malioutov is a research staff member in the Machine Learning group in the Business Analytics and Mathematical Sciences (BAMS) department at the IBM Thomas J. Watson Research Center. Kush Varshney is a research staff member in the Mathematical Sciences and Analytics department at the IBM Thomas J. Watson Research Center.
Subscribe to Data Informed for the latest information and news on big data and analytics for the enterprise.