We are living in a golden age of predictive analytics and its constituent techniques of data mining and machine learning. New and ever-more clever algorithms are being invented by researchers from diverse backgrounds, and they are made available in software packages that have a variety of options, interfaces, and costs. Instruction in the methods and software is also available through formal degrees and short courses, as well as various Internet videos and other resources.
Modern Predictive Analytics (MPA – the “Modern” refers to the use of automated methods) is thus an eminently viable basis for the practice of data science, which combines techniques, software, and data acquisition with domain-specific knowledge to help real people – not just analysts – solve real-world problems.
And therein lies a major difficulty of data science practice. Not everyone in the MPA loop is up to date on the new vocabulary and capabilities of predictive analysis. A business executive with a numerical problem is unlikely to ask specifically for a regression or a tree, preferring to describe the operational or marketing aspects of her problem. (“I need to find a cause for this uptick in cellular handset problems. I don’t care exactly how you do this, just make sure it’s useful and I can sell it to my boss.”) And when such executives do incorporate MPA ideas into their narrative, the ideas are often dated, simplistic, or irrelevant. (“Do a linear regression of Overall Satisfaction on Customer Service Satisfaction in this survey, and make sure its R² is 90 percent!”)
Therefore, perhaps the single most important aspect of the MPA/Data Mining process is not automated and perhaps not automatable: the intelligent translation of the narrative of a business problem into an analyzable project. The technical MPA requirements for this can be quite modest: a few regression, tree, and association tools go a long way in problem solving when the provisional goal is to get to, “Yes, I can help you.” Of course, an ultimately good solution may well involve complicated, perhaps dirty data and an array of sophisticated algorithms, but the hard part is getting the project started so all parties are comfortable with its initial direction.
The translation part is akin to literary criticism in the sense that we are constantly asking why our business narrator uses certain sets of words, and what they evoke in her and in us. For example, in trying to calculate a customer’s Tier—a desirability index—from just his monthly payment rather than the more complicated notion of profit, consider this quote from a business speaker: “Profit depends on many accounting assumptions and is hard to convincingly calculate. We think that profit is closely related to revenue alone and using only this information to calculate the threshold for a customer’s Tier would save my department a lot of money and time every month.” The attentive data scientist would likely downplay the first sentence, focus on the phrase “closely related,” and visualize a simple scatterplot of revenue versus profit color coded by Tiers, from which several quantitative solutions (as suggested by the word “calculate”) would present themselves.
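One of the quantitative solutions that such a scatterplot suggests can be sketched in a few lines of code. The sketch below is purely illustrative, with hypothetical (revenue, Tier) data invented for the example: if profit really is closely related to revenue, then a customer's Tier can be approximated by a single revenue cutoff, found by searching for the threshold that misclassifies the fewest known customers.

```python
# Hypothetical (monthly_revenue, tier) pairs; tier 1 = high-value customer.
# These numbers are invented for illustration, not taken from the article.
customers = [
    (20, 0), (35, 0), (40, 0), (55, 0), (60, 1),
    (75, 1), (80, 1), (95, 1), (110, 1), (42, 0),
]

def best_threshold(data):
    """Return the revenue cutoff that misclassifies the fewest customers,
    along with the number of misclassifications at that cutoff."""
    candidates = sorted(rev for rev, _ in data)
    best, best_errors = None, len(data) + 1
    for t in candidates:
        # Classify a customer as high-Tier when revenue >= t.
        errors = sum((rev >= t) != bool(tier) for rev, tier in data)
        if errors < best_errors:
            best, best_errors = t, errors
    return best, best_errors

threshold, errors = best_threshold(customers)
print(threshold, errors)  # -> 60 0 for this toy data
```

Once the department has a cutoff like this, the monthly Tier calculation reduces to a single comparison against each customer's bill, which is exactly the kind of savings the business speaker was asking for.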
Some experienced analysts, of course, perform this translation naturally, but for others I have found it helpful to develop the two sides of the needed translation. One side is listening for words and phrases and weighing them carefully. The other side is to build, or reinforce, images and other thought fragments of what a predictive solution would look like: a scatterplot, a low-dimensional function, or a concise rule, for example. If the translation process is having someone “ring a bell” in your mind, then the bell must be made to produce its sound, and we have to find a context in which the sound is meaningful.
Both the listening and the application of MPA techniques can be fun, but is there any financial value to these complementary exercises? A few years ago, a landline phone company president asked how to identify electrical cable that was anecdotally thought to be trouble-prone, with the words, “When does this stuff go bad?” With a scatterplot of repairs as a function of cable age and a logistic regression-like model, we found that the existing cable did not, in fact, go bad with age and that most of the troubled cable had long since been replaced. Consequently, the president got to use his replacement money on other projects more important to him and to his customers. The savings: $30 million, at a cost of a few days of analyst time and a few hundred dollars of software.
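The shape of that analysis is easy to reproduce on synthetic data. In the sketch below (simulated records, not the actual Verizon data), the true repair probability is constructed to be independent of cable age; bucketing the cables by age and comparing repair rates then shows the flat profile that, in the real project, answered the president's question.

```python
import random

random.seed(42)
TRUE_RATE = 0.05  # repair probability, independent of age by construction

# Simulated (cable_age_years, had_repair) records -- illustrative only.
records = [(random.randint(1, 40), random.random() < TRUE_RATE)
           for _ in range(20000)]

def repair_rate_by_bucket(data, width=10):
    """Group cables into age buckets and compute each bucket's repair rate."""
    buckets = {}
    for age, repaired in data:
        b = (age - 1) // width
        total, hits = buckets.get(b, (0, 0))
        buckets[b] = (total + 1, hits + repaired)
    return {b: hits / total for b, (total, hits) in sorted(buckets.items())}

rates = repair_rate_by_bucket(records)
print(rates)  # every age bucket hovers near 0.05 -- no aging effect
```

A fitted logistic regression would tell the same story more formally (an age coefficient near zero), but even this bucketed view is often enough to show a decision-maker that age is not the driver.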
The golden age of MPA, indeed.
Dr. James Drew teaches data mining at Worcester Polytechnic Institute and has been an internal statistics and data mining consultant at Verizon Communications for 30 years. Jim holds two post-graduate degrees in mathematics from Cambridge and a Ph.D. in statistics from Iowa State.