Avoiding Pitfalls in Predictive Analytics Models

June 19, 2013

Venkat Viswanathan of LatentView.

Predictive analytics almost always enables better business decisions. Even when customer behavior shifts with the economic climate, a predictive model delivers more value than decisions driven by instinct or gut feeling.

But predictive modeling also has many pitfalls that can seriously skew results. Wrong or out-of-date input data can produce an incorrect prediction. So can using irrelevant data points, or giving them too much weight in the overall consideration. Mistaking correlation for causation is yet another common pitfall. A related misjudgment is over-investing in analytic tools: companies often buy expensive, complex software that is far more sophisticated than their needs require.


Enterprises can avoid such pitfalls through careful analysis of the data, combined with modeling techniques that do not place too much significance on variables that may in fact be noise. Chief among these is ensembling: averaging several models rather than relying on only one, which reduces the significance of the predictors in any single model and has the effect of canceling out the noise.

The majority of predictive project failures come from data errors or data weighting mistakes, in which the wrong data elements are deemed important. This is true whether the project involves big data or small data. One simply cannot arrive at a correct answer if the starting dataset is flawed, nor if data noise is given the same importance as genuine signal.

Identifying the Right Patterns
Take the example of gifts, and the opportunity to analyze a shopper’s online purchasing behavior. A customer who orders gifts for other people does not have an enduring interest in those items or in related items. Failing to distinguish items ordered as gifts from items ordered for the customer’s own use can lead to the wrong conclusion about what the customer is likely to buy in the future. That error is compounded when the same gift-versus-self-purchase distinction is not made within a demographic or other defined group: the predicted behavior will be just as false for the group as it is for the individual.

The same can be said of a widow who used to purchase items for her now deceased husband, a mother who buys for children whose interests change as they grow up, or a man who needed snow blowers and sump pumps on the East Coast but, having moved to the West Coast, now finds surfboards and light clothing more useful.

Another ever-present hazard in predictive models is that correlations in a development dataset may be mere noise rather than predictors of future behavior. There are some well-known examples of chance correlations with political events. From 1952 to 1976, whenever an American League team won the World Series, a Republican won the U.S. presidency. A similar, more recent example involved the Washington Redskins and held until the election of 2004: from 1936 until then, every time the Redskins won their home game prior to a presidential election, the incumbent party won the presidency as well; conversely, when the team lost, so did the party in the White House. In 2004, the Redskins’ reliability as an election predictor crashed. The Redskins lost, but the incumbent, President George W. Bush, was re-elected anyway. In essence, correlation is not causation.

Separating Signal from Noise
There is a huge difference between acknowledging pitfalls and being able to spot them. You can never call out a true predictor with absolute certainty, but there are several approaches for avoiding noise variables and for lessening their impact when a modeling algorithm selects them anyway: ways to identify which variables are more likely to be meaningless, and ways to reduce the influence that the remaining “garbage” has on a model.

One way is to clean the data before you use it. But not all data cleaning methods are created equal. Some target only easy-to-detect noise and overlook data elements that require a more sophisticated approach to detection. You need cleaning methods that detect a larger, well-defined universe of noise.

If you define noise too narrowly, the filter misses most of the noise that is out there. Think of it this way: a person wearing noise-canceling headphones set only to cancel human voices will still hear other sounds, such as dogs barking and glass breaking. Sometimes the sound of breaking glass matters, say, when small children are under that person’s supervision; in other settings, such as a restaurant where breaking glass is of no consequence to what the person is doing, it is just background noise. By detecting and filtering out a larger but well-defined universe of noise, that is, more of the things that are not relevant to the situation or problem at hand, the focus remains on what is relevant. But with the headphones set on only one or two obvious noise settings, there will be plenty of noise left to deal with. The same is true of data noise detection and filtering.
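The idea of screening out variables that carry no signal can be sketched in a few lines. The example below is purely illustrative, not the author’s method: it uses a made-up `filter_noise` helper that drops candidate variables whose correlation with the target falls below a threshold, one simple way to define a “universe of noise” before modeling.

```python
import random

def correlation(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def filter_noise(candidates, target, threshold=0.3):
    """Keep only variables whose absolute correlation with the target
    clears the threshold; the rest are treated as noise and dropped."""
    return {name: xs for name, xs in candidates.items()
            if abs(correlation(xs, target)) >= threshold}

# Synthetic data: one variable tracks the target, one is pure noise.
random.seed(0)
target = [float(i) for i in range(100)]
candidates = {
    "signal": [t * 2 + random.gauss(0, 5) for t in target],
    "noise":  [random.gauss(0, 5) for _ in target],
}
kept = filter_noise(candidates, target)
```

A correlation screen like this catches only one narrow kind of noise, which is exactly the article’s point: a filter defined too narrowly (one “headphone setting”) lets other noise through.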

Prioritizing the Relevancy of Data
You can also use a signal taxonomy, an ordered classification of the data points, to prioritize relevancy. A clear classification strategy, plus a quality-layer overlay that further prioritizes the categories carrying more weight or meaning for the question at hand, should be used to filter the signals.

Classification strategies commonly used by organizations include classifying data according to confidentiality or security requirements. Another is to classify data according to use such as accounting, marketing, customer service and personal data, among others. Organizations use further classifications to narrow or cross-reference data for other uses and specific users.

A quality-layer overlay can accomplish more in this regard, for example by establishing the integrity and commercial value of the data. Combined, the layers establish the viability, usability and dependability of the data, and thus of the predictions built on it.
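A minimal sketch of how the two layers might combine, using invented field names, category weights and quality scores purely for illustration: each data point gets a relevance equal to its category’s weight times its quality score, and fields are ranked accordingly.

```python
# Hypothetical taxonomy: each field carries a category (the
# classification layer) and a quality score (the quality overlay).
taxonomy = {
    "purchase_history":  {"category": "behavioral",  "quality": 0.9},
    "page_views":        {"category": "behavioral",  "quality": 0.6},
    "self_reported_age": {"category": "demographic", "quality": 0.4},
}

# Assumed weights expressing which categories matter more for the
# question at hand (e.g. predicting the next purchase).
category_weight = {"behavioral": 1.0, "demographic": 0.5}

def relevance(field):
    """Combined priority: category weight times data-quality score."""
    meta = taxonomy[field]
    return category_weight[meta["category"]] * meta["quality"]

ranked = sorted(taxonomy, key=relevance, reverse=True)
```

Here high-quality behavioral data rises to the top while low-quality demographic data sinks, which is the effect the layered filtering is meant to produce.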

Using our example above of the widow who used to shop online for her deceased husband, layering personal data may reveal the death of her husband, or a layer of purchase dates cross-referenced with product types may show a marked cessation of certain purchases, indicating a loss of interest in that type of product.

Similarly, an overlay of shipping-address data may show that the gift buyer’s shipping address for an item falls outside her usual pattern, denoting a gift for another person rather than her own interest in the product. Or it could show that she sends products to that address on the same date every year, denoting a birthday, anniversary or other annual gift-giving occasion.
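The shipping-address heuristic can be illustrated with a small sketch. The `flag_gift_orders` helper and the sample orders below are invented for the example: any address accounting for less than half of a customer’s shipments is treated as a likely gift destination, so those purchases can be excluded from the customer’s own interest profile.

```python
from collections import Counter

def flag_gift_orders(orders, home_share=0.5):
    """Flag orders shipped to an address the customer rarely uses.

    `orders` is a list of (ship_to_address, item) tuples. Any address
    that accounts for less than `home_share` of all orders is treated
    as a likely gift destination rather than the customer's home.
    """
    counts = Counter(addr for addr, _ in orders)
    total = len(orders)
    return [(addr, item) for addr, item in orders
            if counts[addr] / total < home_share]

orders = [
    ("12 Oak St", "laptop stand"),
    ("12 Oak St", "desk lamp"),
    ("12 Oak St", "monitor"),
    ("88 Elm Ave", "toy train"),   # shipped elsewhere: likely a gift
]
gifts = flag_gift_orders(orders)
```

A production system would also check the annual-date pattern the article mentions, but even this crude share-of-shipments rule separates the toy train from the customer’s own purchases.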

An Ensemble of Predictive Scores
Another effective approach to building strong models in spite of the preponderance of data is to lessen the impact of any one variable by averaging, or “ensembling,” the scores of several models. Returning to the widow and the gift buyer above: an ensemble combines the predictions of a group of models rather than relying on the predictions of a single model. The resulting average tends to expose the truer prediction by de-emphasizing the widow’s or the gift buyer’s outlier purchases.

Within any individual model, the emphasis is on satisfying a certain performance threshold using as few inputs as possible.

However, we have found that most of the strongest models we have created use a large number of input variables and lessen the impact any one of them can have through averaging.
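In its simplest form, ensembling by averaging is just this. The three toy scoring functions and the input fields (`recency`, `frequency`, `spend`) are invented stand-ins for trained models, not anything from the article: each one leans on a different variable, so averaging keeps any single variable, noisy or not, from dominating the final score.

```python
def ensemble_score(models, record):
    """Average the scores of several models so that no single
    model's favored predictor dominates the final prediction."""
    scores = [m(record) for m in models]
    return sum(scores) / len(scores)

# Toy scoring functions standing in for separately trained models;
# each depends on a different input variable.
models = [
    lambda r: 0.9 * r["recency"],
    lambda r: 0.8 * r["frequency"],
    lambda r: 0.7 * r["spend"],
]

record = {"recency": 0.5, "frequency": 1.0, "spend": 0.2}
score = ensemble_score(models, record)
```

Real ensembles average the outputs of models built with different algorithms or data samples, but the mechanism, and the noise-canceling effect described above, is the same.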

Venkat Viswanathan is the CEO and founder of LatentView, a business analytics services firm. The company has offices in Princeton, N.J., Mumbai and Chennai, India.

