Interactivity, Non-normality, and Missing Values: How to Address Common Data Challenges

By Jacob Joseph  |  July 21, 2016

Editor’s Note: This is the third in a four-part series on improving analytics output with feature engineering. Click here to read all the entries in the series.

 

In Part 1 of this series, we looked at an overview of some popular tricks for feature engineering. In Part 2, we looked at a few of those tricks in greater detail. In this part, we will continue that deeper dive, with a closer look at more of the tricks outlined in Part 1.

The examples discussed in this article can be reproduced with the source code and data sets available here.

Transforming a Non-normal Distribution to Normal

A number of statistical analysis tools and techniques require normally distributed data. Many times, we may need to treat data to transform its distribution from non-normal to normal. A number of techniques, ranging from statistical tests to visual approaches, can be used to determine if the data is normally distributed. We shall use the histogram, which is an easy visual approach to check for normality (Figure 1).

Figure 1. Determining distribution normality with histograms.

 

A histogram indicates a normal distribution when it resembles a bell curve. The graph on the left in Figure 1 could be indicative of a normal distribution because its histogram resembles a bell, unlike the other graphs in Figure 1.
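For instance, here is a minimal Python sketch that pairs the histogram with a formal Shapiro-Wilk test (the sample is a hypothetical stand-in, and the article's own source code, linked above, may use different tooling):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical right-skewed sample; substitute the variable you want to check.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=0.6, size=500)

# Visual check: a roughly bell-shaped histogram suggests normality.
plt.hist(sample, bins=30, edgecolor="black")
plt.title("Histogram of sample")
plt.show()

# Formal check: Shapiro-Wilk test (null hypothesis: the data are normal).
w_stat, p_value = stats.shapiro(sample)
print(f"Shapiro-Wilk W = {w_stat:.4f}, p-value = {p_value:.4f}")
# A p-value below 0.05 is evidence against normality.
```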

Common ways to transform the distribution of data include taking its log, square root, or cube root. But iterating over all possible transformations can be extremely time-consuming. In such cases, a Box-Cox transformation is often the best bet.

Box-Cox. Statisticians George Box and David Cox developed a procedure to identify an appropriate exponent, Lambda (λ), to use to transform data into a “normal shape”:

y(λ) = (y^λ − 1) / λ  for λ ≠ 0;  y(λ) = log(y)  for λ = 0

The Lambda value indicates the power to which all data should be raised in order to transform it to normal. To find that power, the Box-Cox power transformation searches from Lambda = -5 to Lambda = +5 until the best value is found.

Let’s consider an example: building a linear regression model with a data set of 50 observations. Figure 2 contains the values graphed in Figure 3.

Figure 2. Sample data set and summary.

 

Figure 3. Histogram and scatter plot of sample data set.

 

The histogram of y in Figure 3 indicates a non-normal distribution, and the scatter plot indicates a relatively weak negative linear relationship between y and x.

Let’s now apply a Box-Cox transformation to create a new variable, z. Based on the above data, the best Lambda calculated is -0.5670. We will use this Lambda value to create the new variable z (Figure 4).

Figure 4. Histogram and scatter plot for the z variable.

 

As shown in Figure 4, the histogram of z is indicative of a normal distribution, and z has a stronger negative linear relationship with x.
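As a sketch of how this step might look in Python, scipy's Box-Cox implementation both searches for the best Lambda and applies the transformation (the data below is a hypothetical skewed stand-in; running the article's linked source code on the actual data set is what yields the Lambda of -0.5670):

```python
import numpy as np
from scipy import stats

# Hypothetical positive, right-skewed stand-in for the y variable in Figure 2.
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=0.8, size=50)

# boxcox() searches for the Lambda that maximizes the log-likelihood and
# returns the transformed variable z along with that Lambda. Note scipy's
# convention is z = (y**lam - 1) / lam for lam != 0 (and log(y) for lam == 0),
# a shifted and rescaled version of simply raising y to the power Lambda.
z, lam = stats.boxcox(y)
print(f"Best Lambda: {lam:.4f}")
```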

Let’s see which of the variables delivers a better linear model with x (Table 1).

Independent Variable | Dependent variable: y | Dependent variable: z
x                    | -0.011522***          | -0.022611***
Constant             | 1.291655              | 0.467314
Adjusted R²          | 0.4368                | 0.7584

* p < 0.05; ** p < 0.01; *** p < 0.001

Table 1. Linear regression model summary.


There seems to be a substantial improvement in Adjusted R², from 0.437 to 0.758, when the dependent variable is z rather than y. In this example, the Box-Cox transformation led to a huge improvement in the model.
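A comparison like the one in Table 1 can be produced by fitting both regressions and reading off the adjusted R², for example with statsmodels (a sketch; the data frame and variable names are assumptions standing in for the article's 50-observation data set, so the values will not match Table 1 exactly):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical stand-in for the 50-observation data set in Figure 2.
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=50)
y = np.exp(2.0 - 0.02 * x + rng.normal(0, 0.4, size=50))  # skewed response
z, lam = stats.boxcox(y)                                   # Box-Cox transform of y

df = pd.DataFrame({"x": x, "y": y, "z": z})
model_y = smf.ols("y ~ x", data=df).fit()
model_z = smf.ols("z ~ x", data=df).fit()

print(f"Adjusted R2 (y ~ x): {model_y.rsquared_adj:.4f}")
print(f"Adjusted R2 (z ~ x): {model_z.rsquared_adj:.4f}")
```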

Although Box-Cox transformations are immensely useful, they shouldn’t be used indiscriminately. Sometimes, we may not be able to achieve normality in data even after using a Box-Cox transformation, and any such transformed data should be thoroughly checked for normality.

Missing Data

Complete data is a rarity, and missing data is one of the most excruciating pain points for analysts. More often than not, even after exhausting all possible avenues to obtain the actual values, one is still left with missing data. In such cases, it is often preferable to infer the missing values.

Let’s look at an example to understand the importance of inferring missing values: With the information on Visits, Transactions, Operating System, and Gender, we need to build a model to predict Revenue. The data and summary appear in Table 2.

First 5 rows:

Visits | Transactions | OS      | Gender | Revenue
7      | 0            | Android | Male   | 0
20     | 1            | iOS     | NA     | 577
22     | 1            | iOS     | Female | 850
24     | 2            | iOS     | Female | 1050
1      | 0            | Android | Male   | 0

Data summary:

Visits:       Min. 0.00, 1st Qu. 6.00, Median 12.00, Mean 12.49, 3rd Qu. 19.00, Max. 25.00
Transactions: Min. 0.000, 1st Qu. 1.000, Median 1.000, Mean 0.993, 3rd Qu. 1.000, Max. 2.000, NA 1,800
OS:           Android 16,028, iOS 6,772
Gender:       Female 2,670, Male 14,730, NA 5,400
Revenue:      Min. 0.0, 1st Qu. 170.0, Median 344.7, Mean 454.9, 3rd Qu. 576.9, Max. 2,000.0

NA = missing observations

Table 2. Sample data set and summary.

We have a total of 7,200 missing data points (Transactions: 1,800, Gender: 5,400) out of 22,800 observations. Almost 8 percent and 24 percent of data points are missing for Transactions and Gender, respectively.
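A quick way to quantify these gaps, assuming the data has been loaded into a pandas DataFrame (the file name below is hypothetical; the actual data set is available with the article's source code):

```python
import pandas as pd

df = pd.read_csv("revenue_data.csv")  # hypothetical file name

# Count and percentage of missing values per column.
missing = df.isna().sum()
print(pd.DataFrame({"missing": missing,
                    "percent": (missing / len(df) * 100).round(1)}))
```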

We will build linear regression models to predict Revenue in two ways: by ignoring the missing data and by inferring the missing data.

Impute Gender by Decision Tree. There are several predictive techniques – statistical and machine learning – for imputing missing values. We will be using decision trees to impute the missing values of Gender. The variables used to impute it are Visits, OS, and Transactions.

Impute Transactions by Linear Regression. Using simple linear regression, we will impute Transactions, this time including the Gender values imputed by the decision tree. The variables used to impute it are Visits, OS, and Gender.
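Here is a sketch of this two-step imputation with scikit-learn. The column names follow Table 2, but the specific model settings, encoding choices, and file name are assumptions; the article's own source code, linked above, may differ.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

df = pd.read_csv("revenue_data.csv")  # hypothetical file name

# Predictors for Gender: Visits, OS, Transactions (OS one-hot encoded).
X_g = pd.get_dummies(df[["Visits", "Transactions", "OS"]], columns=["OS"])

# Step 1: impute Gender with a decision tree. For simplicity, this sketch
# trains and predicts only on rows where the predictors themselves are complete.
train_g = df["Gender"].notna() & X_g.notna().all(axis=1)
fill_g = df["Gender"].isna() & X_g.notna().all(axis=1)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_g[train_g], df.loc[train_g, "Gender"])
df.loc[fill_g, "Gender"] = tree.predict(X_g[fill_g])

# Step 2: impute Transactions with linear regression, using Visits, OS, and
# Gender (now including the values imputed in Step 1) as predictors.
X_t = pd.get_dummies(df[["Visits", "OS", "Gender"]], columns=["OS", "Gender"])
train_t = df["Transactions"].notna()
fill_t = df["Transactions"].isna()
reg = LinearRegression().fit(X_t[train_t], df.loc[train_t, "Transactions"])
df.loc[fill_t, "Transactions"] = reg.predict(X_t[fill_t])
```

With the gaps filled in, the Revenue model can then be fit on all 22,800 rows (Model B in Table 3) rather than only on the 15,600 complete rows available to Model A.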

Now that we have imputed the missing values, we can build the linear regression model and compare the results of the models built by ignoring the missing values and by inferring the missing values (Table 3).

Dependent variable: Revenue

Independent Variable | Model A (ignored missing values) | Model B (inferred missing values)
Transactions         | 418.273***                       | 405.619***
OS:iOS               | 243.864***                       | 243.300***
Gender:Male          | -238.319***                      | -240.786***
Constant             | 171.883***                       | 186.184***
Observations         | 15,600                           | 22,800
Adjusted R²          | 0.719                            | 0.776

* p < 0.05; ** p < 0.01; *** p < 0.001

Table 3. Linear regression model summary.

Visits is not included in the models because the variable is statistically insignificant at the 5 percent significance level. It is evident from the table that Model B is much better than Model A, because its Adjusted R² is much higher.

Imputation of missing values is a tricky subject, but imputing missing values with a predictive model is highly desirable because it can lead to better insights and better-performing predictive models.

Dummy Variables

When dealing with nominal categorical variables with more than two levels, like the levels of an operating system, it is better to create dummy variables – i.e., a separate variable for each level. This way, you get a different coefficient or multiplier factor for each level and avoid assuming any order among the levels of the categorical variable. To avoid creating redundancy in the data, the trick is to create one fewer dummy variable than the number of levels in the categorical variable.
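In Python, for example, pandas' get_dummies with drop_first=True applies exactly this "one fewer than the number of levels" rule (a minimal sketch; the data below is a hypothetical stand-in):

```python
import pandas as pd

# Hypothetical frame with a three-level nominal variable.
df = pd.DataFrame({"OS": ["Android", "iOS", "Windows", "iOS", "Android"]})

# Three levels produce two dummy columns; the dropped first level ("Android")
# becomes the baseline that the other coefficients are measured against.
dummies = pd.get_dummies(df["OS"], prefix="OS", drop_first=True)
df = pd.concat([df, dummies], axis=1)
print(df.columns.tolist())  # ['OS', 'OS_Windows', 'OS_iOS']
```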

In Part 2, we built a better model by binning the numerical variable Age. We can improve this model further by creating dummy variables for Age Group and leave out any statistically insignificant variables (Table 4).

Dependent variable: Interact

Independent Variable      | Model A – Without Dummy | Model B – With Dummy
Age                       | -0.062*                 | -0.03776**
Age Group: >= 21 & < 42   | 1.596*                  |
Age Group: >= 42          | 1.236                   |
OS:iOS                    | 2.534***                |
>= 21 & < 42 (Dummy)      |                         | 1.05887*
iOS (Dummy for OS)        |                         | 2.40673***
Constant                  | 0.696                   | 0.46429
Observations              | 165                     | 165
Residual Deviance         | 149.93                  | 150.97
AIC                       | 159.93                  | 158.97

* p < 0.05; ** p < 0.01; *** p < 0.001

Table 4. Logistic regression model summary based on two sets of dependent variables.

Instead of using the categorical variable Age Group, we have created two new variables: “>= 21 and < 42” and “>= 42.” We don’t need to create a “< 21” variable because, if a user falls in neither of those two groups, the user must be under 21.

If you look closely at the coefficients in Model A, you notice that users in the age group >= 42 have a positive coefficient for Interact – that is, users in that age group have a higher probability of interacting than the base-level users in the age group < 21. This seems counter-intuitive, as interaction with the app drops substantially for users age 42 and older (Table 5).

Interact | < 21 | >= 21 and < 42 | >= 42
0        | 10   | 22             | 33
1        | 31   | 60             | 9

Table 5. App usage by age group.

This shortcoming gets addressed in the model built with dummy variables. The dummy variable created for users in the age group >= 42 is not included in Model B, as it is statistically insignificant.

Let’s look at another example, in which the task is to predict Revenue from Gender, OS, Visits, and Age Group. A data set and summary appear in Table 6.

First 5 rows:

Gender | OS      | Visits | Age Group | Revenue
Male   | Android | 6      | B         | 130.45
Female | Android | 7      | D         | 79.86
Male   | Windows | 12     | B         | 398.17
Female | iOS     | 8      | D         | 221.91
Female | Windows | 8      | D         | 114.27

Data summary:

Gender:    Female 29, Male 71
OS:        Android 29, Windows 38, iOS 33
Visits:    Min. 1.00, 1st Qu. 6.00, Median 8.00, Mean 7.79, 3rd Qu. 10.00, Max. 14.00
Age Group: A 26, B 27, C 20, D 27
Revenue:   Min. 0.0, 1st Qu. 134.0, Median 201.7, Mean 219.6, 3rd Qu. 291.7, Max. 527.9

Table 6. Sample data set and summary.

We will build two models: Model A, with no dummy variables for OS and Age Group, and Model B, with dummy variables for OS and Age Group (Table 7).

Dependent variable: Revenue

Independent Variable     | Model A – Without Dummy | Model B – With Dummy
Visits                   | 33.1408***              | 33.057***
Gender:Male              | 58.983***               |
OS:Windows               | 3.945                   |
OS:iOS                   | 111.662***              |
Age Group:B              | 47.081***               |
Age Group:C              | -2.332                  |
Age Group:D              | -61.1***                |
iOS (Dummy for OS)       |                         | 107.660***
B (Dummy for Age Group)  |                         | 47.085***
Constant                 | -72.979***              | -128.079***
Adjusted R²              | 0.885                   | 0.8868

* p < 0.05; ** p < 0.01; *** p < 0.001

Table 7. Linear regression model summary for models with and without dummy variables.

The Adjusted R² shows a slight improvement for Model B but, more importantly, Model B requires fewer variables than Model A.

Dummy variables can improve the robustness of a model by reducing the risk of overfitting, because fewer variables are needed to build the model while model performance is maintained or improved.

Interaction

A typical linear regression, assuming we have two independent variables, is of the form

y = β₀ + β₁x₁ + β₂x₂

where y = dependent variable, x₁ and x₂ = independent variables, β₁ = coefficient for x₁, β₂ = coefficient for x₂, and β₀ = the constant (intercept).

Each coefficient indicates the effect of the corresponding independent variable on the dependent variable while keeping all other variables constant. But what if we want to understand the joint effect of two or more independent variables on the dependent variable? We can achieve this by adding an interaction term, as shown below:

y = β₀ + β₁x₁ + β₂x₂ + β₃(x₁ · x₂)

where β₃ = coefficient for the interaction term x₁ · x₂.
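With a formula interface such as the one in statsmodels, the interaction term can be written directly in the model formula. Here is a sketch for the Age and OS example discussed next (the data below is a hypothetical stand-in for the 165-user data set, so the coefficients will not match Table 8):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the binning example's 165 users.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Age": rng.integers(16, 60, size=165),
    "OS": rng.choice(["Android", "iOS"], size=165),
})
p = 1 / (1 + np.exp(-(0.7 - 0.03 * df["Age"] + 2.0 * (df["OS"] == "iOS"))))
df["Interact"] = rng.binomial(1, p)

# "Age * OS" expands to Age + OS + Age:OS, i.e., both main effects plus the
# interaction term, whose coefficient plays the role of beta3 above.
model = smf.logit("Interact ~ Age * OS", data=df).fit()
print(model.summary())        # the Age:OS[T.iOS] row is the interaction term
print("AIC:", round(model.aic, 2))
```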

Let’s look at the same binning example to understand interaction. We will build our model by adding an interaction term between Age and OS and compare it with the default model (Table 8).

Dependent variable: Interact

Independent Variable            | Model A – Without Interaction | Model B – With Interaction
Age                             | -0.049***                     | -0.02708
OS:iOS                          | 2.150***                      | 8.14033**
OS:iOS × Age (Interaction Term) |                               | -0.20189*
Constant                        | 1.483**                       | 0.65684
Observations                    | 165                           | 165
Residual Deviance               | 157.2                         | 144.77
AIC                             | 163.2                         | 152.77

* p < 0.05; ** p < 0.01; *** p < 0.001; . p < 0.1

Table 8. Logistic regression model summary with and without interaction terms.

Model B, with the interaction term, shows a good improvement over Model A, as can be observed from the lower Residual Deviance and AIC in Table 8. It shows that Age, jointly with OS, has a statistically significant relationship with Interact.


 

Jacob Joseph is a Senior Data Scientist for CleverTap, a digital analytics, user engagement, and personalization platform, where he plays an integral role in leading the data science team. His role encompasses deriving key actionable business insights and applying machine learning algorithms to augment CleverTap’s effort to deliver world-class, real-time analytics to its 2,500 customers worldwide. He can be reached at jacob@clevertap.com.

 
