Enhance Machine Learning with Standardizing, Binning, Reducing

By Jacob Joseph   |   July 20, 2016

Editor’s Note: This is the second in a four-part series on improving analytics output with feature engineering.

 

In Part 1 of this series, we surveyed some popular tricks for feature engineering, with a broad overview of each. In this part, we will examine the first three tricks in detail.

Standardizing Numerical Variables

Standardization is a popular pre-processing step in data preparation. It brings all the variables onto the same scale so that your machine-learning algorithms give equal weight to every variable, rather than favoring those that happen to be measured on larger scales.
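The most common recipe is z-score standardization, which rescales a variable to mean 0 and standard deviation 1. A minimal illustration in R (the toy vector here is ours, purely for demonstration):

```r
# Z-score: subtract the mean, divide by the standard deviation.
x <- c(10, 20, 30, 40, 50)
z <- (x - mean(x)) / sd(x)
round(z, 2)   # -1.26 -0.63  0.00  0.63  1.26
```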

Let’s consider an example with K-means clustering, a popular data mining and unsupervised-learning technique. We will work with publicly available wine data from the UCI Machine Learning Repository. The dataset contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample. A sample data set is shown in Table 1, and a summary of the data is offered in Figure 1.

Type  Alcohol  Malic  Ash   Alcalinity  Magnesium  Phenols  Flavanoids  Nonflavanoids
1     14.23    1.71   2.43  15.6        127        2.80     3.06        0.28
1     13.20    1.78   2.14  11.2        100        2.65     2.76        0.26
1     13.16    2.36   2.67  18.6        101        2.80     3.24        0.30
1     14.37    1.95   2.50  16.8        113        3.85     3.49        0.24
1     13.24    2.59   2.87  21.0        118        2.80     2.69        0.39
1     14.20    1.76   2.45  15.2        112        3.27     3.39        0.34
1     14.39    1.87   2.45  14.6         96        2.50     2.52        0.30
1     14.06    2.15   2.61  17.6        121        2.60     2.51        0.31
1     14.83    1.64   2.17  14.0         97        2.80     2.98        0.29
1     13.86    1.35   2.27  16.0         98        2.98     3.15        0.22

Table 1. Sample data set of chemical analyses of wine.

 

Figure 1. Summary of wine sample data set.

 


As can be observed from the summary, the variables are not on the same scale. To identify the three types of wine (see “Type”), we will cluster the data using K-means, both with and without bringing the variables onto the same scale.

A wide variety of indices have been proposed for finding the optimal number of clusters when partitioning data. We will use NbClust, a popular R package that provides up to 30 indices for determining the ideal number of clusters over a given range. The cluster count chosen by the largest number of indices is taken as the ideal one. We will evaluate cluster counts from two to 15 and select the ideal number with the help of NbClust.
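A sketch of this setup in R. The file path is the UCI repository’s standard location for the wine data, the column names follow Table 1 plus the remaining UCI attributes, and object names such as `wine` and `nb_raw` are ours:

```r
library(NbClust)

# Read the wine data from the UCI repository and attach column names.
wine <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",
                 header = FALSE)
names(wine) <- c("Type", "Alcohol", "Malic", "Ash", "Alcalinity", "Magnesium",
                 "Phenols", "Flavanoids", "Nonflavanoids", "Proanthocyanins",
                 "Color", "Hue", "Dilution", "Proline")
summary(wine)

# Evaluate cluster counts from 2 to 15 on the raw, unstandardized
# variables, ignoring the "Type" label.
set.seed(42)
nb_raw <- NbClust(wine[, -1], distance = "euclidean",
                  min.nc = 2, max.nc = 15, method = "kmeans")
```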

Without standardization. Figure 2 shows that two is the ideal number of clusters, as the majority of the indices proposed two. But we know from the data that there are three types of wine, so we can easily reject this result.

 

Figure 2. Number of clusters chosen by 26 indices.

 

With standardization. We will standardize the wine data using z-scores, ignoring the “Type” column. Figure 3 shows the data summary after standardization:

 

Figure 3. Summary of wine data, post standardization.
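The standardization step itself is a one-liner in R with scale(), which subtracts each column’s mean and divides by its standard deviation (a sketch continuing the snippet above; `wine_std` is our name):

```r
# Standardize every chemical measurement; the "Type" label is excluded.
wine_std <- scale(wine[, -1])
summary(wine_std)
```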

 

Now let’s re-run the NbClust algorithm on the standardized data to estimate the ideal number of clusters (Figure 4).
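Continuing the sketch, the same NbClust call on the standardized data:

```r
# Re-evaluate cluster counts from 2 to 15, now on standardized variables.
set.seed(42)
nb_std <- NbClust(wine_std, distance = "euclidean",
                  min.nc = 2, max.nc = 15, method = "kmeans")
```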

 

Figure 4. Number of clusters chosen by 26 indices, post standardization.

 

Figure 4 shows that three is the ideal number of clusters, as the majority of the indices proposed three. This clustering looks promising, considering the fact that there are three types of wine.

Evaluating the formed clusters. We will run the K-means algorithm with the cluster count provided by NbClust, and use a confusion matrix to evaluate the performance of the resulting classification. A confusion matrix is a tabular representation of actual values (wine type from the data) vs. predicted values (clusters). The off-diagonal cells (in orange) represent misclassified observations, and the diagonal cells (in blue) represent correctly classified observations (Figure 5).
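A sketch of this evaluation step, with k = 3 as suggested by NbClust (`km` is our name):

```r
# K-means with k = 3 on the standardized data; nstart restarts the
# algorithm from several random initializations and keeps the best fit.
set.seed(42)
km <- kmeans(wine_std, centers = 3, nstart = 25)

# Confusion matrix: actual wine type vs. assigned cluster.
table(Actual = wine$Type, Cluster = km$cluster)
```

Note that K-means numbers its clusters arbitrarily, so the rows and columns may need to be matched up before reading off the diagonal.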

 

Figure 5. A confusion matrix showing the performance of clustering of wine data.

 

In an ideal scenario, all the observations corresponding to a wine type would belong to only one of the three clusters. In the above matrix, there are six (3 + 3) misclassified observations, all of Type 2 wine: three fall in Cluster 1 and three in Cluster 3, instead of Cluster 2.

With standardized data, the predicted cluster count matched the true number of wine types, unlike with the non-standardized data. Additionally, the classification performance based on clustering the standardized data was extremely encouraging.

Binning/Converting Numerical to Categorical Variables

Converting numerical variables to categorical ones is another useful feature-engineering technique. It not only helps with interpreting and visualizing the numerical variable, but can also add a feature that improves the performance of the predictive model by reducing noise or non-linearity.

Let’s look at an example in which we have data on the age of app users and whether they have interacted with the app during a particular time period. Table 2 contains the first 11 rows of the data and a summary.

 

Age   OS        Interact
18    iOS       1
23    iOS       1
20    Android   0
22    Android   0
21    Android   0
16    iOS       1
21    Android   1
79    iOS       0
16    iOS       0
22    Android   1
24    iOS       1

Data Summary
      Age              OS          Interact
 Min.   :16.00    Android:98     0: 65
 1st Qu.:21.00    iOS    :67     1:100
 Median :29.00
 Mean   :33.63
 3rd Qu.:42.00
 Max.   :79.00

Table 2. Data rows and summary of app-user data.

 

We have 165 users between 16 and 79 years old. Of these users, 98 are on Android and 67 on iOS. For the “Interact” variable, 1 and 0 refer to users who have interacted with the app frequently and occasionally, respectively.

We need to build a model to predict whether a user will interact with the app based on the above information. We will use two approaches: one in which we take age as it is, and another in which we create an additional variable by grouping, or binning, ages into buckets. Though we could bin “Age” by several methods, such as domain expertise, visualization, and predictive models, we will use a percentile-based approach (Table 3).

 

Age Summary
1st Quartile   2nd Quartile   3rd Quartile   Mean    Min.   Max.
21             29             42             33.63   16     79

Table 3. Percentile summary of the Age variable, used to define the bins.

 

Based on the above table, 25 percent of users are younger than 21, 50 percent are between 21 and 42, and the remaining 25 percent are older than 42. We will use these breakpoints to bin the users into age-group buckets and create a new variable, “Age Group” (Table 4; a code sketch follows the table).

 

Age Group Summary
Younger than 21   Between 21 and 42   Older than 42
41                82                  42

Table 4. New variable based on binning app users’ ages.
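A sketch of the binning step in R, assuming the app data lives in a data frame we will call `app`, with columns Age, OS, and Interact (the frame name is ours, not from the original):

```r
# Bin Age at its 25th and 75th percentiles (21 and 42 for this data).
# How the boundary value 21 is assigned (first vs. second bucket) is a choice.
breaks <- quantile(app$Age, probs = c(0, 0.25, 0.75, 1))
app$AgeGroup <- cut(app$Age,
                    breaks = breaks,
                    labels = c("Younger than 21", "Between 21 and 42",
                               "Older than 42"),
                    include.lowest = TRUE)
table(app$AgeGroup)
```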

 

Now that we have binned the users’ ages, let’s build a model to predict whether users will interact with the app. Because the dependent variable is binary, we will use logistic regression (Table 5).

 

Logistic Regression Model Summary
Dependent variable: Interact

Independent Variable       Model A      Model B
Age                        -0.049***    -0.062*
AgeGroup: >=21 and <42                   1.596*
AgeGroup: >42                            1.236
OS: iOS                     2.150***     2.534***
Constant                    1.483**      0.696
Observations                165          165
Residual Deviance           157.2        149.93
AIC                         163.2        159.93

* p < 0.05; ** p < 0.01; *** p < 0.001

Table 5. Model summary based on two sets of independent variables.

 

Model A takes only Age and OS as the independent variables, whereas Model B takes Age, Age Group, and OS.
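Both models can be fit with glm(), a sketch using the same hypothetical `app` frame from above:

```r
# Model A: Age and OS only.
model_a <- glm(Interact ~ Age + OS, data = app, family = binomial)

# Model B: Age, the binned AgeGroup, and OS.
model_b <- glm(Interact ~ Age + AgeGroup + OS, data = app, family = binomial)

summary(model_a)        # coefficients and residual deviance
summary(model_b)
AIC(model_a, model_b)   # lower AIC indicates the better model
```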

Model discrimination. There are various metrics for discriminating between logistic regression models, such as residual deviance, log likelihood, AIC, SC, and AUC. For the sake of simplicity, we will look only at residual deviance and AIC (Akaike Information Criterion); for both, lower numbers indicate a better model. Based on the results for AIC and residual deviance, Model B in Table 5 appears to be the better of the two. Binning proved useful in this case because the variable formed by binning is statistically significant in the model (indicated by *, assuming a 5 percent cutoff for the p-value). The binned variable is able to capture some of the non-linearity in the relationship between Age and Interact. Model B can be improved further with the help of dummy variables, which we will discuss later in this series.

Reducing Levels in Categorical Variables

Rationalizing the levels, or attributes, of categorical variables can lead to better models and greater computational efficiency. Consider the example below, detailing the number of app launches by city (Table 6).

 

     City           App Launches   Percentage (%)   Cumulative (%)
 1   NewYork_City   23,878,145     25               25
 2   LosAngeles     20,057,642     21               46
 3   SanDiego       17,192,264     18               64
 4   SanFrancisco   14,326,887     15               79
 5   Arlington       9,551,258     10               89
 6   Houston         6,208,318     6.5              95.5
 7   Philadelphia    1,528,201     1.6              97.1
 8   Phoenix         1,146,151     1.2              98.3
 9   Chandler          764,101     0.8              99.1
10   Dallas            334,294     0.35             99.45
11   Austin            171,923     0.18             99.63
12   Jacksonville      105,064     0.11             99.74
13   Riverside          66,859     0.07             99.81
14   Pittsburgh         57,308     0.06             99.87
15   Mesa               38,205     0.04             99.91
16   Miami              28,654     0.03             99.94
17   FortWorth          23,878     0.02             99.96
18   Irvine             19,103     0.02             99.98
19   Tampa               9,551     0.01             99.99
20   Fresno              4,776     0.01             100

Table 6. Number of app launches by city.

 

Table 6 shows that 95.5 percent of the app launches took place in just six of the 20 cities. We can combine the remaining 14 cities, which together account for less than 5 percent of launches, into one level named “Others.”
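A sketch of this first approach, assuming a hypothetical data frame `launches` with one row per app launch and a City column (the names are ours):

```r
# Keep the six cities that cover 95.5 percent of launches; collapse the rest.
top_cities <- c("NewYork_City", "LosAngeles", "SanDiego",
                "SanFrancisco", "Arlington", "Houston")
launches$CityReduced <- factor(ifelse(launches$City %in% top_cities,
                                      as.character(launches$City),
                                      "Others"))
table(launches$CityReduced)
```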

Let’s also look at the states where these cities are located and determine whether a new variable, “State,” could group all the cities (Table 7).

 

    State          App Launches
1   Arizona         1,948,457
2   California     51,667,531
3   Florida           143,269
4   NewYork        23,878,145
5   Pennsylvania    1,585,509
6   Texas          16,289,671

Table 7. App launches grouped by the cities’ states.

 

As indicated by Table 7, the information related to the cities could be summarized by the corresponding six states.
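A sketch of this second approach, a city-to-state lookup followed by aggregation, continuing with the hypothetical `launches` frame:

```r
# Map each city to its state with a named lookup vector.
city_to_state <- c(
  NewYork_City = "NewYork",
  LosAngeles = "California", SanDiego = "California",
  SanFrancisco = "California", Riverside = "California",
  Irvine = "California", Fresno = "California",
  Arlington = "Texas", Houston = "Texas", Dallas = "Texas",
  Austin = "Texas", FortWorth = "Texas",
  Philadelphia = "Pennsylvania", Pittsburgh = "Pennsylvania",
  Phoenix = "Arizona", Chandler = "Arizona", Mesa = "Arizona",
  Jacksonville = "Florida", Miami = "Florida", Tampa = "Florida"
)
launches$State <- factor(city_to_state[as.character(launches$City)])

# Launches per state, as in Table 7.
table(launches$State)
```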

In this example, we reduced the categorical levels in two ways: by combining the cities with a low number of app launches, using a frequency distribution, and by creating a new variable, State, using domain knowledge. Reducing the levels makes the data computationally less expensive and easier to visualize. That, in turn, helps us make better sense of the data and could also reduce the danger of overfitting, which produces a good fit on the data used to build the model (in-sample data) but may fit out-of-sample, or new, data poorly.


 

Jacob Joseph is a Senior Data Scientist at CleverTap, a digital analytics, user engagement, and personalization platform, where he helps lead the data science team. His role encompasses deriving key actionable business insights and applying machine learning algorithms to augment CleverTap’s effort to deliver world-class, real-time analytics to its 2,500 customers worldwide. He can be reached at jacob@clevertap.com.

 
