In Part 1 of this series, we looked at some of the popular tricks for feature engineering and a broad overview of each. In this part, we will look at the first three tricks in detail.
Standardizing Numerical Variables
Standardization is a popular pre-processing step in data preparation. It rescales all variables to a common scale (typically zero mean and unit variance) so that machine-learning algorithms give equal importance to every variable rather than weighting variables by their scale.
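As a quick sketch, z-score standardization subtracts each variable's mean and divides by its standard deviation. The snippet below illustrates the idea in Python; the variable name and values are made up, not taken from the wine data:

```python
import numpy as np

def standardize(x):
    """Z-score a variable: subtract the mean, divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Illustrative raw measurements on an arbitrary scale.
alcohol = np.array([12.8, 13.2, 13.9, 14.1])
z = standardize(alcohol)
```

After standardization, every variable has mean 0 and standard deviation 1, so no single variable dominates a distance-based algorithm such as K-means.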
Let’s consider an example with K-means clustering, a popular data mining and unsupervised-learning technique. We will work with publicly available wine data from the UCI Machine Learning Repository. The dataset contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample. A sample data set is shown in Table 1, and a summary of the data is offered in Figure 1.
Table 1. Sample data set of chemical analyses of wine.
As can be observed from the summary, the variables aren’t on the same scale. To identify the three types of wine (see “Type”), we will cluster the data using K-means clustering, with and without bringing the variables on the same scale.
A wide variety of indices have been proposed for finding the optimal number of clusters when partitioning data. We will use NbClust, a popular R package that computes up to 30 indices for determining the best number of clusters over a given range. The cluster count chosen by the largest number of indices is taken as the ideal one. We will evaluate cluster counts from two to 15 and select the best with the help of NbClust.
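NbClust itself is an R package. As a rough analogue, the sketch below (Python, with scikit-learn assumed to be available) scores candidate cluster counts from 2 to 15 using a single index, the silhouette score, and picks the winner. The data here are toy blobs standing in for the wine measurements, not the actual dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy stand-in for the wine data: three groups, two variables on very
# different scales.
X = np.vstack([
    rng.normal([1.0, 100.0], [0.2, 5.0], size=(50, 2)),
    rng.normal([3.0, 130.0], [0.2, 5.0], size=(50, 2)),
    rng.normal([5.0, 160.0], [0.2, 5.0], size=(50, 2)),
])
X_std = StandardScaler().fit_transform(X)

def best_k(data, k_range=range(2, 16)):
    """Return the cluster count with the highest silhouette score."""
    scores = {k: silhouette_score(
                  data,
                  KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data))
              for k in k_range}
    return max(scores, key=scores.get)

k = best_k(X_std)
```

NbClust aggregates many such indices and takes a majority vote; the single-index version above is only meant to show the shape of the procedure.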
Without standardization. Figure 2 shows that two is the ideal cluster count, as the majority of the indices proposed two. But we know from the data that there are three types of wine, so we can easily reject this result.
With standardization. We will standardize the wine data using z-scores, ignoring the “Type” column. Figure 3 shows the data summary after standardization:
Now let’s run the NbClust algorithm to estimate the ideal number of clusters (Figure 4).
Figure 4 shows that three is the ideal number of clusters, as the majority of the indices proposed three. This clustering looks promising, given that there are in fact three types of wine.
Evaluating the formed clusters. We will run the K-means algorithm with the cluster count suggested by NbClust. We will use a confusion matrix to evaluate the classification produced by the clustering. A confusion matrix is a tabular representation of actual values (wine type from the data) vs. predicted values (clusters). The off-diagonal cells (in orange) represent misclassified observations, and the diagonal cells (in blue) represent correctly classified observations (Figure 5).
In an ideal scenario, all the observations for a given wine type would fall into a single cluster. In the matrix above, six (3 + 3) Type 2 observations are misclassified: three fall in Cluster 1 and three in Cluster 3, instead of Cluster 2.
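The confusion matrix itself is simple to build. The sketch below (Python, pandas assumed) reproduces the pattern described above, assuming all remaining observations are clustered correctly; the wine dataset has 59, 71, and 48 samples of Types 1, 2, and 3:

```python
import pandas as pd

# Actual wine types vs. predicted cluster labels, mirroring Figure 5:
# six Type 2 wines are misclassified (three each into Clusters 1 and 3).
actual    = [1] * 59 + [2] * 71 + [3] * 48
predicted = [1] * 59 + [1] * 3 + [2] * 65 + [3] * 3 + [3] * 48

cm = pd.crosstab(pd.Series(actual, name="Type"),
                 pd.Series(predicted, name="Cluster"))
correct = cm.values.trace()          # diagonal: correctly classified
misclassified = len(actual) - correct
```

`pd.crosstab` tabulates actual vs. predicted labels directly; the diagonal sum gives the number of correctly classified observations.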
With standardized data, the best cluster size predicted was more accurate than non-standardized data. Additionally, the classification performance based on the clustering with the standardized data was extremely encouraging.
Binning/Converting Numerical to Categorical Variable
Converting numerical variables to categorical ones is another useful feature-engineering technique. It not only helps in interpreting and visualizing the numerical variable, but can also add a feature that improves the performance of the predictive model by reducing noise or non-linearity.
Let’s look at an example in which we have data on the age of app users and whether they have interacted with the app during a particular time period. Table 2 contains the first 11 rows of the data and a summary of the data.
|Age||OS||Interact|
|Min. :16.00||Android: 98||0: 65|
|1st Qu.:21.00||iOS: 67||1: 100|
Table 2. Data rows and summary of app-user data.
We have 165 users between 16 and 79 years old. Of these users, 98 are on Android and 67 on iOS. The 1 and 0 values of the “Interact” variable refer to users who have interacted with the app frequently and occasionally, respectively.
We need to build a model to predict whether a user will interact with an app based on the above information. We will use two approaches: one in which we take Age as it is, and another in which we create an additional variable by grouping, or binning, the ages into buckets. Though we could use several methods, such as domain expertise, visualization, and predictive models, to bin “Age,” we will use a percentile-based approach (Table 3).
|1st Quartile||2nd Quartile||3rd Quartile||Mean||Min.||Max.|
Table 3. The Age variable binned based on a percentile approach.
Based on the above table, 25 percent of the users are younger than 21, 50 percent are between 21 and 42, and the remaining 25 percent are older than 42. We will use these breakpoints to bin the users into age-group buckets and create a new variable, “Age Group” (Table 4).
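Percentile-based binning like this can be sketched as follows (Python, pandas assumed; the ages are simulated, not the actual user data, so the cut points will differ from 21 and 42):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Simulated ages for 165 users in the observed 16-79 range.
age = pd.Series(rng.integers(16, 80, size=165), name="Age")

# Cut at the 25th and 75th percentiles, mirroring the article's breakpoints.
age_group = pd.qcut(age, q=[0, 0.25, 0.75, 1.0],
                    labels=["younger", "middle", "older"])
```

`pd.qcut` computes the quantile breakpoints and assigns each user to a bucket in one step; with the article’s data the resulting buckets are younger than 21, between 21 and 42, and older than 42.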
|Age Group Summary|
|Younger than 21||Between 21 and 42||Older than 42|
Table 4. New variable based on binning app users’ ages.
Now that we have binned the users’ ages, let’s build a model to predict whether users will interact with the app. Since the dependent variable is binary, we will use logistic regression (Table 5).
|Logistic Regression Model Summary|
|Independent Variable||Dependent variable:|
|Model A||Model B|
|AgeGroup: >=21 and < 42||1.596*|
|* p < 0.05; ** p < 0.01; *** p < 0.001|
Table 5. Model summary based on two sets of independent variables.
Model A has taken only Age and OS as the independent variables, whereas Model B has taken Age, Age Group, and OS as the independent variables.
Model discrimination. There are various metrics for discriminating between logistic regression models, such as residual deviance, log likelihood, AIC, SC, and AUC. For the sake of simplicity, we will look only at residual deviance and AIC (Akaike Information Criterion); for both, lower is better. Based on the results for AIC and residual deviance, Model B in Table 5 appears to be the better of the two models. Binning has proved useful in this case because the binned variable is statistically significant in the model (indicated by *, assuming a 5 percent cutoff for the p-value). The binned variable is able to capture part of the non-linearity in the relationship between Age and Interact. Model B can be improved further with the help of dummy variables, which are discussed below.
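For reference, AIC penalizes model complexity: AIC = 2k − 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. A minimal sketch of the comparison (Python; the log-likelihood figures are hypothetical, not taken from Table 5):

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2*ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fitted log-likelihoods for two logistic models.
aic_a = aic(log_likelihood=-105.0, n_params=3)  # e.g. intercept + Age + OS
aic_b = aic(log_likelihood=-98.0, n_params=5)   # adds two Age Group dummies
better = "Model B" if aic_b < aic_a else "Model A"
```

Note that the extra Age Group parameters must buy a large enough improvement in likelihood to overcome the 2k penalty, which is exactly what the comparison in Table 5 tests.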
Reducing Levels in Categorical Variables
Rationalizing the levels or attributes in categorical variables could lead to better models and computational efficiency. Consider the example below, detailing the number of app launches by city (Table 6).
Table 6. Number of app launches by city.
Table 6 shows that 95 percent of the app launches took place in six of the 20 cities. We can combine the remaining 14 cities, which account for the other 5 percent, into one level named “Others.”
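Pooling low-frequency levels can be sketched as follows (Python, pandas assumed; the city names, counts, and the 5-percent-per-city rule are made up for illustration, not the Table 6 figures):

```python
import pandas as pd

# Made-up launch counts: six frequent cities plus a long tail.
launches = pd.Series({
    "City A": 4000, "City B": 3500, "City C": 3000, "City D": 2500,
    "City E": 2000, "City F": 1900,
    "City G": 150, "City H": 120, "City I": 100, "City J": 80,
})

share = launches / launches.sum()
keep = share[share >= 0.05].index          # cities above a 5 percent share
pooled = launches.copy()
pooled.index = [c if c in keep else "Others" for c in launches.index]
pooled = pooled.groupby(level=0).sum()     # collapse the tail into "Others"
```

The threshold is a modeling choice; a frequency table like Table 6 is usually the starting point for picking it.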
Let’s look at the states where these cities are located and determine if a new category, “States,” could group all the cities (Table 7).
Table 7. Cities where apps were launched, grouped by state.
As indicated by Table 7, the information related to the cities could be summarized by the corresponding six states.
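The second approach amounts to a lookup from each city to its state (Python; the mapping below is a hypothetical example, not the Table 7 data):

```python
# Hypothetical city-to-state lookup; in practice this comes from reference data.
city_to_state = {
    "San Francisco": "California", "Los Angeles": "California",
    "Austin": "Texas", "Dallas": "Texas",
    "New York City": "New York", "Seattle": "Washington",
}

cities = ["San Francisco", "Austin", "Los Angeles", "Seattle"]
states = [city_to_state[c] for c in cities]
```

Here six city levels collapse to four state levels; with the article's 20 cities, the new State variable has only six levels.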
In this example, we reduced the number of categorical levels in two ways: by combining the cities with a low number of app launches based on a frequency distribution, and by creating a new variable, State, using domain logic. Reducing levels makes the data less computationally expensive and easier to visualize, which in turn helps make better sense of the data and can reduce the danger of overfitting. An overfit model fits the in-sample data used to build it well, but may fit out-of-sample, or new, data poorly.
Jacob Joseph is a Senior Data Scientist for CleverTap, a digital analytics, user engagement, and personalization platform, where he helps lead the data science team. His role encompasses deriving key actionable business insights and applying machine learning algorithms to augment CleverTap’s effort to deliver world-class, real-time analytics to its 2,500 customers worldwide. He can be reached at firstname.lastname@example.org.