In the previous parts of this series, we looked at an overview of some popular tricks for feature engineering, and examined those tricks in greater detail. In this part, we continue our closer examination of these approaches with a deeper dive into the final techniques described in Part 1. The examples discussed in this article can be reproduced with the source code and datasets available here.
As an analyst, you savor the scenario in which you have a lot of data. But a lot of data brings the added complexity of analyzing and making sense of it. Often, the variables within the data are correlated, and analyses or models built on such untreated data may yield misleading conclusions or a model that overfits.
Analysts commonly employ dimension-reduction techniques to create new variables, fewer in number than the original variables, that explain the data set. For example, a data set might have 1,000 variables, but dimension reduction could create 50 new variables that are able to explain the original data quite well.
Dimension-reduction techniques are frequently used in image and video processing and, more generally, wherever you deal with a very large number of variables. They include hand-engineered feature-extraction methods such as SIFT, VLAD, HOG, GIST, and LBP, as well as methods that derive features that are discriminative in the given context, such as PCA, ICA, sparse coding, autoencoders, and restricted Boltzmann machines.
A popular technique called Principal Component Analysis (PCA) is used to emphasize variation and bring out strong patterns in a dataset. Let’s explore PCA visually in two dimensions before proceeding toward a multidimensional dataset.
Consider the following dataset, which contains the average test scores in physics and math of eight students (Table 1).
Table 1. Sample data set of physics and math scores of eight students.
From the scatter plot (Figure 1), it seems that there is a positive relationship between the marks in physics and math.
But what if we want to summarize the above data on just one coordinate instead of two? We have two options: take all the values of physics and plot them on a line, or take all the values of math and plot them on a line (Figure 2).
It seems the scores in physics vary more than those in math. What if we choose the physics scores to represent the data set, since they vary more than the math scores and, anyway, the scores in physics and math move together? Intuitively, that doesn’t seem right. Although we would be choosing the variable with the maximum variation in the data set, we would be sacrificing the information about the math scores. Instead, let’s create new variables that are linear combinations of the existing variables and that maximize the variation captured (Table 2). Table 2 is a transformed version of the data set containing the marks, calculated by taking the principal components of the data.
Table 2. The physics and math data sets, transformed.
Principal components are linear combinations of the original variables, i.e., both PC1 and PC2 are linear combinations of the variables physics and math. For the two variables, we get two principal components (Figure 3).
From the above line graphs, we see that PC1 shows the maximum variation. And because it’s a representation of both of the original variables, it’s a better candidate to represent the data set than just physics.
Let’s observe the variance explained by both the components (Table 3).
| Importance of Components | PC1 | PC2 |
|---|---|---|
| Proportion of Variance | 0.982 | 0.018 |
Table 3. Variance in physics and math scores, as explained by PC1 and PC2.
It is clear from the above table that PC1 accounts for 98 percent of the variance in the data set and it could be used to represent the data set.
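The two-variable case can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the scores below are made-up values, not the data in Table 1, and the PCA is done by hand on the unscaled covariance matrix, which may differ from the settings used to produce Tables 2 and 3.

```python
import numpy as np

# Hypothetical scores for eight students (NOT the values from Table 1).
physics = np.array([55.0, 62.0, 70.0, 48.0, 85.0, 91.0, 66.0, 77.0])
math_scores = np.array([60.0, 64.0, 69.0, 58.0, 78.0, 80.0, 65.0, 73.0])

X = np.column_stack([physics, math_scores])
X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data

# Eigen-decomposition of the 2x2 sample covariance matrix.
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reorder so PC1 comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Each principal component is a linear combination of physics and math:
# the columns of `eigenvectors` hold the weights.
scores = X_centered @ eigenvectors  # column 0 = PC1, column 1 = PC2

# Share of the total variance carried by each component.
explained = eigenvalues / eigenvalues.sum()

print("PC1 weights (physics, math):", eigenvectors[:, 0])
print("Proportion of variance:", explained.round(3))
```

Because PC1 is the direction of maximum variance, its variance is never smaller than that of physics or math taken alone, which is why it is the better single-coordinate summary.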
Let’s extend this same idea to a multidimensional scenario. We will do so with the help of the wine data set discussed in Part 2. The wine data set (Table 4) contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample. A summary of the data set is shown in Figure 4.
Table 4. Sample data set of chemical analyses of wine.
We have a total of 13 numerical variables on which we can run PCA (PCA can run on numerical variables only). Let’s check the importance of the components after running the PCA algorithm (Table 5) and select the principal components based on the variation in the data set that they explain.
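| Importance of Components | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 |
|---|---|---|---|---|---|---|---|
| Proportion of Variance | 0.362 | 0.192 | 0.111 | 0.071 | 0.066 | 0.049 | 0.042 |

| Importance of Components | PC8 | PC9 | PC10 | PC11 | PC12 | PC13 |
|---|---|---|---|---|---|---|
| Proportion of Variance | 0.027 | 0.022 | 0.019 | 0.017 | 0.013 | 0.008 |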
Table 5. Importance of the principal components after PCA.
Based on Table 5, it seems that more than 50 percent of the variance in the data is explained by the top two principal components, 80 percent by the top five principal components, and over 90 percent by the top eight.
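The cumulative figures quoted above can be reproduced directly from the proportions in Table 5. The sketch below uses only the Python standard library; the helper name `components_needed` is a hypothetical name chosen for illustration.

```python
from itertools import accumulate

# Proportion of variance for PC1 through PC13, copied from Table 5.
proportions = [0.362, 0.192, 0.111, 0.071, 0.066, 0.049, 0.042,
               0.027, 0.022, 0.019, 0.017, 0.013, 0.008]

# Running total of the variance explained by the leading components.
cumulative = list(accumulate(proportions))


def components_needed(threshold):
    """Return the smallest number of leading components whose
    cumulative proportion of variance reaches the threshold."""
    for count, total in enumerate(cumulative, start=1):
        if total >= threshold:
            return count
    raise ValueError("threshold exceeds the total variance listed")


for threshold in (0.5, 0.8, 0.9):
    print(f"{threshold:.0%} of variance: {components_needed(threshold)} components")
```

The function returns 2, 5, and 8 for the three thresholds, in line with the reading of Table 5. Because the table values are rounded, they sum to 0.999 rather than exactly 1.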
Let’s further understand the relationship between the variables and confirm that the PCA has been able to capture the pattern in the data among the variables, with the help of a biplot (Figure 5).
Biplots are the primary visualization tool for PCA. A biplot plots the transformed data as points (shown here as the row indices of the data set) and the original variables as vectors (arrows) on the same graph. It also helps us visualize the relationships among the variables themselves.
The direction of the vectors, their length, and the angle between them all have a meaning. Let’s look at the angle between the vectors. The smaller the angle between two vectors, the more strongly the corresponding variables are positively correlated. In the above plot, Alcalinity and Nonflavanoids have a high positive correlation, as shown by the small angle between their vectors. The same can be said for Proanthocyanins and Phenols. Malic and Hue are negatively correlated, as are Alcalinity and Phenols, because their vectors point in opposite directions.
We can verify this by choosing the row indices from the plot. Let’s choose row indices 131 and 140. As can be seen in Table 6, Alcalinity and Nonflavanoids moved together.
Table 6. Alcalinity and Nonflavanoids are positively correlated.
Let’s look at row indices 23 and 33 (Table 7).
Table 7. Proanthocyanins and Phenols moved together.
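The link between the angle of two biplot vectors and the correlation of the underlying variables can also be checked numerically. The sketch below is an illustration under several assumptions: it uses synthetic variables `a`, `b`, and `c` rather than the wine data, it standardizes the variables before PCA, and it draws each arrow as the loadings on the first two components scaled by the square root of the eigenvalue, which is one common biplot convention and not necessarily the one used for Figure 5.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins: a and b move together, c moves against them.
base = rng.normal(size=n)
a = base + 0.3 * rng.normal(size=n)
b = base + 0.3 * rng.normal(size=n)
c = -base + 0.3 * rng.normal(size=n)
X = np.column_stack([a, b, c])

# Standardize, then run PCA via the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
corr = np.corrcoef(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# One arrow per variable: its loadings on PC1 and PC2,
# scaled by the square root of the corresponding eigenvalue.
arrows = eigenvectors[:, :2] * np.sqrt(eigenvalues[:2])


def cosine(u, v):
    """Cosine of the angle between two 2-D arrows."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


print("corr(a, b):", round(corr[0, 1], 3), "cos(angle):", round(cosine(arrows[0], arrows[1]), 3))
print("corr(a, c):", round(corr[0, 2], 3), "cos(angle):", round(cosine(arrows[0], arrows[2]), 3))
```

A cosine near 1 means a small angle and a positive correlation; a cosine near -1 means the arrows point in opposite directions and the variables are negatively correlated. The match is approximate because the biplot keeps only the first two components.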
Interestingly, the data points seem to form clusters, indicated by the different colors corresponding to the type of wine. PCA can be useful not only for reducing the dimensionality of the data set, but also for clustering. And because a predictive model would be built on only the subset of principal components that explains the majority of the variation in the data set, PCA can also help reduce the problem of overfitting.
Intuitive and Additional Features
Sometimes you can create additional features, either manually or programmatically, from domain knowledge or common sense. Consider, for example, the following:
- How many times have you come across a data set that contains the birth date of a user? Are you using that information in the given form or are you transforming and creating a new variable, like Age of the User?
- You likely also have come across date and time stamps that contain information on date, hour, minutes, and seconds, if not more. Would you take this information as it is? Wouldn’t it be useful to create new variables, such as month of the year, day of the week, and hour of the day?
- Many businesses are seasonal in nature. Based on the nature of the industry, a new variable recognizing that seasonality could be created.
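The derived variables described in the list above can be sketched with Python's standard `datetime` module. The function names and the month-to-season mapping are hypothetical choices for illustration; in particular, the mapping assumes meteorological seasons in the Northern Hemisphere, and a real business would substitute its own seasonal calendar.

```python
from datetime import date, datetime


def age_in_years(birth_date, on_date):
    """Whole years elapsed between birth_date and on_date."""
    had_birthday = (on_date.month, on_date.day) >= (birth_date.month, birth_date.day)
    return on_date.year - birth_date.year - (0 if had_birthday else 1)


# Hypothetical mapping: meteorological seasons, Northern Hemisphere.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def timestamp_features(ts):
    """Split a timestamp into simple calendar features."""
    return {
        "month": ts.month,
        "day_of_week": ts.weekday(),      # Monday = 0 ... Sunday = 6
        "hour": ts.hour,
        "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
        "season": SEASON_BY_MONTH[ts.month],
    }


print(age_in_years(date(1990, 6, 15), date(2024, 6, 14)))
print(timestamp_features(datetime(2024, 3, 9, 14, 30)))
```

Each returned value can become its own column in the modeling data set, alongside or in place of the raw birth date and timestamp.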
The steps preceding the predictive-modeling stage take up as much as 80 percent of an analyst’s time, and data preparation takes up the lion’s share of that. The importance of feature engineering in data preparation should not be underestimated. If done the right way, feature engineering could lead to better insights and more efficient and robust models.
Jacob Joseph is a Senior Data Scientist for CleverTap, a digital analytics, user engagement, and personalization platform, where he is an integral part in leading their data science team. His role encompasses deriving key actionable business insights and applying machine learning algorithms to augment CleverTap’s effort to deliver world-class, real-time analytics to its 2,500 customers worldwide. He can be reached at email@example.com.