How to Improve Machine Learning: Tricks and Tips for Feature Engineering

July 19, 2016 5:30 am   |   1 Comment

Editor’s Note: This is the first in a four-part series on improving analytics output with feature engineering.


Jacob Joseph, Senior Data Scientist, CleverTap

A predictive model is a formula that transforms a list of input fields or variables into some output of interest. Feature engineering is simply the thoughtful creation of new input fields from existing ones, either automatically or manually, drawing on domain expertise, logical reasoning, or intuition. The new input fields can yield better inferences and insights from the data and can substantially improve the performance of predictive models.

Feature engineering is one of the most important parts of the data preparation process, where deriving new and meaningful variables takes place. Feature engineering enhances and enriches the ingredients needed for creating a robust model. Many times, it is the key differentiator between an average and a good model.

Some of the common and popular tricks employed for feature engineering are discussed below:


Standardizing Numerical Variables

Numerical variables in a data set, such as height and weight, are generally on different scales. It is advisable to standardize these variables to bring them onto the same scale. A familiar example is body mass index (BMI), a measure used to determine whether a person is underweight or overweight: it adjusts the weight measurement for height so that the BMIs of different people are comparable.

Failure to standardize variables might result in algorithms placing undue weight on variables that are on a larger scale. This is especially true for many machine-learning algorithms, such as SVMs, neural networks, and K-means. One way to standardize a variable is to subtract the variable’s mean from each observation and divide the result by the variable’s standard deviation (the z-score).
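The z-score calculation described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name z_scores and the sample heights and weights are hypothetical, not taken from any real data set.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mu) / sigma for v in values]


# Hypothetical measurements on very different scales.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weights_kg = [50.0, 60.0, 70.0, 80.0, 90.0]

heights_z = z_scores(heights_cm)
weights_z = z_scores(weights_kg)
```

After the transformation, both variables have a mean of 0 and a standard deviation of 1, so neither dominates a distance-based algorithm simply because of its units.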

But standardization is not advisable in all instances – for example, for well-defined attributes like latitude and longitude, where one might lose valuable information to standardization. So don’t use standardization indiscriminately.

Binning/Converting Numerical to Categorical Variable

Feature binning, a popular feature-engineering technique, groups the values of a numerical variable into bins. The bin boundaries can come from simple percentiles, domain knowledge, visualization, or predictive techniques. Binning offers a quick segmentation that makes numerical variables easier to interpret. Building a separate model for each bin can also yield more specific, relevant, and accurate predictive models, such as regression models. A common example is the grades awarded to students based on their exam scores, which segment the students and make interpretation a lot easier.
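The exam-grade example might look like the following sketch, which bins scores using fixed cut-offs. The cut-offs, grade labels, and scores are assumptions chosen for illustration; in practice the boundaries would come from percentiles or domain knowledge.

```python
from bisect import bisect_right

# Hypothetical grade boundaries: the lowest score that earns each grade above "F".
CUTOFFS = [60, 70, 80, 90]
GRADES = ["F", "D", "C", "B", "A"]


def grade(score):
    """Map a numeric exam score to a letter-grade bin."""
    return GRADES[bisect_right(CUTOFFS, score)]


scores = [55, 60, 72, 89, 90, 100]
binned = [grade(s) for s in scores]
```

Here a score equal to a cut-off falls into the higher bin, which is one of several reasonable conventions.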

Reducing Levels in Categorical Variables

We often come across scenarios in which a categorical variable has many levels, like the branches of a bank, postal codes, or products listed on an e-commerce website. Handling so many levels can become cumbersome, yet the frequency distribution might reveal that a small subset of levels accounts for about 90 percent of the observations. Building a predictive model without treating the levels will most likely lead to a less robust model and hurt computational efficiency.

For example, a decision tree or random forest will tend to give more importance to the categorical variable with many levels, even though it may not deserve it. We can treat such categorical variables with predictive-modeling techniques, domain expertise, or even a simple frequency-distribution approach.
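A simple frequency-distribution approach could be sketched as follows: keep the most frequent levels until they cover a chosen share of the observations, and relabel everything else. The function name lump_rare_levels, the 90 percent default, the "Other" label, and the branch data are all illustrative assumptions.

```python
from collections import Counter


def lump_rare_levels(values, coverage=0.9, other="Other"):
    """Keep the most frequent levels covering `coverage` of rows; relabel the rest."""
    counts = Counter(values)
    total = len(values)
    keep, covered = set(), 0
    for level, n in counts.most_common():
        if covered >= coverage * total:
            break  # the levels kept so far already cover enough observations
        keep.add(level)
        covered += n
    return [v if v in keep else other for v in values]


# Hypothetical bank branches: two of them account for most observations.
branches = ["A"] * 60 + ["B"] * 32 + ["C"] * 4 + ["D"] * 2 + ["E"] * 2
reduced = lump_rare_levels(branches)
```

With this data, five levels collapse to three: the two dominant branches plus a single catch-all level.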

Transforming Non-normal Distribution to Normal

A number of statistical analysis tools and techniques assume normally distributed data (Figure 1).

Figure 1. Transforming data distribution.


In (c) above, the variable z1 is converted to a normal distribution by taking its log, which also converts the relationship between x and z1 from non-linear to linear, as shown in (b). Log transformation, though useful, is not guaranteed to work all the time. One can use other transformations, such as taking the square root or cube root. But experimenting with various combinations is not an ideal solution.

To change the distribution of a variable from non-normal to normal or near normal, Box-Cox transformations are very useful. The procedure estimates the power parameter of the transformation from the data (a parameter value of 0 corresponds to the log transformation used above). With that parameter, the variable can be transformed to a normal or near-normal distribution.
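To make the idea concrete, here is a rough standard-library sketch of the Box-Cox transform with a coarse grid search for its power parameter (lambda). It assumes strictly positive data, and the grid, function names, and sample data are illustrative choices. Statistical libraries offer more careful estimators (for example, scipy.stats.boxcox).

```python
import math


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def log_likelihood(values, lam):
    """Profile log-likelihood of lambda, assuming the transformed data is normal."""
    y = box_cox(values, lam)
    n = len(y)
    mean_y = sum(y) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    return (lam - 1) * sum(math.log(v) for v in values) - n / 2 * math.log(var_y)


def best_lambda(values, grid=None):
    """Pick the lambda on a coarse grid that maximizes the log-likelihood."""
    grid = grid or [i / 10 for i in range(-20, 21)]
    return max(grid, key=lambda lam: log_likelihood(values, lam))


# Hypothetical right-skewed data: exponentials of evenly spaced values,
# so taking the log (lambda = 0) makes the data symmetric.
data = [math.exp(x / 4) for x in range(1, 41)]
lam = best_lambda(data)
transformed = box_cox(data, lam)
```

The grid search replaces trial-and-error over log, square root, cube root, and so on with a single estimated parameter.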

Missing Data

Missing data is a reality that every analyst has to reckon with. It may make sense to deal with missing values before making any inference or building a predictive model. Imagine the implications of leaving out required information while completing tax return statements.

Dummy Variables

Often, you will face a situation in which a categorical variable has more than two levels or attributes.

For example, a category, Operating System, could have levels such as iOS, Android, and Windows. You could encode them as 0, 1, and 2, respectively, for a regression model, which requires numerical variables as inputs. In doing so, however, you are assigning an order to the operating systems, and the distance between the encoded attributes, like 2 (Windows) minus 0 (iOS), doesn’t mean anything. Dummy variables are a way to “trick” the algorithm into correctly analyzing such attribute variables.

Instead of storing the information about different operating systems in one variable, you could create different variables for different operating systems, and each of these new variables will have only two levels representing its existence or non-existence in the observation. The number of dummy variables so created must be one less than the number of levels or it would lead to redundant information. In the above case, it is sufficient to create dummy variables for Android and iOS. If the observation indicates the absence of Android and iOS, it implies that Windows is present.

Dummy variables could lead to better predictive models by providing a different coefficient or multiplier factor for each level, without assuming any order among the levels of the categorical variable.
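The operating-system example could be encoded along these lines. The sketch creates one fewer indicator column than there are levels, treating Windows as the omitted baseline; the function name dummy_encode and the is_ column prefix are assumptions made for illustration.

```python
def dummy_encode(values, levels, baseline):
    """Encode a categorical variable as k-1 indicator columns, omitting the baseline."""
    kept = [lvl for lvl in levels if lvl != baseline]
    return [{f"is_{lvl}": int(v == lvl) for lvl in kept} for v in values]


os_values = ["iOS", "Android", "Windows", "Android"]
rows = dummy_encode(
    os_values,
    levels=["iOS", "Android", "Windows"],
    baseline="Windows",  # implied when both indicators are 0
)
```

A row with both indicators set to 0 represents Windows, so no information is lost by dropping the third column.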


Interaction Variables

Suppose you are a credit officer and have been given historical data on applicants: income level (High, Low), the quantum of loans already held at the time of the application, and the current status of the loan (Default or Good) after disbursal. Your goal is to create a model that will predict whether a new applicant will default on a loan.

You could create a regression model to predict the prospective defaulter with the independent variables of income level and quantum of previous loans held. This model will give you insights into how the probability of default changes with different levels of income and previous loans (Figure 2).

Figure 2. Probability of default with different levels of income and previous loans.


The slope of the graph reveals that the rate of increase in default probability is the same for both income levels. The relationship between the variables is additive, i.e., the effect of income level on default probability does not depend on previous loans held, and the effect of previous loans held on default probability does not depend on income level. This seems counter-intuitive, as one would guess that the combination of income level and previous loans held should affect default probability.

Let’s create a new variable, in the form of an interaction variable, by combining income level and previous loans (income level x previous loans) (Figure 3).

Figure 3. Probability of default with an interaction variable combining income level and previous loans.


From the slope of the above graph, one could infer that default probability increases more rapidly for applicants with low income than for those with high income.
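The difference between the additive model and the model with an interaction variable can be illustrated with a small sketch of the linear part of such a regression. The coefficients below are hypothetical, chosen only to show how the slope behaves, and the score is a linear predictor rather than a fitted default probability.

```python
def linear_score(low_income, prev_loans, coef):
    """Linear predictor with an interaction term (low_income x prev_loans)."""
    interaction = low_income * prev_loans  # the engineered feature
    return (
        coef["intercept"]
        + coef["low_income"] * low_income
        + coef["prev_loans"] * prev_loans
        + coef["interaction"] * interaction
    )


def slope(low_income, coef):
    """Change in the score per additional previous loan at a given income level."""
    return linear_score(low_income, 1, coef) - linear_score(low_income, 0, coef)


# Hypothetical coefficients, not estimated from any data.
additive = {"intercept": 1.0, "low_income": 2.0, "prev_loans": 0.5, "interaction": 0.0}
with_interaction = dict(additive, interaction=0.5)

additive_slopes = (slope(0, additive), slope(1, additive))
interaction_slopes = (slope(0, with_interaction), slope(1, with_interaction))
```

Without the interaction term, the slope is identical for both income levels; with it, the slope for low-income applicants (low_income = 1) is steeper.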

Reducing Dimensionality

Reducing dimensionality involves reducing the number of variables required to describe a large set of data. Suppose you have 1,000 predictor variables, many of which are likely to be highly correlated. In that case, you might need only a subset of the predictor variables. Additionally, using all of them might result in overfitting.

Suppose a data set consists of variables like height in inches and height in centimeters. In this case, it makes sense to use height measured in either inches or centimeters, but not both. Another data set could consist of variables like length, width, and area, all in the same units. Here, we don’t need area because it is the product of length and width.
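One simple way to catch duplicates like inches versus centimeters is to drop any variable that is almost perfectly correlated with one already kept. The sketch below assumes a 0.95 threshold and made-up data. It only detects linear redundancy, so it would not by itself flag area as a product of length and width.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def drop_redundant(columns, threshold=0.95):
    """Greedily keep a column only if it is not highly correlated with a kept one."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical data: the same heights recorded in two units, plus weight.
columns = {
    "height_in": [60.0, 64.0, 68.0, 72.0, 76.0],
    "height_cm": [152.4, 162.56, 172.72, 182.88, 193.04],
    "weight_kg": [70.0, 55.0, 90.0, 62.0, 80.0],
}
selected = drop_redundant(columns)
```

The result keeps height_in and weight_kg and drops height_cm, since the two height columns carry the same information.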

The techniques chosen for reducing dimensionality depend on the type of variables. You would employ different methods for text, image, and numerical variables. One popular technique is to use hand-engineered feature-extraction methods (e.g., SIFT, VLAD, HOG, GIST, LBP). Another approach is to learn features that are discriminative in the given context (e.g., Sparse Coding, Auto Encoders, Restricted Boltzmann Machines, PCA, ICA, K-means).

Intuitive and Additional Features

Based on the available raw data, additional and intuitive features could be created manually or automatically. In the case of text data, for example, you might run an automated algorithm to create features like the length of a word, the number of vowels, n-grams, etc. You might also deal with the data manually, which might require domain expertise, common sense, or intuition.
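For text data, automated feature creation of this kind might be sketched as follows. The function name word_features and the choice of character bigrams are assumptions made for illustration.

```python
def word_features(word, n=2):
    """Simple features for one word: length, vowel count, character n-grams."""
    lowered = word.lower()
    return {
        "length": len(lowered),
        "vowels": sum(ch in "aeiou" for ch in lowered),
        "ngrams": [lowered[i:i + n] for i in range(len(lowered) - n + 1)],
    }


features = word_features("Feature")
```

Each word yields a small dictionary of numeric and list-valued features that can feed a downstream model.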

As a data analyst, you are trying to discover the signal amid the noise. In this age of big data, the noise is only going to increase. It is imperative that you have at least a candle to guard against the darkness. With the prudent and judicious use of all or some of the tools mentioned above, you will be equipped with more than just a candle.

Jacob Joseph is a Senior Data Scientist for CleverTap, a digital analytics, user engagement, and personalization platform, where he is an integral part of leading its data science team. His role encompasses deriving key actionable business insights and applying machine learning algorithms to augment CleverTap’s effort to deliver world-class, real-time analytics to its 2,500 customers worldwide.

One Comment

  1. Posted September 1, 2016 at 11:06 am | Permalink

    Very nice post and a lot of useful information. I have observed many Data Science aspirants too busy learning algorithms and not knowing the importance of the Data Preparation phase, especially feature engineering.
