Data Preparation: What Is It and Why Is It Important?
When you’re cooking, preparation is an essential step. Ingredients need to be collected, peeled, marinated and put where you will be able to reach them when the oil is hot or the oven reaches the right temperature.
This is also true in any business analytics and intelligence-driven process – today data comes from increasingly disparate sources and in an ever-growing variety of forms. The insights you are looking for could lie in images, communications, machine-to-machine interactions and real-time sensor data. Most probably, they will lie in a combination of more than one of these sources. This means that to get them to work together, they need the same care and attention as garden-fresh vegetables and prime cuts of meat do before you throw them in the pot.
Every project is likely to be different and involve different data, so there are no hard-and-fast checklists for the steps you need to take to ensure your data is sufficiently prepped. In general, though, any operation performed on data as it is ingested into your system and passed through to a particular analytical process can be considered part of the data prep stage for that process.
All these operations share a common aim – to ensure your analytics processes receive error-free data in a consistent format that users can read, understand and work with.
Frequently, data preparation includes these steps:
Data Prep Strategy
Because all projects are different, the first step is always to start with a strategy. In terms of data preparation, this means formulating a workflow process that will cover all the steps your project needs and also determine how this strategy will be applied to every different type, or source, of data. To follow my cooking analogy, this would be the equivalent of making sure you have all of the ingredients listed in the recipe and knowing what you need to do to them before they go in the pot.
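To make this concrete, a prep strategy can be expressed in code as an ordered pipeline of steps applied to each record from a given source. The sketch below is a minimal illustration in Python – the step names and sample record are invented for the example, not taken from any particular tool:

```python
def run_pipeline(record, steps):
    """Apply each prep step in order -- one pipeline per data source."""
    for step in steps:
        record = step(record)
    return record

def strip_whitespace(record):
    # Trim stray whitespace from every string field.
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def lowercase_email(record):
    # Normalize email addresses so duplicates can be matched later.
    return {**record, "email": record["email"].lower()}

# Hypothetical pipeline for one source of customer records.
PIPELINE = [strip_whitespace, lowercase_email]
out = run_pipeline({"name": " Ada ", "email": " ADA@Example.com "}, PIPELINE)
```

A different source (say, sensor data) would get its own list of steps, which is exactly the per-source planning the strategy stage is about.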
Data Cleansing
Data cleansing means removing data that is inaccurate, damaged, corrupt or otherwise erroneous, so that it is not taken into account during analytics. This process should pick up errors ranging from mistakes made during human data input to corrupt data caused by faulty sensors, data transfer systems or storage.
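As a minimal illustration, here is what a cleansing step might look like in Python. The sensor readings, the -999 error code and the valid temperature range are all invented for the example:

```python
# Hypothetical sensor readings; None and out-of-range values simulate corrupt input.
raw_readings = [
    {"sensor_id": "A1", "temp_c": 21.4},
    {"sensor_id": "A2", "temp_c": None},    # transmission failure
    {"sensor_id": "A3", "temp_c": -999.0},  # faulty sensor error code
    {"sensor_id": "A4", "temp_c": 22.1},
]

def clean(readings, lo=-50.0, hi=60.0):
    """Drop records whose temperature is missing or outside a plausible range."""
    return [r for r in readings
            if r["temp_c"] is not None and lo <= r["temp_c"] <= hi]

cleaned = clean(raw_readings)  # keeps A1 and A4 only
```

Real cleansing rules will depend on your domain, but the pattern – define what "valid" means, then filter or repair everything else – is the same.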
Metadata Creation
Metadata creation is labeling data to make it easier for your analytics systems to know what to do with it when they receive it at the end of your data prep process. Metadata tags data with information about information – for example, when and where a picture or video was taken, or the age and rough geographical location of the sender of a customer complaint email.
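One simple way to picture this step is attaching a metadata block to each record as it is ingested. The field names and source label below are hypothetical:

```python
from datetime import datetime, timezone

def tag_with_metadata(record, source):
    """Attach provenance metadata so downstream analytics know where
    a record came from and when it entered the system."""
    return {
        **record,
        "_meta": {
            "source": source,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        },
    }

# A customer complaint email, tagged at ingestion time.
email = {"subject": "Order arrived damaged", "region": "NE"}
tagged = tag_with_metadata(email, source="customer_email")
```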
Data Transformation
This step involves putting data into the correct format for your analytics systems to work with: taking the data in whatever format it has been ingested – by scanners, sensors, cameras or manual human input – and putting it into whatever database format your analytics engines will understand. Data can be compacted at this point to save space and improve speeds, and any elements that will not be read by your analytics processes can be discarded.
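As an illustration, the sketch below parses a raw CSV export, casts each field to the type a hypothetical analytics engine expects, and discards the columns it will never read:

```python
import csv
import io

# Hypothetical raw export with an extra column the analytics engine never reads.
raw = """order_id,customer,internal_flag,amount
1001,Alice,x,19.99
1002,Bob,y,5.00
"""

# Target schema: field name -> type to cast to. Anything else is discarded.
KEEP = {"order_id": int, "amount": float}

def to_target_format(text):
    """Parse CSV text and keep only the typed fields the engine expects."""
    rows = csv.DictReader(io.StringIO(text))
    return [{k: cast(row[k]) for k, cast in KEEP.items()} for row in rows]

records = to_target_format(raw)
```

Dropping the unused `customer` and `internal_flag` columns here is the "discard what won't be read" step; casting strings to numbers is the format conversion.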
Data Standardization
Analytics algorithms and software expect dates, names, geographical locations and a myriad of other features to be presented in a uniform way – for example, checking that all dates are in an eight-digit format rather than a six-digit one, to avoid confusion during analysis, comes under data standardization. Data can also be checked at this point to make sure it falls into appropriate ranges – for example, if you are looking at customers only in a certain area, do all of the zip codes meet your requirements?
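Both checks can be sketched in a few lines of Python. Note the assumptions: six-digit dates are taken to be MMDDYY, and the zip-code prefixes are invented – in practice you would confirm both against your own data:

```python
from datetime import datetime

def standardize_date(s):
    """Normalize dates to an eight-digit ISO form (YYYY-MM-DD).
    Assumes six-digit inputs are MMDDYY -- an assumption to verify
    against your actual sources."""
    for fmt in ("%Y-%m-%d", "%m%d%y", "%m/%d/%Y"):
        try:
            return datetime.strptime(s, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {s}")

def zip_in_area(zip_code, valid_prefixes=("100", "101", "102")):
    """Range check: does this zip code fall in the (hypothetical) target area?"""
    return zip_code[:3] in valid_prefixes
```

Records that fail either check can then be routed back to the cleansing step rather than silently passed on to analysis.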
Data Enrichment
Is there anything else that can be added to your data – perhaps from publicly available datasets – to make it more likely to reveal insights during analysis? It may also be possible to extrapolate additional facts from what is already known. Carrying this out ahead of the target analytics will save processing time and ensure you have the highest-quality data before your algorithms go to work.
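A minimal sketch of enrichment: join each customer record against a small lookup table (standing in for a publicly available dataset – the mapping here is invented) and derive an extra field before analysis begins:

```python
# Hypothetical public lookup: zip-code prefix -> region name.
ZIP_TO_REGION = {"100": "New York, NY", "941": "San Francisco, CA"}

def enrich(customer):
    """Add a region derived from the zip code, plus a field
    extrapolated from data already present (the full name)."""
    region = ZIP_TO_REGION.get(customer["zip"][:3], "unknown")
    return {
        **customer,
        "region": region,
        "full_name": f'{customer["first"]} {customer["last"]}',
    }

enriched = enrich({"first": "Ada", "last": "Lovelace", "zip": "10001"})
```

Doing this join once, during prep, spares every downstream analysis from repeating it.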
Do I Have to Do All This Preparation Myself?
Fortunately, you do not have to do all these steps alone. First, if your data initiative involves only one type or source of data – just video, just the names and addresses of customers or just transactional records – chances are that tools are already available that will handle your data just fine in its raw state.
However, with most Big Data projects, the volume, variety and velocity of the data involved is too great for it to be practical to carry out these tasks manually. In these cases, thankfully, a large and growing self-service market for data preparation tools has emerged.
Because of the uniform nature of the operations and the repetitive tasks involved, data preparation is an ideal candidate for automation, and one-stop-shop solutions – often delivered through simple web interfaces that require a minimum of data science training – are becoming increasingly common.
Just as with cooking, when it comes to business intelligence and data, good, solid preparation can often be the difference between success and failure. Regardless of which process from this list you decide is necessary in your situation, a consistent data prep strategy should be a priority for anyone involved in digital transformation and data-driven discovery.
Bernard Marr is an internationally best-selling business author, keynote speaker and strategic advisor to companies and governments. He is one of the world’s most highly respected voices on data in business and has been recognized by LinkedIn as one of the world’s top 5 business influencers. You can explore his website at bernardmarr.com or follow him on Twitter @bernardmarr.