With all the buzz surrounding big data, the data management practitioner is constantly inundated with information regarding big data technologies.
After identifying which big data problems an organization must solve, the next step is understanding the advantages and disadvantages of different approaches to address these challenges. Most importantly, the practitioner must make a case: why collect more data or develop more sophisticated algorithms?
There is an ongoing debate, and many experienced statisticians argue that the secret to taming your big data problems is to embrace the size of your detail data rather than the complexity of your models. (Detail data are the attributes and interactions of entities, usually users or customers; preferences, impressions, clicks, ratings and transactions are all examples of detail data.)
Dozens of articles have been written detailing how more data beats better algorithms, but very few address why this approach yields the greatest return.
In a nutshell, having more data allows the “data to speak for itself,” instead of relying on unproven assumptions and weak correlations.
While you can invest vast resources in algorithm development, the smarter option is often to invest in data collection and accessibility, which is more economically viable and yields greater prediction accuracy.
This article explains why having more training data can improve the accuracy of a model and allow organizations to better serve their users.
Training data can be defined as the subset of relevant data you use when doing analysis and building predictive models. Because you often don’t know in advance which data will be relevant in predicting user behavior, best practice is to collect everything you can. The amount of training data available can never exceed the total data collected, which is why new data infrastructure like Hadoop is valuable: you can afford to collect everything.
The New Bottleneck
When data management technologies first came to market years ago, hardware was the primary bottleneck with thin network pipes, small hard disks and slow processors. As hardware improved, it became possible to create software that distributed storage and processing across large numbers of independent servers. This new type of software takes care of reliability and scalability issues across hardware devices, resulting in a platform that can scale with the data being collected.
Hadoop is the software platform that enables large-scale collection and storage of detail data at low cost so you can afford to “collect everything.” Software frameworks like Kiji and Impala make Hadoop data accessible to analysts for predictive modeling development and deployment.
The current bottleneck most analysts face is finding software that allows them to make sense of all the detail data. Instead of spending vast amounts of time sifting through an avalanche of data, data scientists need tools to determine what subset of data to analyze and how to make sense of it.
The next generation of business intelligence software is tackling this challenge by using the full amount of data available to create more accurate algorithms and garner better results.
Why More Data Beats Better Algorithms
The logic behind the idea that more data beats better algorithms is rather subtle. Say, for example, we believe the relationship between two variables, such as the number of pages viewed on a website and the likelihood that a visitor will make a purchase, is linear. Having more data points improves our estimate of the underlying linear relationship. The two graphs in Figure 1 show that more data gives us a more accurate and confident estimate of the linear relationship.
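To make the effect in Figure 1 concrete, here is a minimal sketch in Python on synthetic data (the true slope, intercept and noise level are assumptions) showing how the uncertainty of a fitted slope shrinks as points are added:

    # A minimal sketch of the effect shown in Figure 1: the same assumed
    # linear relationship, estimated from 20 points and from 2,000 points.
    import numpy as np

    rng = np.random.default_rng(0)

    def fit_line(n_points):
        x = rng.uniform(0, 50, n_points)                # e.g., pages viewed
        y = 0.8 * x + 5 + rng.normal(0, 10, n_points)   # assumed relationship + noise
        coeffs, cov = np.polyfit(x, y, 1, cov=True)
        slope, slope_se = coeffs[0], np.sqrt(cov[0, 0])
        print(f"n={n_points:5d}  slope={slope:.3f} +/- {slope_se:.3f}")

    fit_line(20)     # wide error bars around the estimate
    fit_line(2000)   # the same line, estimated with far more confidence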
Simple correlations between two variables are common in retail, finance, and mobile applications. In retail, we estimate the probability that a user will check out given the contents of their shopping cart. In finance, we compute the probability that a transaction is fraudulent given a ZIP code. In mobile, we compute the probability a user will redeem an offer given a GPS location.
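As a rough illustration, the fraud example reduces to estimating a conditional probability from counts; the ZIP codes and transactions below are hypothetical:

    # A toy sketch: estimate P(fraudulent | ZIP code) directly from counts.
    # The transaction data here are hypothetical.
    from collections import defaultdict

    transactions = [
        ("94105", True), ("94105", False), ("94105", False),
        ("10001", False), ("10001", False), ("10001", True),
    ]

    totals = defaultdict(int)
    frauds = defaultdict(int)
    for zip_code, is_fraud in transactions:
        totals[zip_code] += 1
        frauds[zip_code] += is_fraud   # True counts as 1

    for zip_code in totals:
        p = frauds[zip_code] / totals[zip_code]
        print(f"P(fraud | zip={zip_code}) = {p:.2f}")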
However, comparing the two graphs in Figure 1, notice that beyond a point more data barely changes the linear estimate: once you have “enough” data, the result is virtually the same. The real trick to using more data effectively is to make fewer initial assumptions about the underlying model and let the data guide which model is most appropriate.
In the above example, we assumed the linear model before collecting data to parameterize the relationship. Next, we’ll examine how the data itself can provide actionable insight beyond a linear relationship.
More Data Can Reveal a Non-Linear Relationship Within a Dataset
Many organizations build complicated models that use only a small subset of the available data to determine what content should be offered to the user next.
Let’s say the graph above in Figure 1 represents a predictive model for a recommendations engine for a sports website that delivers national sporting news to subscribers. The linear model suggests that there is a strong correlation between “reading about football” and “reading about soccer.”
The X-axis represents how often users read football news and the Y-axis represents the likelihood these users will also read soccer news.
However, the graphs above show that the true relationship is not quite linear. The U-shaped dip between 10 and 30 on the X-axis cannot be captured using our linear model, yet it provides tremendous insight into individual user behavior.
Detail data allows us to pick a nonparametric model—such as estimating a distribution with a histogram—and provides more confidence that we are building an accurate model.
If we have significantly more data, we can accurately represent our model as a histogram, as depicted in Figure 2, better capturing the relationship between the variables. We can essentially forgo the linear parametric model for a simple density estimation technique.
With more data, the simpler solution (estimating a distribution with a histogram) actually becomes more accurate than the sophisticated solution (estimating the parameters of a model using a linear regression).
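Here is a rough sketch of the comparison, using bucketed means as a simple stand-in for the histogram estimate; the U-shaped data are synthetic and the bucket width is an assumption:

    # With enough data, per-bucket estimates recover structure a line cannot.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 50, 100_000)                      # e.g., football stories read
    y = 0.02 * (x - 20) ** 2 + rng.normal(0, 2, x.size)  # assumed U-shaped relationship

    # Parametric: a straight line smooths the dip away entirely.
    slope, intercept = np.polyfit(x, y, 1)
    print(f"linear fit: y = {slope:.3f}x + {intercept:.3f}")

    # Nonparametric: per-bucket means reveal the dip between 10 and 30.
    buckets = np.digitize(x, bins=np.arange(0, 55, 5))
    for b in np.unique(buckets):
        print(f"bucket {b:2d}: mean y = {y[buckets == b].mean():.2f}")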
This insight allows the sports website to better serve its subscribers by building recommendations engines that adapt to user preferences. Other examples in which this technique is relevant include item similarity matrices for millions of products, and association rules derived using collaborative filtering techniques.
Nonparametric Models Win
Overall, an unproven assumption coupled with a complex algorithm is far less effective than more data coupled with a simpler algorithm.
If the parameter space were much larger, the model itself (the data backing the histogram alone) could grow very large. Nonparametric models are becoming more commonplace in big data analysis, especially when the model is too large to fit in memory on a single machine.
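A toy sketch of why such models distribute naturally: the “model” is essentially a large lookup table of per-key estimates, so it can be hash-partitioned across machines. The keys and shard count below are hypothetical; the frameworks mentioned next handle this at scale.

    # Hash-partition a large nonparametric model (a lookup table of estimates)
    # across machines. Keys, values and shard count are hypothetical.
    NUM_SHARDS = 4

    def shard_for(key):
        # Each shard would live on its own machine in a real deployment.
        return hash(key) % NUM_SHARDS

    shards = [{} for _ in range(NUM_SHARDS)]  # stand-ins for per-machine stores

    def put(key, estimate):
        shards[shard_for(key)][key] = estimate

    def get(key):
        return shards[shard_for(key)].get(key)

    put(("reads_football_10_15", "reads_soccer"), 0.42)  # hypothetical model cell
    print(get(("reads_football_10_15", "reads_soccer")))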
Next-generation open source frameworks, such as Kiji and Cloudera’s Impala, have been designed to support distributed training sets and distributed model representations, taking full advantage of nonparametric modeling techniques. By simplifying your models and increasing the data available, enterprises can better automate the sales and marketing funnel, create more effective calls to action and increase customer lifetime value.
Garrett Wu is vice president of engineering at WibiData. A former technical lead at Google’s personalized recommendations team, he now focuses on natural language processing, machine learning and data mining.