With all the buzz surrounding big data, the data management practitioner is constantly inundated with information regarding big data technologies.

After identifying which big data problems an organization must solve, the next step is understanding the advantages and disadvantages of different approaches to address these challenges. Most importantly, the practitioner must make a case: why collect more data or develop more sophisticated algorithms?

There is a debate going on, and many experienced statisticians argue that the secret to taming your big data problems is by embracing the size of your detail data, rather than the complexity of your models. (Detail data are the attributes and interactions of entities—usually users or customers. Preferences, impressions, clicks, ratings and transactions are all examples of detail data.)

Dozens of articles have been written detailing how more data beats better algorithms. But very few address *why* this approach yields the greatest return.

In a nutshell, having more data allows the “data to speak for itself,” instead of relying on unproven assumptions and weak correlations.

While you can invest vast amounts of resources in algorithm development, often the smarter option is to invest in data collection and accessibility, which is more economically viable and provides greater prediction accuracy.

This article explains why having more training data can improve the accuracy of a model and allow organizations to better serve their users.

Training data can be defined as the subset of relevant data you use when doing analysis and building predictive models. Because you often don’t know what data is going to be relevant in predicting user behavior, best practices dictate that we collect everything we can. The amount of training data you have available can never be more than the total data you collect, which is why new data infrastructure like Hadoop is valuable: you can afford to collect everything.

**The New Bottleneck**

When data management technologies first came to market years ago, hardware was the primary bottleneck with thin network pipes, small hard disks and slow processors. As hardware improved, it became possible to create software that distributed storage and processing across large numbers of independent servers. This new type of software takes care of reliability and scalability issues across hardware devices, resulting in a platform that can scale with the data being collected.

Hadoop is the software platform that enables large-scale collection and storage of detail data at low cost so you can afford to “collect everything.” Software frameworks like Kiji and Impala make Hadoop data accessible to analysts for predictive modeling development and deployment.

The current bottleneck most analysts face is finding software that allows them to make sense of all the detail data. Instead of spending vast amounts of time sifting through an avalanche of data, data scientists need tools to determine what subset of data to analyze and how to make sense of it.

The next generation of business intelligence software is tackling this challenge by using the full amount of data available to create more accurate algorithms and garner better results.

**Why More Data Beats Better Algorithms**

The logic behind the concept that more data beats better algorithms is rather subtle. Say, for example, we believe the relationship between two variables – such as the number of pages viewed on a website and the likelihood that a visitor will make a purchase – is linear. Having more data points would improve our estimate of the underlying linear relationship. The two graphs in Figure 1 show that more data gives us a more accurate and confident estimation of the linear relationship.
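As a minimal sketch of this effect (the variable names, true slope and noise level here are invented for illustration), fitting the same underlying linear relationship with more points shrinks the uncertainty of the slope estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_slope(n):
    """Fit y = slope*x + b by least squares; return the slope and its standard error."""
    x = rng.uniform(0, 50, n)                  # e.g. pages viewed per visitor
    y = 0.8 * x + 5 + rng.normal(0, 10, n)     # assumed true linear relation plus noise
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    sigma2 = resid @ resid / (n - 2)           # residual variance estimate
    se = np.sqrt(sigma2 / ((x - x.mean()) ** 2).sum())
    return slope, se

slope_small, se_small = fit_slope(20)
slope_big, se_big = fit_slope(2000)
# More observations -> a much tighter (smaller standard error) slope estimate.
```

The point estimate is similar in both cases; what the extra data buys is confidence in it.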

Simple correlations between two variables are common in retail, finance, and mobile applications. In retail, we estimate the probability that a user will check out given the contents of their shopping cart. In finance, we compute the probability that a transaction is fraudulent given a ZIP code. In mobile, we compute the probability a user will redeem an offer given a GPS location.
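Each of these is just a conditional-probability estimate counted directly from detail data. A toy sketch of the fraud-by-ZIP example (the event log below is entirely made up):

```python
from collections import Counter

# Hypothetical transaction log: (zip_code, was_fraud) pairs of detail data.
events = [("94103", True), ("94103", False), ("94103", False),
          ("10001", False), ("10001", False), ("60601", True)]

totals = Counter(z for z, _ in events)                 # transactions per ZIP
frauds = Counter(z for z, fraud in events if fraud)    # fraudulent ones per ZIP
p_fraud = {z: frauds[z] / totals[z] for z in totals}   # empirical P(fraud | ZIP)
# p_fraud["94103"] == 1/3: one fraudulent transaction out of three observed.
```

With enough detail data, these raw counts become reliable estimates without any model assumptions at all.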

However, when comparing the two graphs above in Figure 1, notice that the additional data points barely change the linear estimate—the result is virtually the same once you have “enough data.” The “trick” to effectively using more data is to make fewer initial assumptions about the underlying model and let the data guide which model is most appropriate.

In the above example, we assumed the linear model before collecting data to parameterize the relationship. Next, we’ll examine how the data itself can provide actionable insight beyond a linear relationship.

**More Data Can Reveal a Non-Linear Relationship Within a Dataset**

Many organizations build complicated models that use a smaller subset of data to determine what content should be offered to the user next.

Let’s say the graph above in Figure 1 represents a predictive model for a recommendations engine for a sports website that delivers national sporting news to subscribers. The linear model suggests that there is a strong correlation between “reading about football” and “reading about soccer.”

The X-axis represents how often users read football news and the Y-axis represents the likelihood these users will also read soccer news.

However, the graphs above show that the true relationship is not quite linear. The U-shaped dip between 10 and 30 on the X-axis cannot be captured using our linear model, yet it provides tremendous insight into individual user behavior.

Detail data allows us to pick a nonparametric model—such as estimating a distribution with a histogram—and provides more confidence that we are building an accurate model.

If we have significantly more data we can accurately represent our model as a histogram, as depicted in Figure 2, better capturing the relationship between the variables. We can essentially forego the linear parametric model for a simple density estimation technique.

With more data, the simpler solution (estimating a distribution with a histogram) actually becomes more accurate than the sophisticated solution (estimating the parameters of a model using a linear regression).
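A quick numerical sketch of that claim (the U-shaped relationship below is synthetic, standing in for the football/soccer example): with plenty of points per bucket, simple per-bucket averages recover a curve that a straight-line fit cannot.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(0, 50, n)
# Hypothetical U-shaped relation: likelihood dips for mid-range readers.
true = 0.5 + 0.3 * np.cos(x / 8.0)
y = true + rng.normal(0, 0.05, n)

# Parametric: straight-line fit, which assumes linearity.
slope, intercept = np.polyfit(x, y, 1)
pred_linear = slope * x + intercept

# Nonparametric: per-bucket means, i.e. a "histogram" of averages.
bins = np.linspace(0, 50, 26)                 # 25 buckets of width 2
idx = np.clip(np.digitize(x, bins) - 1, 0, 24)
bucket_means = np.array([y[idx == b].mean() for b in range(25)])
pred_binned = bucket_means[idx]

mse_linear = np.mean((pred_linear - true) ** 2)
mse_binned = np.mean((pred_binned - true) ** 2)
# With ~200 points per bucket, the binned model tracks the curve;
# the linear fit averages the dip away and carries a much larger error.
```

The binned estimator only becomes trustworthy once every bucket holds enough observations, which is exactly where having more data pays off.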

This insight allows the sports website to better serve its subscribers by building recommendations engines that adapt to user preferences. Other examples in which this technique is relevant include item similarity matrices for millions of products, and association rules derived using collaborative filtering techniques.

**Nonparametric Models Win**

Overall, strong modeling assumptions coupled with complex algorithms are far less effective than more data coupled with simpler algorithms.

If this were a much larger parameter space, you could imagine that the model itself could be very large (the data representing just the red histogram). Nonparametric models are becoming more commonplace in big data analysis, especially when the model is too large to fit in memory on a single machine.
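To make the “model too large for one machine” idea concrete, here is a toy sketch (not Kiji’s or Impala’s actual mechanism) of partitioning a key → count model across shards by a stable hash, so each machine only holds its slice of the table:

```python
import zlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]   # stand-ins for separate machines

def shard_for(key: str) -> int:
    # Stable hash so the same key always routes to the same shard.
    return zlib.crc32(key.encode()) % NUM_SHARDS

def increment(key: str) -> None:
    s = shards[shard_for(key)]
    s[key] = s.get(key, 0) + 1

def lookup(key: str) -> int:
    return shards[shard_for(key)].get(key, 0)

for topic in ["soccer", "football", "soccer", "tennis"]:
    increment(topic)
# lookup("soccer") returns 2; no single shard holds the whole model.
```

Because reads and writes for a key touch only one shard, the model can grow with the data simply by adding shards.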

Next generation open source frameworks, such as Kiji and Cloudera’s Impala, have been designed to support distributed training sets *and* distributed model representations, taking full advantage of nonparametric model techniques. By simplifying your models and increasing the data available, enterprises can better automate the sales and marketing funnel, create more effective calls to action and increase customer lifetime value.

*Garrett Wu is vice president of engineering at WibiData. A former technical lead at Google’s personalized recommendations team, he now focuses on natural language processing, machine learning and data mining.*

## 12 Comments

I am not sure about the following statement –

“With more data, the simpler solution (estimating a distribution with a histogram) actually becomes more accurate than the sophisticated solution (estimating the parameters of a model using a linear regression).”

Here you are suggesting that a histogram is a simpler model/solution than linear regression?

First of all, IMHO, both of these solutions are much too simple to make the point you are making here. Secondly, if we consider these models, what sophistication are we talking about in estimating the parameters of a linear regression model? I am assuming you are talking about the coefficients, or is there something I am missing here?

Yes, I am suggesting that bucketing values into a histogram is “simpler” than a linear regression. By simple here, I mean that it makes fewer assumptions about the data, not that it is less work to train.

My apologies for the confusion. It was challenging for me to draw from examples that would communicate my meaning while also being understandable to a nontechnical audience. Perhaps I erred too far to the side of simplicity and failed to communicate the point. I’ll try to clarify.

In this example, the “sophistication” is the fact that the model author would have had to assume that these two variables are linearly correlated to begin with, an assumption that may not actually be true. You could imagine continuing to invest in adding “sophistication” to the model by engineering new features or assuming different underlying relationships between variables, but I argue that it often makes sense to instead invest in more data for the histogram.

Thanks, I appreciate your response, and agree about the difficulty of communicating the ideas you tried to convey here.

“simpler” and “sophisticated” are very subjective words to convey the point in a short writeup.

Irrespective, I understand your point that a model author may start with assumption of linear correlation, and that it may not be true. There may exist linear correlation between different set of features, and not between the set of features one started with.

If I am not wrong, isn’t this case best suited for iterative querying platforms/frameworks, as one has to start with some feature set assumption and validate models with different feature sets by iterative querying of the data set?

I agree that in the case of histograms, the non-parametric model is “simpler,” with fewer assumptions about the data than the parametric “linear regression” model. In this example, the histograms are in effect providing distributions of the data, but there are limitations to non-parametric models and to the scenarios where they are useful. Do you agree?

In general, I agree with the premise of the article, but I think calling a “histogram” a “simpler” model and pitting it against “linear regression” as a “complex” model is a little far-fetched.

Yes, all good points.

I agree that the implication of this is that minimizing the overhead in querying and iteration is a priority. In fact, my hope is to use this article to set up the argument that data-driven organizations should invest in the infrastructure and tooling to give data scientists access to more data, efficient processing of that data, and short iteration times between experiments and production. By doing so, model authors will be able to quickly validate hypotheses instead of making assumptions.

I also agree that non-parametric models have their limitations and scenarios where they are useful. And this example of histograms vs. linear regression, though somewhat contrived, was easy to depict visually.

Hi Garrett,

Thanks for this very interesting post. Great intro for a guy (like me) with a stats background but limited knowledge of the software side of big data.

I was wondering, did you actually obtain such a bizarre relationship between reading about football and reading about soccer? I’m trying to see what the underlying qualitative factors could be, but struggling (unless 20-30 corresponds to watching the Super Bowl, which many people do without having a particular interest in “team & ball” sports).

I was also wondering whether the histogram approach was as strong with multiple linear regression, because it gets pretty hard to plot 4+ dimensional histograms. The visual aspect of a 2D histogram is brilliant, as it captures so much more info than a single number (i.e. the regression coefficient), so how do you maintain that with higher level dimensions?

And finally, as a student I’m very curious how much of what I’m learning is useful in the business world out there! It would be awesome to learn more about your own career path… anywhere on the web I could find that?

I like the spirit of the article, as it points to some very important aspects of modelling. Some issues, though, deserve clarification.

-The data is incapable of speaking for itself. We (analysts), through the choices we make, the assumptions we make, and the methods we choose, ultimately speak for the data.

-The logic behind “more data beats better algorithms” is suspect. I’m afraid it is much more complex than simply saying more is better.

-Linearity (or lack thereof) has little to do with the size of the data.

Nice article. I have worked with a Japanese company whose top priority was the user experience with the UI on their website. Needless to say, there is no clear answer to what users really like. Steve Jobs (along with Forstall) showed that UI is really hard to predict, but it is a big winning factor.

We went with collecting all behavior, and we keep finding things that surprise us. E.g., users really love autofill but really hate when it fills in the wrong thing. We used that knowledge to design autofill with some intelligence in guessing the fills. E.g., the off-the-shelf market solution is to do text matching, but we had to tweak it to the product the company was making and thus return results even if the match is not exact. I suppose Google has that, and I hate Bing because it frequently guesses wrong. I think this example would have been more appropriate than sports; I never watch games. Another good example would have been retail; Amazon can probably shed some light on it. I do find their suggestions quite accurate even when I shop for an item that is a break from my history.

What you have called “Best Practice” would be considered blasphemy in statistics and science in general. The strategy “collect everything” could not be less scientific. You write: “Because you often don’t know what data is going to be relevant” as a justification for “collect everything.”

First, you are using one word to represent two very different things. Every type of statistical test makes “assumptions” about the type of data that is appropriate for that analysis. An example would be the homogeneity-of-variance assumption associated with ANOVA models. When comparing independent groups of people, the ANOVA procedure only produces reliable information on group differences when the groups have approximately the same variance, and there are a number of tests that will let the analyst know if there are significant differences between the groups’ variances. This is an assumption.

But you also use the same word when discussing model development – what dimensions are on the model. That is not an assumption – that is called a “hypothesis,” which should be based on knowledge gained in previous analyses by you and others. In science, the data you collect is based on what you would need to have to check the validity of your hypothesis. If you do not start with a hypothesis – that is called a “fishing expedition”- NOT Best Practices. More like “totally unscientific terrible practices.”

In science, one has a theory, and one makes hypotheses and collects relevant data as potential evidence for the correctness of the theory. There is no reason under the sun for any big data analyst to say that they do not know what data will be relevant. If you are working in context, and have been performing analyses in context, then you had better goddamn know, or at the very least have an educated guess regarding, what will be relevant.

Collecting everything and then searching everything for something relevant to your enterprise would be viewed as “post-hoc” analysis, which is never considered correct unless it has been replicated, or confirmed in another analysis with hypothesis emerging from your post-hoc fishing expedition.

Also, when you collect everything, you practically exponentially increase your chances of seeing an effect where there really isn’t one. The more tests you perform, naturally, the higher your chances of “false positives” become. In statistical terms, your collect-everything strategy inflates your Type I error rate to totally unacceptable levels. Your conclusions become highly suspect, and this is not good in any way for the enterprise.
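The inflation compounds quickly. Assuming m independent tests, each run at a significance level of α = 0.05, the chance of at least one spurious “effect” is 1 − 0.95^m:

```python
alpha = 0.05
for m in (1, 10, 100):
    # Probability that at least one of m independent tests fires spuriously.
    print(m, round(1 - (1 - alpha) ** m, 3))
# prints: 1 0.05 / 10 0.401 / 100 0.994
```

By a hundred uncorrected tests, a false positive is nearly guaranteed, which is why post-hoc findings need correction or replication.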

A histogram is a certain type of graph available to visualize your data. It is not comparable in any way to a linear (or non-linear) regression, which is a statistical test that elucidates how much of the variation in what you are trying to predict can be accounted for by the dimensions you are using to make your prediction, and it tells you whether your model is statistically significant. The procedure also estimates the “beta,” or weight, that each dimension is assigned in the regression equation that maximizes the amount of variance your model explains.

“Collect Everything” is not justified by “because we can.” It’s not justified at all. In five years, you are going to be able to collect a great deal more on your fishing expedition, which will worsen your predictions more and more. If you continue with collect everything, I would estimate that as capacity increases you will become more likely to find effects that are not real. At some point in your collect-everything strategy, the real effects that your enterprise needs will likely become extremely rare, because your chances of finding false effects will become the most likely outcome of your non-scientific, non-statistical, nonsense approach to informing your enterprise. And then? After the majority of your predictions fail to materialize, your enterprise will be better off without you.

“totally unscientific terrible practices.” – Love it. Thanks for laying down the truth on “Big Data”

A better title would be something like ‘How More Data Can Sometimes Outperform a Complex Model.’ Of course, you have not shown that, and naturally, professionals already know this. However, it would be a more realistic title.

The problem with Data Science is independence and sample bias. Statistics is a fickle game and sample bias is all but inevitable. Conducting experiments and formulating predictions based upon data with unknown, questionable (and often dubious) collection methods is an unreasonable responsibility. Example: let us assume your test example was collected from a large magazine publisher who owns X amount of magazines. For simplicity we will say the publisher only has two relevant sports titles: title S (soccer) = 100 samples and title M (Sports Illustrated) = 100,000 samples.

Now, because Sports Illustrated is listed as representing ALL major sports, we are unable to separate the motive of purchase. Does everyone who buys Sports Illustrated do so for the soccer coverage? (I doubt it.)

This introduces bias and false correlation along the regression line. Hence the importance of running an independence test, especially if applying a confidence interval.

It is the quality of data that is most important, not the size or the methods of analysis.

Please keep publishing these posts, they help tons.