Online commerce, mobile interactions, and the Internet of Things (IoT) are driving the need for enterprises not only to react in real time, but also to predict the most likely scenario in order to make those interactions meaningful. This has given rise to the growing use of machine learning to make predictions based on the vast streams of data handled by organizations each day.
Machine learning is still relatively new to many enterprises, so the team here at WSO2 thought, “How can we bring machine learning to life?” The answer was obvious: American football.
Few professional sports generate as much data as American football. The sport would allow us to explore algorithms and test our own technology based on open source machine-learning capabilities running on the Apache Spark project. And, of course, we could have some fun along the way.
So far our project, “WSO2 BigDataGame,” has proven quite successful, correctly predicting the outcomes for 75 percent of the playoff games that have taken place. Here’s a look at how we got there.
American football basically has three seasons: preseason, regular season, and playoffs. We quickly determined that using data from preseason games was not ideal, because teams use the preseason to find their best player combinations and to rest good players rather than to focus on winning.
Instead, we agreed that the regular-season data would be the most effective for building a machine-learning model that maps a set of features to a winning probability. That way, we could predict the winning probability of any team given its feature values.
Now there’s no lack of sites hosting data on football games, and we found pro-football-reference.com, which had the cumulative data on all the teams for many years.
If you have used that site, you know that it looks like a giant set of spreadsheets linked together. Getting the data into the format we needed required a bit of filtering, but eventually we had a nice comma-separated values (CSV) file with the 2012, 2013, and 2014 regular-season historical data. The information we collected covers a few key factors that affect the outcome of a game, such as strength in offense and defense, and efficiency of passing and rushing, among others.
Our machine-learning server comes with a few visualizations that helped us gain some insight into the collected data. Here is one such visualization, which groups a sample of the dataset into a given number of clusters and then plots them against two selected numerical features. The graph offers one insight: if the simple rating system (SRS) value for a team’s defense is high, the winning probability tends to be high as well (roughly above 0.53).
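A minimal sketch of that loading step, assuming a CSV with one row per team per season. The column names (`team`, `season`, `wins`, `losses`, `srs_defense`) and the two sample rows are placeholders invented for illustration, not the actual schema or figures we collected. It also assumes the response variable is simply the share of games won, ignoring ties.

```python
import csv
import io

# Placeholder data: column names and values are invented for illustration only.
SAMPLE_CSV = """team,season,wins,losses,srs_defense
Team A,2014,12,4,3.1
Team B,2014,5,11,-2.4
"""


def load_team_rows(csv_file):
    """Read team-season rows and attach a win-ratio response variable."""
    rows = []
    for record in csv.DictReader(csv_file):
        wins = int(record["wins"])
        losses = int(record["losses"])
        rows.append({
            "team": record["team"],
            "season": int(record["season"]),
            "srs_defense": float(record["srs_defense"]),
            # Assumed definition of the response variable: share of games won.
            "win_probability": wins / (wins + losses),
        })
    return rows


rows = load_team_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["team"], row["season"], round(row["win_probability"], 3))
```

In practice the `io.StringIO` object would be replaced by an open file handle on the real CSV.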
The next question was, “What do we use to predict future games?”
Finding the Right Algorithm
The dataset we gathered has 200 features, including the response variable – in other words, the feature we are going to predict. We picked “winning probability” as the response variable, since we are basically trying to model how each feature correlates to a team’s likelihood of winning a game.
The first question we asked ourselves was, “What type of a problem is this? Is it a classification, clustering, prediction of a value, or something else?” By the looks of it, our problem belonged to the “prediction of a value” category, also known as regression.
Next, we had to determine the right algorithm category. Options included supervised learning, unsupervised learning, and reinforcement learning. We knew the winning probabilities for each historical data entry that we gathered; hence this problem belonged to the supervised learning algorithm category.
At this point, we knew that we needed to find an algorithm capable of predicting a value as well as using supervised learning. There were four algorithms that could satisfy this requirement: linear regression, ridge regression, lasso regression, and random forest regression. Following is a flow chart we used for selecting an algorithm.
We picked linear regression and random forest regression for our analysis. Linear regression did not perform well for us in this scenario. However, random forest regression gave us good results. We further tuned the hyperparameters and compared the models until we achieved a considerably low mean squared error (MSE), which is the average of the squared differences between predicted and actual values; the lower the MSE, the better the regression model fits.
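For readers new to the metric, here is a small sketch of how MSE is computed. The function name and the sample numbers are ours for illustration; they are not results from the project.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Illustrative winning probabilities only, not project data.
actual = [0.75, 0.50, 0.25]
predicted = [0.70, 0.55, 0.25]

mse = mean_squared_error(actual, predicted)
print(mse)
```

Because the errors are squared, one large miss hurts the score far more than several small ones.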
The graphs above depict how we tuned the hyperparameters. The tuned values were 5 trees, 55 bins, and a depth of 30 for each tree. The generated model had an MSE of 0.000379, which is a very low error (MSE measures error, not accuracy, so lower is better). The figure below was provided by our machine-learning server in its model summary view:
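The tuning loop itself can be sketched as a plain grid search: try each combination of hyperparameters, score it, and keep the one with the lowest MSE. This is an assumption about the general approach rather than a description of what our machine-learning server does internally. The `stand_in_evaluate` function is an arbitrary placeholder for "train a model and measure its MSE on held-out data," so the combination it picks says nothing about football.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Return (best_params, best_mse) over every combination in the grid."""
    names = list(grid)
    best_params, best_mse = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        mse = evaluate(params)
        if mse < best_mse:
            best_params, best_mse = params, mse
    return best_params, best_mse


def stand_in_evaluate(params):
    # Arbitrary toy score; a real version would train and validate a model.
    return 0.001 * (
        abs(params["num_trees"] - 10)
        + abs(params["max_bins"] - 32)
        + abs(params["max_depth"] - 20)
    )


grid = {
    "num_trees": [5, 10, 20],
    "max_bins": [32, 55],
    "max_depth": [10, 20, 30],
}

best_params, best_mse = grid_search(grid, stand_in_evaluate)
print(best_params, best_mse)
```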
The residual plot above shows the prediction error of the test dataset plotted against a selected feature. We built this model just before the wild-card round of the NFL playoffs, and we wanted to test the model against 10 previous games. Of our 10 predictions, seven were correct, and two of the three incorrect predictions were very close to the 50 percent margin, as seen in the table below. So, we were comfortable with this model.
Next, our model correctly predicted the outcome of three out of four playoff games. However, we kept improving it by doing some feature engineering (excluding features, normalizing, etc.), and the updated model we came up with predicted all four playoff games correctly. At that point, we decided to refrain from changing the model further, since we were in the latter part of the playoffs leading up to the Super Bowl. The results and predictions for future games can be viewed – and tested – here.
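One way to turn two predicted winning probabilities into a game prediction is sketched below. It assumes the team with the higher predicted probability is picked, and that the two values are normalized against each other so that a share near 0.5 signals a close call. Both the function names and the normalization are our assumptions for illustration, and the team names and probabilities are placeholders.

```python
def predict_winner(home, home_prob, away, away_prob):
    """Pick the team with the higher predicted winning probability.

    Returns (winner, share), where share is the winner's probability
    normalized against the opponent's; a share near 0.5 is a close call.
    """
    total = home_prob + away_prob
    if total <= 0:
        raise ValueError("probabilities must sum to a positive number")
    if home_prob >= away_prob:
        return home, home_prob / total
    return away, away_prob / total


def accuracy(predicted, actual):
    """Fraction of games where the predicted winner matched the real one."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Placeholder teams and probabilities, for illustration only.
winner, share = predict_winner("Team A", 0.62, "Team B", 0.58)
print(winner, round(share, 3))

# Seven matching picks out of ten gives an accuracy of 0.7.
score = accuracy(["A"] * 7 + ["B"] * 3, ["A"] * 10)
print(score)
```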
Once we built our machine-learning model, we decided to use the REST API for our machine-learning server to get the predictions out to the WSO2 BigDataGame site. Basically, any user can go to the site, drag and drop two opposing teams into the center boxes, and the prediction API will retrieve the predicted results from the machine-learning server.
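As a rough sketch, a client call to such a prediction API could be built as follows. The base URL, the path layout, the model identifier, and the JSON payload shape are all hypothetical placeholders; consult the server's REST API documentation for the real endpoint and format. The example only builds the request and does not send it.

```python
import json
import urllib.request

# Hypothetical base URL; replace with the real server address.
BASE_URL = "https://ml.example.com/api"


def build_prediction_request(model_id, feature_rows):
    """Build a POST request carrying feature rows as a JSON array."""
    url = f"{BASE_URL}/models/{model_id}/predict"  # assumed path layout
    body = json.dumps(feature_rows).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder model name and feature values, for illustration only.
request = build_prediction_request("nfl-model", [[0.5, 1.2, -0.3]])
print(request.get_method(), request.full_url)

# Sending it would look like:
# with urllib.request.urlopen(request, timeout=10) as response:
#     predictions = json.load(response)
```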
Our actual machine-learning server is deployed on an Amazon EC2 instance, and the Java Virtual Machine (JVM) has been given a minimum of 256MB and a maximum of 4GB of memory. We did a load test on the prediction API to determine whether one instance would be enough to handle the expected load. The load test was carried out over a few hours with 1,000 concurrent users, and the single-instance deployment was able to withstand it without any issues.
The server has now been deployed for more than a week, and following are the load average and heap memory usage statistics collected by our machine-learning server.
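The shape of such a load test can be sketched with a thread pool, where each thread plays one concurrent user. This is not the tool we used; `call_prediction_api` is a stand-in that sleeps briefly instead of making a real HTTP call, and the user and request counts are scaled down so the sketch runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_prediction_api(user_id):
    """Stand-in for one HTTP call; returns the observed latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.001)  # placeholder for network and server time
    return time.perf_counter() - start


def run_load_test(concurrent_users, requests_per_user, call):
    """Simulate users in parallel threads and return every observed latency."""
    def one_user(user_id):
        return [call(user_id) for _ in range(requests_per_user)]

    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        per_user = list(pool.map(one_user, range(concurrent_users)))
    return [latency for latencies in per_user for latency in latencies]


latencies = run_load_test(
    concurrent_users=50, requests_per_user=4, call=call_prediction_api
)
print(len(latencies), "requests, slowest:", round(max(latencies), 4), "s")
```

A real test would swap the stand-in for an actual request and watch error rates as well as latency.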
So far, the machine-learning server has been running at a healthy level.
Now, this program isn’t perfect. It’s very much a hobbyist project at WSO2. It does, however, showcase how machine learning can be used to predict not only sports outcomes, but also business scenarios. This is significant in a world where mobile, Web, and IoT interactions demand immediate, intelligent responses. As a dynamic, adaptable solution for using past and present data to predict future scenarios, machine learning is an ideal tool for helping enterprises convert their vast amounts of data into actionable insights.
Nirmal Fernando is an associate technical lead at WSO2 and the team lead of WSO2 Machine Learner for predictive analytics. His areas of interest include cloud computing, elastic scaling and data science. Nirmal has contributed to the Apache Stratos PaaS framework project, where he was elected a Project Management Committee (PMC) member and committer.