The first recorded argument about the superiority of football teams probably occurred ten minutes after the discovery of pigskin. Before the current college playoff system was created, these discussions were largely perfunctory.
But now there is more at stake than ever, because admittance to the playoffs is by invitation only, and the bowl selection committee calls the shots – their deliberations are essentially just extensions of these arguments.
The question is: Did the committee get it right? Which schools were left out? We can apply machine learning, remove the bias, and address these questions.
Ranking the teams using Win/Loss
Consider the case of Michigan State and Alabama. The teams are set to play in the Cotton Bowl tonight and the winner will play for the national championship. There is a tremendous amount of discussion around which team is better.
The teams are in different conferences and did not play each other during the regular season. Each had one loss and they shared no common opponents. The figure below shows highlights from their 2015 seasons (the direction of the arrow signifies a win if it points to the team, a loss otherwise):
The case for Alabama: good wins (beating another ranked team) against LSU and Florida, but a loss to Ole Miss.
The case for Michigan State: good wins against Ohio St., Michigan and Iowa, but a loss against Nebraska.
So far, the teams seem about evenly matched. If you go one level deeper, you see that Alabama beat Wisconsin, and Wisconsin beat Nebraska (who beat Michigan State), but lost to Iowa (who lost to Michigan State). This looks like important information but doesn’t seem to give either team the edge. Further, as you examine more and more links among teams, it becomes difficult for the human brain to process the information.
What is needed is a method that can simultaneously examine the entire schedule (of which the diagram is a very small piece) and assign ranks based on each team’s entire win/loss record.
The PageRank Algorithm
To evaluate the quality of the committee’s ranks, a famous machine-learning algorithm was applied. PageRank is the name of the method used in the early days of Google to rank Internet search results.
Google doesn’t use PageRank anymore, but there is no shortage of on-line documentation on this algorithm. The Wikipedia page has the basic details, and there are numerous applications to business and science scenarios. Most machine learning packages, such as Apache Spark, have implementations of PageRank.
The essence of this algorithm is as follows:
- Start with every team with an equal rank
- Assign “strengths” as links from losing teams to winning teams, so each win passes credit to the winner
- Adjust the ranks based on these strengths
- Keep adjusting the ranks until they change a very small amount
The interpretation of the ranks is that the “good” teams are those that beat other “good” teams. Losses against “good” teams (and wins against bad teams) don’t significantly affect the ranks. In this manner, the method naturally learns the quality of the conferences (SEC, MAC, etc.).
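The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the analysis: the team names A through D and their results are made up, the damping factor of 0.85 and the convergence tolerance are assumed values, and each loss is treated as a link from the loser to the winner so that credit flows toward teams that win.

```python
def pagerank(losses, damping=0.85, tol=1e-12, max_iter=1000):
    """Rank teams from a dict mapping each loser to the list of teams that beat it.

    damping and tol are assumed values; the article does not state them.
    """
    teams = sorted(set(losses) | {w for ws in losses.values() for w in ws})
    n = len(teams)
    # Step 1: start with every team at an equal rank.
    rank = {team: 1.0 / n for team in teams}

    for _ in range(max_iter):
        # Undefeated teams have no outgoing links; spread their rank evenly.
        dangling = sum(rank[team] for team in teams if not losses.get(team))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {team: base for team in teams}

        # Steps 2 and 3: each loser splits its rank among the teams that beat it.
        for loser, winners in losses.items():
            for winner in winners:
                new_rank[winner] += damping * rank[loser] / len(winners)

        # Step 4: stop once the ranks change by only a very small amount.
        change = sum(abs(new_rank[team] - rank[team]) for team in teams)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical results as (winner, loser) pairs; A is undefeated, D is winless.
games = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")]
losses = {}
for winner, loser in games:
    losses.setdefault(loser, []).append(winner)

ranks = pagerank(losses)
ordered = sorted(ranks, key=ranks.get, reverse=True)
print(ordered)
```

In this toy schedule the undefeated team ends up first and the winless team last, and the ranks sum to 1, so they can be read as each team's share of the total credit.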
All of the data and code required to reproduce this analysis is located in this repository.
The scores for this season were downloaded from a sports website. A script was written to transform the data into a matrix comprising the win-loss signals. The PageRanks were created in Octave with the Power Method and are displayed in the table below:
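The transformation step might look something like the Python sketch below. The original analysis used Octave, and the layout of the downloaded scores is not described here, so the input format (one tuple of two team names and two scores per game), the function name `win_loss_matrix`, and the sample games are all assumptions made for illustration.

```python
def win_loss_matrix(games):
    """Build a win-loss matrix from a list of (team_a, score_a, team_b, score_b).

    Entry [i][j] counts how many times team i beat team j.
    """
    teams = sorted({team for a, _, b, _ in games for team in (a, b)})
    index = {team: i for i, team in enumerate(teams)}
    n = len(teams)
    matrix = [[0] * n for _ in range(n)]

    for team_a, score_a, team_b, score_b in games:
        if score_a == score_b:
            continue  # a tie carries no win-loss signal in this sketch
        if score_a > score_b:
            winner, loser = team_a, team_b
        else:
            winner, loser = team_b, team_a
        matrix[index[winner]][index[loser]] += 1
    return teams, matrix


# Hypothetical scores, not taken from the 2015 season.
sample_games = [
    ("North", 31, "East", 17),
    ("West", 10, "North", 24),
    ("East", 21, "West", 20),
]
teams, matrix = win_loss_matrix(sample_games)
for team, row in zip(teams, matrix):
    print(team, row)
```

Only the winner and loser of each game are recorded; the margin of victory is discarded, which matches the win-loss signal described above.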
As the only major undefeated team, Clemson is a consensus #1 choice. At #2 and #3, Michigan State and Alabama are reversed from the committee’s rankings; this would not have affected their matchup in the semifinal game. However, the method places Stanford in the #4 slot instead of Oklahoma. This should be upsetting to the Cardinal’s fan base, since only the top 4 teams make the playoff and are eligible to play for the national championship. Looks like a huge miss by the committee!
There were some major differences between the two rankings:
- The Houston Cougars jumped up from 18 to 6 – most likely the committee punished them for playing in a weaker conference.
- TCU and Michigan were ranked 10 positions lower by PageRank.
- Other teams such as Bowling Green State, SDSU, and Appalachian State made the top 25 ranks but failed to show up in the committee’s rankings. These were teams with good records that played in second-tier conferences and couldn’t generate enough momentum to climb the committee’s rankings.
Bowl Game Analysis
The ranks can be applied to the bowl games to find potential upsets. Here are a few highlights in which the committee and PageRank have the teams ordered differently:
The algorithm produces ranks from which predictions (such as those above) can be generated and evaluated. Correctly identifying upsets will build the case for applying this method to sporting events. One thing is for sure: any remaining uncertainty about which team is better should be resolved by January 11th.
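One way to pick out the games where the two rankings disagree is sketched below in Python. The function name `disputed_matchups`, the team names, and the rank values are hypothetical; a team missing from a ranking is assumed to sit below every ranked team.

```python
UNRANKED = float("inf")  # assumed placeholder for teams outside a ranking


def disputed_matchups(matchups, committee, pagerank):
    """Return games where the two rankings favor different teams.

    Both rankings map team name to position, where 1 is best.
    """
    disputed = []
    for team_a, team_b in matchups:
        c_a = committee.get(team_a, UNRANKED)
        c_b = committee.get(team_b, UNRANKED)
        p_a = pagerank.get(team_a, UNRANKED)
        p_b = pagerank.get(team_b, UNRANKED)
        if c_a == c_b or p_a == p_b:
            continue  # no clear favorite in at least one ranking
        committee_pick = team_a if c_a < c_b else team_b
        pagerank_pick = team_a if p_a < p_b else team_b
        if committee_pick != pagerank_pick:
            disputed.append((team_a, team_b, committee_pick, pagerank_pick))
    return disputed


# Hypothetical rankings and bowl pairings for illustration only.
committee = {"Red": 3, "Blue": 7, "Green": 12}
pagerank_positions = {"Red": 5, "Blue": 2, "Green": 9, "Gold": 20}
bowls = [("Red", "Blue"), ("Green", "Gold")]

upset_candidates = disputed_matchups(bowls, committee, pagerank_positions)
print(upset_candidates)
```

Each returned tuple names the two teams, then the committee's favorite, then the PageRank favorite, so the actual result can later be scored against both picks.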
Joseph Blue is a Data Scientist at MapR. Joe assists customers in solving their big data problems, making efficient use of the Hadoop ecosystem to generate tangible results. Recent projects include debit card fraud and breach detection, lead generation from social data, customer matching through record linkage, lookalike modeling using browser history, and real-time product recommendations.
Prior to MapR, Joe was the Chief Scientist for Optum (a division of UnitedHealth) and the principal innovator in analytics for healthcare. As a Senior Fellow with OptumLabs, he applied machine learning concepts to healthcare issues such as disease prediction from co-morbidities, estimation of PMPY (member cost), physician scoring, and treatment pathways. As a leader in the payment integrity business, he built anomaly detection engines responsible for saving $100 million annually in claim overpayments.