Mixed Martial Arts (MMA) is an incredibly entertaining and technical sport to watch. It’s become one of the fastest growing sports in the world. I’ve been following MMA organizations like the Ultimate Fighting Championship (UFC) for almost eight years now, and in that time have developed a great appreciation for MMA techniques. After watching dozens of fights, you begin to pick up on what moves win and when, and spot strengths and weaknesses in certain fighters. However, I’ve always wanted to test my knowledge against the actual stats – like do accomplished wrestlers really beat fighters with little wrestling experience?
To do this, we need fight data, so I crawled and parsed all the MMA fights from Sherdog.com. This data includes fighter profiles (birth date, weight, height, disciplines, training camp, location) and fight records (challenger, opponent, time, round, outcome, event). After some basic data cleaning, I had a dataset of 11,886 fight records, 1,390 of which correspond to the UFC.
I then trained a random forest classifier on this data to see whether a state-of-the-art machine learning model could identify any winning and losing characteristics. Over 10-fold cross-validation, the resulting model scored a surprisingly decent AUC of 0.69; an AUC closer to 0.5 would indicate that the model can’t predict winners any better than fair coin flips.
So there may be interesting patterns in this data … Feeling motivated, I ran exhaustive searches over the data to find feature combinations that indicate winning or losing behaviors. Many hours later, the searches had turned up several dozen such insights.
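To make the AUC number concrete, here is a minimal pure-Python sketch of what the metric measures: the probability that a randomly chosen winner receives a higher model score than a randomly chosen loser. The function name `auc` and the toy labels and scores are made up for illustration; this is not the code or data behind the 0.69 figure.

```python
from itertools import product


def auc(labels, scores):
    """Chance that a random positive (label 1) outscores a random negative (label 0).

    Ties count as half a win, which is why constant scores give exactly 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = fighter won, 0 = fighter lost.
labels = [1, 1, 1, 0, 0, 0]
perfect = auc(labels, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])   # every winner ranked above every loser
uninformative = auc(labels, [0.5] * 6)                   # model says nothing useful
partial = auc(labels, [0.9, 0.4, 0.7, 0.6, 0.2, 0.1])    # one winner ranked below one loser
```

A model that ranks winners and losers perfectly scores 1.0, one that gives every fight the same score lands on 0.5, and a model with some signal falls in between.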
Here are the most interesting ones (stars indicate statistical significance at the 5% level):
Top UFC Insights
Fighters older than 32 years of age will more likely lose
Fighters with more than 6 TKO victories fighting opponents older than 32 years of age will more likely win
Fighters from Japan will more likely lose
Fighters who have lost 2 or more KOs will more likely lose
Fighters who have 3x or more decision wins and are more than 3% taller than their opponents will more likely win
Fighters who have won 3x or more decisions than their opponent will more likely win
Fighters with no wrestling background vs fighters who do have one will more likely lose
Fighters fighting opponents with 3x or less decision wins who are on a 6 fight (or better) winning streak will more likely win
Fighters younger than their opponents by 3 or more years in age will more likely win
Fighters who haven’t fought in more than 210 days will more likely lose
Fighters taller than their opponents by 3% will more likely win
Fighters who have lost fewer fights by submission than their opponents will more likely win
Fighters who have lost 6 or more fights will more likely lose
Fighters who have 18 or more wins and never had a 2 fight losing streak will more likely win
Fighters who have lost back to back fights will more likely lose
Fighters with 0 TKO victories will more likely lose
Fighters fighting opponents out of Greg Jackson’s camp will more likely lose
Top Insights over All Fights
Fighters with 15 or more wins that have 50% fewer losses than their opponents will more likely win
This was validated in 239 out of 307 (78%) fights*
Fighters fighting American opponents will more likely win
Fighters with 2x more (or better) wins than their opponents and those opponents lost their last fights will more likely win
Fighters who’ve lost their last 4 fights in a row will more likely lose
Fighters currently on a 5 fight (or better) winning streak will more likely win
Fighters with 3x or more wins than their opponents will more likely win
Fighters who have lost 7 or more times will more likely lose
Fighters with no jiu jitsu in their background versus fighters who do have it will more likely lose
Fighters who have lost by submission 5 or more times will more likely lose
Fighters in the Middleweight division who fought their last fight more recently will more likely win
Fighters in the Lightweight division fighting 6 foot tall fighters (or taller) will more likely win
Note – I separated UFC fights from all fights because regulations and rules can vary across MMA organizations.
Most of these insights are intuitive, except maybe the last one and an earlier one, which states that 77% of the time fighters beat opponents who are on 6 fight (or better) winning streaks but have 3x fewer decision wins.
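As a rough sketch of what "statistically significant at the 5% level" can mean for a rule like these, the snippet below runs a one-sided binomial test against a coin-flip baseline. The 50% null hypothesis, the assumption that fights are independent, and the function name `binomial_p_value` are my assumptions for illustration; the exact test used for the stars is not described here.

```python
from math import comb


def binomial_p_value(wins, fights, p_null=0.5):
    """One-sided p-value: P(X >= wins) when X ~ Binomial(fights, p_null)."""
    if not 0 <= wins <= fights:
        raise ValueError("wins must be between 0 and fights")
    return sum(
        comb(fights, k) * p_null**k * (1 - p_null) ** (fights - k)
        for k in range(wins, fights + 1)
    )


# The rule validated in 239 out of 307 fights: far below the 0.05 threshold.
p_large_sample = binomial_p_value(239, 307)

# A hypothetical rule that holds in 6 of 10 fights: a 60% hit rate,
# but on so few fights that it is easily explained by chance.
p_small_sample = binomial_p_value(6, 10)

significant = p_large_sample < 0.05
```

The second case shows why sample size matters: a rule can look good on a handful of fights and still fail the test.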
Many of these insights demonstrate statistically significant winning biases. I couldn’t help but wonder – could we use these insights to effectively bet on UFC fights? For the sake of simplicity, what happens if we make bets based on just the very first insight which states that fighters older than 32 years old will more likely lose (with a 62% chance)?
To evaluate this betting rule, I pulled the most recent UFC fights where in each fight there’s a fighter that’s at least 33 years old. I found 52 such fights, spanning 2/5/2011 – 8/14/2011. I placed a $10K bet on the younger fighter in each of these fights.
Surprisingly, this rule calls 33 of these 52 fights correctly (63% – very close to the rule’s observed 62% overall win rate). Each fight called incorrectly results in a loss of $10,000, and for each of the fights called correctly I obtained the corresponding Bodog money line (betting odds) to compute the actual winning amount.
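The payout arithmetic can be sketched as follows, assuming American-style money lines: a positive line is the profit on a $100 stake, and a negative line is the stake needed to win $100. The function names and the three example lines are hypothetical and are not taken from the spreadsheet.

```python
def moneyline_profit(stake, moneyline):
    """Profit on a winning bet at an American money line (e.g. -200 or +150)."""
    if moneyline == 0:
        raise ValueError("a money line of 0 is not valid")
    if moneyline > 0:
        return stake * moneyline / 100   # +150: a $100 stake wins $150
    return stake * 100 / -moneyline      # -200: a $200 stake wins $100


def total_return(stake, bets):
    """Sum profit over (called_correctly, moneyline) pairs; a wrong call loses the stake.

    Returns (net profit, return on the total amount staked).
    """
    profit = 0.0
    count = 0
    for called_correctly, moneyline in bets:
        profit += moneyline_profit(stake, moneyline) if called_correctly else -stake
        count += 1
    return profit, profit / (stake * count)


# Made-up example: two correct calls and one incorrect call at $10K each.
example_bets = [(True, -200), (True, 150), (False, -120)]
net_profit, roi = total_return(10_000, example_bets)
```

Note that a correct call on a heavy favorite pays much less than the stake, so the hit rate alone does not determine whether a rule is profitable.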
I’ve compiled the betting data for these fights in this Google spreadsheet.
Note, for 6 of the fights that our rule called correctly, the money lines favored the losing fighters.
Let’s compute the overall return of our simple betting rule:
That’s a very decent return.
For kicks, let’s compare this to investing in the stock market over the same period of time. If we buy the S&P 500 with a conventional dollar cost averaging strategy to spread out the $520,000 investment, then we get an ROI of -7.31%. Ouch.
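For reference, dollar cost averaging can be sketched like this: split the total into equal installments, buy at each date's price, and value all the shares at the final price. The function name `dca_roi` and the prices below are invented for illustration; the actual buying schedule and index prices behind the -7.31% figure are not reproduced here.

```python
def dca_roi(buy_prices, final_price, total_investment):
    """ROI of investing equal installments at each price in buy_prices."""
    if not buy_prices:
        raise ValueError("need at least one purchase")
    installment = total_investment / len(buy_prices)
    shares = sum(installment / price for price in buy_prices)
    final_value = shares * final_price
    return (final_value - total_investment) / total_investment


# Hypothetical prices: three equal purchases, then a valuation at 100.
example_roi = dca_roi([100, 125, 80], final_price=100, total_investment=300)

# Hypothetical falling market: every purchase ends up 10% under water.
falling_roi = dca_roi([100, 100], final_price=90, total_investment=300)
```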
Keep in mind that we’re using a simple betting rule that’s based on a single insight. The random forest model, which optimizes over many insights, should predict better and be applicable to more fights.
Please note that I’m just poking fun at stocks – I’m not saying betting on UFC fights with this rule is a more sound investment strategy (risk should be thoroughly examined – the variance of the performance of the rule should be evaluated over many periods of time).
The main goal here is to demonstrate the effectiveness of data driven approaches for better understanding the patterns in a sport like MMA. The UFC could leverage these data mining approaches for coming up with fairer matches (dismiss fights that match obvious winning and losing biases). I don’t favor this, but given many fans want to see knockouts, the UFC could even use these approaches to design fights that will likely avoid decisions or submissions.
Anyways, there’s so much more analysis I’ve done (and haven’t done) over this data. Will post more results when cycles permit. Stay tuned.
26 thoughts on “Betting on UFC Fights – A Statistical Data Analysis”
Great article! MMA is such a complex sport. This is some great data. Regarding the following statistic, what did you refer to as the time frame for “more recently”?
“Fighters in the Middleweight division who fought their last fight more recently will more likely win.”
Thanks! “More recently” corresponds to a fighter’s last fight taking place more recently than his opponent’s.
Really interesting article. I enjoy few things more than a good statistical breakdown. The “Fighters older than 32 years old are 62% likely to lose” is intriguing. I wonder if 32 years of age is more or less a definitive age where skills begin to diminish? And if so, how much does each additional year of age increase a fighter’s chance of losing? Thanks again. Great job.
Wow, that’s quite an ambitious project you took on. That would make an interesting book if you expanded on it. I wonder if investors/traders would be as interested in this information as MMA fans?
Interesting stats. I have a question about the Japanese fighters stat, ‘Fighters from Japan are more likely to lose’. How much does that overlap with other stats (ie, height, age, lack of wrestling background, etc)?
My own theory is that it comes down mostly to the wrestling background of American fighters, though I am hardly in any position to claim an expert opinion.
Interesting work. Noticed that for some of the rules the sample sizes are very small (some are less than 100 fights) – I would think that would affect the accuracy of the rules.
Definitely – the more fights the better. Some of those rules lack a star, which means they are not statistically significant. However, most of the insights (especially under All Fights) have better coverage (400+ fights).
Great work! Can you run the history of fighters winning if they are favored -400? or how about -300, -200, etc.?
Wow. You’re the man, Vik. One note you should make is that of the 104 wins by fighters 32 or older, 11 were by Anderson Silva (turned 32 on April 14, 2007). I’m not a statistician, so I don’t know if there’s a term for an anomaly such as this. But if you don’t count his fights, the percentage moves from 62% to 65%. And in gambling, every few points helps.
Thanks for the kind words! Yes, it would be very interesting to see the stats based on the number of unique fighters (versus the number of fights, as it is right now, which counts the same fighter multiple times).
Your work is really amazing. I will also try to read all your other posts. Topics like AUC and the random forest classifier really motivated me to do some research on statistics. Statistics is also the topic I have in math right now.
Sabermetrics – have you read Moneyball? I think you would appreciate that book (based on the theme above)
It’s an interesting project. A couple of years ago I did the same thing with Sherdog, trained a neural network to make predictions and used them to bet on the fights. The problem I had was that every time I wanted to place a bet I had to re-crawl the site to update the fighter’s records and it was a very time consuming process. Could you say a little about how you achieved the crawl? Did you custom-build a crawler or use some off-the-shelf software? And did you search all possible fighter numbers or event numbers by generating the urls and sending off an HTTP request or is there a more sophisticated way of doing it?
Hey Will – That’s cool. How well did your model perform? I wrote a simple crawler / parser in Python. I didn’t really optimize the crawler but it is multi-processed so it runs in a reasonable amount of time (crawls all the fighter profiles in a few hours – not great but definitely doable for betting on weekend fights). The crawler goes through the organization / event links (which exist on a single Sherdog webpage) to get to the fighters. Also, if you crawl just UFC fights, it should finish much faster.
Evaluation was really tricky and I’m not sure I cracked it with a meaningful metric.
My strategy was to compare the odds as generated by the model with the implied odds coming from the initial lines offered by the bookies, and to compare both of these to the actual outcome of the match. The impediment to this approach was that bookies build a margin in their odds: i.e. P(A wins) + P(B wins) > 1. This makes determining the underlying odds as set by the bookies difficult. All one can say with certainty is that P(A wins) <= 1 – P(B wins). To outperform the odds makers in the long run one has to "out-predict" them by at least the (unknown) margins they place on their calculated odds.
I evaluated the network by plotting a function to show the model's P(A wins) against the bookie's odds on A winning, then by dividing the plot into buckets and finding the % of correct predictions in each one. This gave me an interval over which my model performed well. Typically this implied that I could bet on about 1 in every 3 fights and expect a long run profit.
I used a strategy of re-investing my pot each UFC, spreading it between as many matches as fell within my confidence interval. Long term growth of the pot was about 3% per event, but with a whopping variance. What killed the project in the end was the relative infrequency of the events compared with the low rate of return and the effort involved in re-crawling every month. I suspect that the betting strategy, as much as the model, was responsible for my inability to make serious gains. That said, reading your blog has inspired me to try again, if only I could write a reasonable crawler (I'm not a programmer by trade!)
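The bookmaker margin discussed in this comment can be illustrated with a short sketch that converts American money lines to implied probabilities and measures how far the pair sums above 1. The function names and the example lines are hypothetical, and `normalized` spreads the margin proportionally across both fighters, which is only one possible assumption since the true split is unknown.

```python
def implied_probability(moneyline):
    """Win probability implied by an American money line, margin included."""
    if moneyline == 0:
        raise ValueError("a money line of 0 is not valid")
    if moneyline > 0:
        return 100 / (moneyline + 100)
    return -moneyline / (-moneyline + 100)


def overround(line_a, line_b):
    """Bookmaker margin: how much the two implied probabilities exceed 1."""
    return implied_probability(line_a) + implied_probability(line_b) - 1


def normalized(line_a, line_b):
    """ASSUMPTION: remove the margin proportionally so the pair sums to 1."""
    p_a, p_b = implied_probability(line_a), implied_probability(line_b)
    return p_a / (p_a + p_b), p_b / (p_a + p_b)


# Made-up lines: favorite at -200, underdog at +170.
margin = overround(-200, 170)
fair_a, fair_b = normalized(-200, 170)
```

A model only has an edge on a fight if its own probability beats the implied probability for that fighter, margin included.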
Wow, what a cool project. I’m just getting my feet wet in this kind of data analysis. What kind of program(s) would you suggest using for this kind of random forest analysis and data management?
When you say, “Fighters fighting opponents with 3x or less decision wins… ” do you mean “3 or less decision wins” or that the opponent has a third or less of the decisions wins of the given fighter?
The latter (by 3x I mean 3 times as many).
Thanks. I used each of the UFC insights as rules, each with a weight proportional to their accuracy, to see if I could predict the winner of a given match-up. It seemed to be working pretty well; it predicts that Anderson Silva would beat Sonnen, and that Lorenz Larkin would be a better match-up to Silva. But then, it also predicts that Gina Carano would beat Silva. Might be a while until we see that happen.
Um… just read your bio. Yeah, you’re probably busy frying bigger fish. Anyhooo… awesome post.
Great article. I’m currently starting my studies in Machine Learning and was wondering if you had the code and the database available on GitHub somewhere? My main questions are: 1) Would a neural network be better today? 2) What features did you use? I would assume that you’d have to create your own, since the databases that were out there aren’t in any shape to just run against a standard classifier. 3) Did you use supervised learning, unsupervised learning, or both?
Again, great article and wish I would’ve seen this back when you originally posted it.