Elo is a superior rating system to the ranking formulas used by the ATP and WTA. If you’ve spent much time reading this blog or listening to the podcast, you’ve probably heard me say that many times. But unless you’ve been exposed to Elo before, or done some research on your own, you might think of it as a sort of “magic” system. It’s worth digging in to understand better how it works.
The basic algorithm
The principle behind any Elo system is that each player’s rating is an estimate of their strength, and each match (or tournament) allows us to update that estimate. If a player wins, her rating goes up; if she loses, it goes down.
Where Elo excels is in determining the amount by which a rating should increase or decrease. There are two main variables that are taken into account: How many matches are already in the system (that is, how much confidence we have in the pre-match rating), and the quality of the opponent.
If you think about it for a moment, you’ll see that these two variables are a good approximation of how we already think about player strength. The more we already know about a player, the less we will change our opinion based on one match. Novak Djokovic’s round-robin loss to Dominic Thiem in London was a surprise, but only the most apocalyptic Djokovic fans saw the result as a disaster that should substantially change our estimate of his playing ability. Similarly, we adjust our opinion based on opponent quality. A loss to Thiem is disappointing, but a loss to, say, Marco Cecchinato is more concerning. The Elo system incorporates those natural intuitions.
Elo rating ranges
Traditionally, a player is given an Elo rating of 1500 when he enters the system–before any results come in. That number is completely arbitrary. All that matters is the difference between player ratings, so if we started each competitor with 0, 100, or 888, the end result of those differences would remain the same.
When I began calculating Elo ratings, I kept with tradition and started every player with 1500. Since then, I’ve expanded my view to Challengers (and the women’s ITF equivalent) and tour-level qualifying. If we started each new player at those levels with 1500 points, it re-scales the entire system, which would have been confusing. Instead, I replaced 1500 with a number in the low 1200s (it depends a bit on tournament level and gender) so that the ratings would remain approximately the same.
At the moment, the ATP and WTA top-ranked players are Rafael Nadal and Ashleigh Barty, at 2203 and 2123, respectively. The best players are often in this range, and the very best often approach 2500. According to the most recent version of my algorithm, Djokovic’s peak was 2470, and Serena Williams’s best was 2473.
The 2000-point mark is a good rule of thumb to separate the elites from the rest. At the moment, six men and seven women have ratings that high. 16 men and 18 women have Elo ratings of at least 1900, and a rating of 1800 is roughly equivalent to a place in the top 50.
Era comparisons and Elo inflation
Once we attach a single peak rating to every player, it’s only natural to start comparing across eras. While it’s always fun to do so, I’m not sure any rating system allows for useful cross-era comparisons in tennis. Elo doesn’t, either.
What you can do with Elo is compare how each player fared against her competition. In 1990, Helena Sukova achieved a rating of 2123–exactly the same as Barty’s today. That doesn’t mean that Sukova then was as good as Barty is now. But it does mean that their performance relative to their peers was similar. The second tier of players was considerably weaker thirty years ago, so in a sense it was easier to achieve such a rating. At the time, Sukova’s rating was only good for 11th place, far behind Steffi Graf’s 2600.
Thus, Elo doesn’t allow you rank players across eras unless you are confident that the level of competition was similar–or unless you have some other way of dealing with that issue, a minefield that many researchers have tried to cross, with little success.
A related issue is Elo inflation or deflation, which can also complicate cross-era comparisons. Every time a match is played, the winner and loser effectively “trade” some of their points, so the total number of Elo rating points in the system doesn’t change. However, every time a new player enters the system, the total number of points increases. And whenever a player retires, the total number of points decreases.
It would be nice if additions and subtractions canceled each other out, but for many competitions that use Elo, they don’t. Additions tend to outweigh subtractions, so Elo ratings increase over time. That doesn’t appear to be the case with my tennis ratings, at least in part because of the penalty I’ve introduced for injury absences, but it does serve as a reminder that the number of points in the system changes over time, for reasons unrelated to the strength of the top players. (I’ll have more to say about the absence penalty below.)
Elo gives us a rating for every player, and we’re getting a sense of what we can and can’t do with them.
One of the main purposes of any rating system is to predict the outcome of matches–something that Elo does better than most others, including the ATP and WTA rankings. The only input necessary to make a prediction is the difference between two players’ ratings, which you can then plug into the following formula:
1 – (1 / (1 + (10^((difference) / 400))))
If we wanted to forecast a rematch of the last match of the Davis Cup Finals, we would take the Elo ratings of Nadal and Denis Shapovalov (2203 and 1947), find the difference (256), and plug it into the formula, for a result of 81.4%, Nadal’s chance of winning. If we used the negative difference (-256), we’d get 18.6%, Shapovalov’s odds of scoring the upset.
My version of tennis Elo is based on the most common match format, best-of-three matches. In a best-of-five match, the favorite has a better chance of winning. The math for converting best-of-three to best-of-five is a bit complicated, but for those interested, I’ve posted some code. The point is that an adjustment must be made. If the Nadal-Shapovalov rematch happens at the best-of-five Australian Open, Rafa’s 81.4% edge will increase to 86.7%.
Adjusting Elo for surface
For most sports, we could stop here. A match is a match, with only minor variations. In tennis, though, ratings and predictions should vary quite a bit based on surface.
My solution is a bit complicated. For each player, I maintain four separate Elo ratings: overall, hard court only, clay court only, and grass court only. I don’t differentiate between outdoor and indoor hard. For instance, Thiem’s ratings are 2066 overall, 1942 on hard, 2031 on clay, and 1602 on grass. (Surface ratings tend to be lower: Thiem’s clay rating is third-best on tour, miles ahead of everyone except for Nadal and Djokovic.)
These single-surface ratings tell us how we would rank players if we simply threw away results on every other surface. That’s not realistic, though. Single-surface ratings aren’t great at predicting match results. A better solution is to take a 50/50 blend of single-surface and overall ratings. If we wanted to predict Thiem’s chances in a clay-court match, we’d use a half-and-half mix of his 2066 overall rating and his 2031 clay-court rating. My weekly Elo reports show the single-surface ratings as “HardRaw” (and so on), and the blended ratings as “hElo,” “cElo,” and “gElo.”
There is no natural law that dictates a 50/50 blend. Every adjustment I’ve made to the basic Elo algorithm is determined solely by what works. (More on that below.) Initially, I suspected that a blend between single-surface and overall ratings would be appropriate, because a player’s success on one surface has some correlation with his success on others. I expected the blend to be different for each surface–perhaps using a higher percentage of the overall rating for grass, because there are fewer matches on the surface. In the end, my testing showed that 50/50 worked for each surface.
Ask some tennis fans which tournaments matches matter more–for rankings, for GOAT debates, whatever–and you can find yourself with a long, detailed list of what factors determine greatness. Maybe slams are more important than masters and premiers, though those are less important than tour finals and the Olympics, and of course finals are key, plus head-to-heads against certain players… you get the idea.
Elo provides for such adjustments. A coefficient usually referred to as the “k factor” allows us to give greater weight to certain matches. It’s common in Elo ratings for other sports, for example by using a higher k factor for postseason than regular season games. However, I’ve tested all sorts of different k factors for the likely types of “important” matches, and I’ve yet to find a tweak to the system that consistently improves its ability to predict match outcomes.
The absence penalty
There’s one exception. When players miss substantial amounts of time, I reduce their rating, and then increase the k factor for several matches after their return. I’ve explained more of the details in a previous post.
These steps are a logical extension of the Elo framework, especially when you consider our usual mental adjustments when a player misses time. If a player is injured for a few months, we never know quite what to expect when she returns. Maybe she’s as strong as ever; maybe she’s still a step slow. Perhaps she’ll return to normal quickly; she might never fully return to form. An extended absence raises a lot of questions. An injury player rarely returns in better form than when she left, while many players are worse upon return, giving us an average post-injury performance level that is worse than before the absence.
Therefore, when a player first returns, our estimate must be that she is worse. However, a few strong early results should be weighted more heavily–hence the higher k factor. The k factor reflects the fact that, immediately after an absence, we aren’t as confident as usual in our estimate.
The algorithm gets complicated, but the logic is simple. It’s basically just an attempt to work out a rigorous version of statements like, “I don’t know how well he’ll play when he comes back, but I’ll be watching closely.”
One side benefit of the absence penalty is that it counteracts Elo’s natural tendency toward ratings inflation. While more players enter the system than leave it, adding to the total number of available points, the penalty removes some points without re-allocating them to other players.
Validating Elo and adjustments
I’ve mentioned “testing” a few times, and I started this article with a claim that Elo is superior to the official ranking systems. What does that mean, and how do we know?
The simplest way to compare rating systems is a metric called “accuracy,” which counts correct predictions. There were 50 singles matches at the Davis Cup finals, and Elo picked the winner in 36 of them, for an accuracy rating of 72%. The ATP rankings picked the winner (in the sense that the higher-ranked player won the match) in 30 of them, for an accuracy rating of 60%. In this tiny experiment, Elo trounced the official rankings. Elo is also considerably better over the course of the entire season.
A better metric for this purpose is Brier score, which takes into account the confidence of each forecast. We saw earlier that Elo gives Nadal an 81.4% chance of beating Shapovalov. If Nadal ends up winning, 81.4% is a “better” forecast than, say, 65%, but it’s a “worse” forecast than 90%. Brier score takes the squared distance between the forecast (81.4%) and the result (0% or 100%, depending on the winner), and averages those numbers for all forecasted matches. It rewards aggressive forecasts that prove correct, but because it uses squared distance, it severely punishes predictions that are aggressive but wrong.
A more intuitive way to think about what Brier score is getting at is to imagine that Nadal and Shapovalov play 100 matches in a row. (Or, more accurately but less intuitively, imagine that 100 identical Nadals play simultaneous matches against 100 identical Shapovalovs.) A forecast of 81.4% means that we would expect Nadal to win 81 or 82 or those matches. If Nadal ends up winning 90, the forecast wasn’t Rafa-friendly enough. We’ll never get 100 simultaneous matches like this, but we do have thousands of individual matches, many of which share the same predictions, like a 60% chance of the favorite winning. Brier score aggregates all of those prediction-and-result pairs and spits out a number to tell us how we’re doing.
It’s tough to forecast the result of individual tennis matches. Any system, no matter how sophisticated, is going to be wrong an awful lot of the time. In many cases, the “correct” forecast is barely better than no forecast at all, if the evidence suggests that the competitors are equally matched. Thus, “accuracy” is of limited use–it’s more important to have the right amount of confidence than to simply pick winners.
All of this is to say: My Elo ratings have a much lower (better) Brier score than predictions derived from ATP and WTA rankings. Elo forecasts aren’t quite as good as betting odds, or else I’d be spending more time wagering and less time writing about rating systems.
Brier score is also the measure that tells us whether a certain adjustment–such as surface blends, injury absences, or tournament type–constitutes an improvement to the system. Assessing an injury penalty lowers the Brier score of the overall set of Elo forecasts, so we keep it. Decreasing the k factor for first-round matches has no effect, so we skip it.
Extending Elo to doubles
… and mixed doubles
Code for tennis Elo (in R, not written by me)
A good introduction to Brier score