Measuring the Performance of Tennis Prediction Models

With the recent buzz about Elo rankings in tennis, both at FiveThirtyEight and here at Tennis Abstract, comes the ability to forecast the results of tennis matches. It’s natural to ask which of these models performs better and, even more interesting, how they fare compared to other ‘models’, such as the ATP ranking system or betting markets.

For this, admittedly limited, investigation, we collected the (implied) forecasts of five models, that is, FiveThirtyEight, Tennis Abstract, Riles, the official ATP rankings, and the Pinnacle betting market for the US Open 2016. The first three models are based on Elo. For inferring forecasts from the ATP ranking, we use a specific formula (see note 1), and for Pinnacle, which is one of the biggest tennis bookmakers, we calculate the implied probabilities based on the provided odds, minus the overround (see note 2).
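As a rough illustration of those two conversions, here is a minimal Python sketch. The ranking formula and e = 0.85 come from note 1. The way the overround is removed (dividing each inverse odd by their sum) is my assumption, since the post does not say which method was used, and the odds and ranking points in the example are hypothetical.

```python
def implied_probabilities(odds_a, odds_b):
    """Turn decimal odds for a two-player match into probabilities that sum to 1.

    Assumption: the overround is removed by proportional normalization.
    """
    raw_a = 1.0 / odds_a
    raw_b = 1.0 / odds_b
    book_total = raw_a + raw_b  # exceeds 1 by the bookmaker's margin
    return raw_a / book_total, raw_b / book_total


def ranking_points_probability(points_a, points_b, e=0.85):
    """P(a) = a^e / (a^e + b^e), with e = 0.85 for ATP men's singles (note 1)."""
    return points_a ** e / (points_a ** e + points_b ** e)


# Hypothetical inputs, not taken from the US Open data.
p_a, p_b = implied_probabilities(1.50, 2.60)
print(round(p_a, 3), round(p_b, 3))
print(round(ranking_points_probability(2000, 1000), 3))
```

Other ways of stripping the margin exist and would give slightly different probabilities, especially for heavy favorites.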

Next, we simply compare forecasts with reality for each model, asking: if player A was predicted to be the winner (P(a) > 0.5), did he really win the match? When we do that for each match and each model (ignoring retirements and walkovers), we come up with the following results.

Model		% correct
Pinnacle	76.92%
538		75.21%
TA		74.36%
ATP		72.65%
Riles		70.09%

What we see here is the percentage of predictions that were actually right. The betting model (based on the odds of Pinnacle) comes out on top, followed by the Elo models of FiveThirtyEight and Tennis Abstract. Interestingly, the Elo model of Riles is outperformed by the predictions inferred from the ATP ranking. Since there are several parameters that can be used to tweak an Elo model, Riles may still have some room left for improvement.
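The hit-rate calculation behind this table can be sketched in a few lines of Python. The function name and the sample forecasts are hypothetical. With the 117 predictions in this data set, 76.92% corresponds to 90 correct calls.

```python
def share_correct(forecasts):
    """Fraction of matches in which the predicted favorite actually won.

    forecasts: list of (p_a, a_won) pairs, where p_a is the forecast
    probability that player A wins and a_won is True if A won.
    Forecasts of exactly 0.5 name no favorite and are skipped.
    """
    decided = [(p, won) for p, won in forecasts if p != 0.5]
    hits = sum(1 for p, won in decided if (p > 0.5) == won)
    return hits / len(decided)


# Hypothetical forecasts, with retirements and walkovers already removed.
sample = [(0.70, True), (0.60, False), (0.30, False), (0.20, True)]
print(share_correct(sample))

# Sanity check against the table: 90 of 117 is 76.92%.
print(round(90 / 117 * 100, 2))
```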

However, just looking at the percentage of correctly called matches does not tell the whole story. There are more granular metrics for investigating the performance of a prediction model. Calibration, for instance, captures the ability of a model to provide forecast probabilities that are close to the true probabilities. In other words, in an ideal model, we want 70% forecasts to come true in exactly 70% of the cases. Resolution measures how much the forecasts differ from the overall average. The rationale here is that just using the expected average values for forecasting will lead to a reasonably well-calibrated set of predictions, but it will not be as useful as a method that manages the same calibration while taking current circumstances into account. In other words, the more extreme (and still correct) forecasts are, the better.

In the following table, we sort the predictions into bins by forecast probability and show what percentage of the predictions in each bin were correct. This also enables us to calculate Calibration and Resolution measures for each model.

Model    50-59%  60-69%  70-79%  80-89%  90-100% Cal  Res   Brier
538      53%     61%     85%     80%     91%     .003 .082  .171
TA       56%     75%     78%     74%     90%     .003 .072  .182
Riles    56%     86%     81%     63%     67%     .017 .056  .211
ATP      50%     73%     77%     84%     100%    .003 .068  .185
Pinnacle 52%     91%     71%     77%     95%     .015 .093  .172

As we can see, the predictions are not always perfectly in line with what the corresponding bin would suggest. Some of these deviations, for instance the fact that for the Riles model only 67% of the 90-100% forecasts were correct, can be explained by small sample size (only three in that case). However, there are two cases where the sample size is better and which caught my attention: the 60-69% bins for Riles and Pinnacle. Both models seem to be strongly underconfident (to a statistically significant degree) with their 60-69% predictions. In other words, these probabilities should have been higher, because, in reality, these forecasts were true 86% and 91% of the time (see note 3). For the betting aficionados, the fact that Pinnacle underestimates the favorites here may be really interesting, because it could reveal some value, as punters would say. For the Riles model, this could be a starting point for tweaking the model.

In the last three columns, Calibration (the lower the better), Resolution (the higher the better), and the Brier score (the lower the better) are shown. The Brier score combines Calibration and Resolution (and the uncertainty of the outcomes) into a single score for measuring the accuracy of predictions. The models of FiveThirtyEight and Pinnacle (for the subset of data used) perform essentially equally well. Then there is a slight gap until the model of Tennis Abstract and the ATP ranking model come in third and fourth, respectively. The Riles model performs worst in terms of both Calibration and Resolution and hence ranks fifth in this analysis.
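For readers who want to reproduce these three columns, here is a minimal Python sketch of the standard decomposition, Brier = Calibration - Resolution + Uncertainty. The numbers in the table are consistent with it (for FiveThirtyEight, .003 - .082 + .250 = .171), but the function name, the ten-percent bins, and the sample data are my assumptions, not the exact procedure used for the table.

```python
from collections import defaultdict


def brier_decomposition(forecasts, outcomes, n_bins=10):
    """Return (calibration, resolution, uncertainty, brier).

    forecasts: probabilities in [0, 1]; outcomes: 1 if the event happened, else 0.
    Forecasts are grouped into n_bins equal-width bins (an assumed binning).
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    bins = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        # Small epsilon guards against float error at bin edges;
        # min() puts p == 1.0 into the top bin.
        idx = min(int(p * n_bins + 1e-9), n_bins - 1)
        bins[idx].append((p, o))

    calibration = 0.0
    resolution = 0.0
    for members in bins.values():
        n_k = len(members)
        mean_forecast = sum(p for p, _ in members) / n_k
        observed = sum(o for _, o in members) / n_k
        calibration += n_k * (mean_forecast - observed) ** 2
        resolution += n_k * (observed - base_rate) ** 2
    calibration /= n
    resolution /= n

    uncertainty = base_rate * (1 - base_rate)
    brier = sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / n
    return calibration, resolution, uncertainty, brier


# Hypothetical forecasts and results.
cal, res, unc, brier = brier_decomposition([0.6, 0.6, 0.9, 0.9], [1, 0, 1, 1])
print(cal, res, unc, brier)
```

The identity holds exactly only when all forecasts within a bin are equal. With real forecasts spread across a bin, the three terms add up to the Brier score only approximately.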

To conclude, I would like to show a common visual representation that is used to graphically display a set of predictions. The reliability diagram compares the observed rate of forecasts with the forecast probability (similar to the above table).

The closer one of the colored lines is to the black line, the more reliable the forecasts are. If a forecast line is above the black line, the forecasts are underconfident; if it is below, they are overconfident. Given that we only investigated one tournament and therefore had to work with a low sample size (117 predictions), the big swings in the graph are somewhat expected. Still, we can see that the model based on ATP rankings does a really good job of avoiding overestimations, even though it is known to be outperformed by Elo in terms of prediction accuracy.

To sum up, this analysis shows how different predictive models for tennis can be compared with each other in a meaningful way. Moreover, I hope I was able to show some of the areas where a model is good and where it’s bad. Obviously, this investigation could go into much more detail by, for example, comparing how well the models do for different kinds of players (e.g., based on ranking), different surfaces, etc. This is something I will save for later. For now, I’ll try to get my sleeping patterns accustomed to the schedule of play for the Australian Open, and I hope you can do the same.

This is a guest article by me, Peter Wetz. I am a computer scientist interested in racket sports and data analytics based in Vienna, Austria.


1. P(a) = a^e / (a^e + b^e) where a are player A’s ranking points, b are player B’s ranking points, and e is a constant. We use e = 0.85 for ATP men’s singles.

2. The betting market in itself is not really a model, that is, the goal of the bookmakers is simply to balance their book. This means that the odds, more or less, reflect the wisdom of the crowd, making it a very good predictor.

3. As an example, one instance where Pinnacle was underconfident and all other models were more confident is the R32 encounter between Ivo Karlovic and Jared Donaldson. Pinnacle’s implied probability for Karlovic to win was 64%. The other models (except the also underconfident Riles model) gave 72% (ATP ranking), 75% (FiveThirtyEight), and 82% (Tennis Abstract). As it turned out, Karlovic won in straight sets. One factor at play here might be that this was the US Open, where more US citizens are likely to be confident about the US player Jared Donaldson and hence place a bet on him. As a consequence, to balance the book, Pinnacle will lower the odds on Donaldson, which results in higher odds (and a lower implied probability) for Karlovic.

Can Nick Kyrgios Win a Grand Slam?

Today’s breaking news? Former Wimbledon finalist Mark Philippoussis thinks that Nick Kyrgios can win the Australian Open. Hey, it’s almost the offseason. We take our news wherever we can get it.

Still, it’s an interesting question. Is it possible for such a volatile, one-dimensional player to string together seven wins on one of the biggest stages in the sport? Philippoussis–not the most versatile of players himself–reached two Slam finals. A big serve can take you far.

Last year, I published a post investigating the “minimum viable return game,” the level of return success that a player would need to maintain in order to reach the highest echelon of men’s tennis. It’s rare to finish a season in the top ten without winning at least 38% of return points, though a few players, including Milos Raonic, have managed it. When I wrote that article, Kyrgios’s average for the previous 52 weeks was a measly 31.7%, almost in the territory of John Isner and Ivo Karlovic.

Kyrgios has improved since then. In 2016, he won 35.4% of return points, almost equal to Raonic’s 35.9%–and most would agree that Milos had an excellent year. Philippoussis’s career mark was only 34.9%, though Kyrgios would be lucky to play as many tournaments on grass and carpet as Philippoussis did. Still, a sub-36% rate of return points won isn’t usually good enough in today’s game: Raonic was only the third player since 1991 (along with Pete Sampras and Goran Ivanisevic) to finish a season in the top five with such a low rate.

Then again, Philippoussis didn’t say anything about finishing in the top five. The “minimum viable Slam-winning return game” might be different. Looking at all Grand Slam champions back to 1991, here are the lowest single-tournament rates of return points won:

Year  Slam             Player               RPW%                     
2001  Wimbledon        Goran Ivanisevic    31.1%  
1996  US Open          Pete Sampras        32.8%  
2009  Wimbledon        Roger Federer       33.7%  
2002  US Open          Pete Sampras        35.6%  
2000  Wimbledon        Pete Sampras        36.6%  
2010  Wimbledon        Rafael Nadal        36.8%  
2014  Australian Open  Stan Wawrinka       37.0%  
1998  Wimbledon        Pete Sampras        37.2%  
1991  Wimbledon        Michael Stich       37.4%  
2000  US Open          Marat Safin         37.5%

Wimbledon is well-represented here, as we might expect. Not so for Kyrgios’s home Slam: Stan Wawrinka’s 2014 Australian Open title is the only time it appears in the top 20, even though it has played very fast in recent years. Every other Melbourne titlist won at least 39.5% of return points. As with year-end top-ten finishes, 38% is a reasonable rule of thumb for the minimum viable level, though on rare occasions, it is possible to come in below that.

The bar is set: Can Kyrgios clear it? 18 months ago, when Kyrgios’s 52-week return-points-won average was below 32%, the obvious answer would have been negative. His current mark above 35% makes the question a more interesting one. To win a Slam, he’ll probably need to return better, but only for seven matches.

The Australian has enjoyed one seven-match streak–in fact, a nine-match run–that would be more than good enough. Combining his title in Marseille and his semifinal showing in Dubai this February, Kyrgios played almost nine matches (he retired with a back injury in the last one) while winning a whopping 41.5% of return points. At 42 of the last 104 Slams, the champion has won return points at a lower rate.

However, February was an aberration. To approximate Kyrgios’s success over the length of a Slam, I looked at his return points won over every possible streak of ten matches. (Most of his matches have been best-of-three, so ten matches is about the same number of points as a Slam title run.) Aside from the streaks involving Marseille and Dubai this year, he has never topped 37% for that length of time.
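For anyone who wants to run the same kind of check, here is a minimal Python sketch of the ten-match streak calculation. It assumes each match is stored as a (return points won, return points played) pair in chronological order and that points are pooled across the window rather than averaging per-match percentages. The function name and the sample numbers are hypothetical.

```python
def streak_rpw_rates(matches, window=10):
    """Return-points-won rate for every run of `window` consecutive matches.

    matches: chronological list of (return_points_won, return_points_played).
    Returns one pooled rate per possible starting match.
    """
    rates = []
    for start in range(len(matches) - window + 1):
        chunk = matches[start:start + window]
        won = sum(w for w, _ in chunk)
        played = sum(n for _, n in chunk)
        rates.append(won / played)
    return rates


# Hypothetical three-match sample with a window of two matches.
sample = [(30, 100), (40, 100), (50, 100)]
rates = streak_rpw_rates(sample, window=2)
print(rates, max(rates))
```

Pooling points weights long matches more heavily, which seems closer in spirit to comparing against a single-tournament rate.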

There’s always hope for improvement, especially for a mercurial 21-year-old in a sport dominated by older men. But the evidence is against him here, as well. Research by falstaff78 suggests that players do not substantially improve their return statistics as they mature. That may seem counterintuitive, since some players clearly do develop their skills. However, as players get better, they go deeper in tournaments and alter their schedules, changing the mix of opponents they face. Two years ago, Kyrgios faced seven top-20 players. This year he played 18. Raonic, who represents an optimistic career trajectory for Kyrgios, faced 26 this season.

Against the top 20–the sorts of Grand Slam opponents a player has to beat to get from the fourth round to the trophy ceremony–Kyrgios has won less than 30% of his career return points. Even Raonic, who has yet to win a Slam himself, has done better, and won 32.6% of return points against top-20 opponents this year.

There’s little doubt that Kyrgios has the serve to win Grand Slams. And once the Big Four retire, I suppose someone will have to win the majors. But even in weak eras, you need to break serve, and at Slams, you typically need to do so many times, and against very high-quality opponents. The evidence we have so far strongly implies that Kyrgios, like Philippoussis before him, will struggle to triumph at a Slam.

Shot-by-Shot Stats for 261 Grand Slam Finals (and More?)

One of my favorite subsets of the Match Charting Project is the ongoing effort–in huge part thanks to Edo–to chart all Grand Slam finals, men’s and women’s, back to 1980. We’re getting really close, with a total of 261 Slam finals charted, including:

  • every men’s Wimbledon and US Open final all the way back to 1980;
  • every men’s Slam final since 1989 Wimbledon;
  • every women’s Slam final back to 2001, with a single exception.

The Match Charting Project gathers and standardizes data that, for many of these matches, simply didn’t exist before. These recaps give us shot-by-shot breakdowns of historically important matches, allowing us to quantify how the game has changed–at least at the very highest level–over the last 35 years. A couple of months ago, I did one small project using this data to approximate surface speed changes–that’s just the tip of the iceberg in terms of what you can do with this data. (The dataset is also publicly available, so have fun!)

We’ve got about 30 Slam finals left to chart, and you might be able to help. As always, we are actively looking for new contributors to the project to chart matches (here’s how to get started, and why you should, and you don’t have to chart Slam finals!), but right now, I have a different request.

We’ve scoured the internet, from YouTube to Youku to torrent trackers, to find video for all of these matches. While I don’t expect any of you to have the 1980 Teacher-Warwick Australian Open final sitting around on your hard drive, I’ve got higher hopes for some of the more recent matches we’re missing.

If you have full (or nearly full) video for any of these matches, or you know of a (preferably free) source where we can find them, please–please, please!–drop me a line. Once we have the video, Edo or I will do the rest, and the project will become even more valuable.

There are several more finals from the 1980s that we’re still looking for. Here’s the complete list.

Thanks for your help!

The Grass is Slowing: Another Look at Surface Speed Convergence

A few years ago, I posted one of my most-read and most-debated articles, called The Mirage of Surface Speed Convergence.  Using the ATP’s data on ace rates and breaks of serve going back to 1991, it argued that surface speeds aren’t really converging, at least to the extent we can measure them with those two tools.

One of the most frequent complaints was that I was looking at the wrong data–surface speed should really be quantified by rally length, spin rate, or any number of other things. As is so often the case with tennis analytics, we have only so much choice in the matter. At the time, I was using all the data that existed.

Thanks to the Match Charting Project–with a particular tip of the cap to Edo Salvati–a lot more data is available now. We have shot-by-shot stats for 223 Grand Slam finals, including over three-fourths of Slam finals back to 1980. While we’ll never be able to measure anything like ITF Court Pace Rating for surfaces thirty years in the past, this shot-by-shot data allows us to get closer to the truth of the matter.

Sure enough, when we take a look at a simple (but until recently, unavailable) metric such as rally length, we find that the sport’s major surfaces are playing a lot more similarly than they used to. The first graph shows a five-year rolling average* for the rally length in the men’s finals of each Grand Slam from 1985 to 2015:


* since some matches are missing, the five-year rolling averages each represent the mean of anywhere from two to five Slam finals.
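A rolling average that tolerates missing finals, as described in the note above, could be sketched like this in Python. The trailing (rather than centered) window, the function name, and the sample values are my assumptions. The minimum of two finals per window comes from the note.

```python
def rolling_mean(rally_length_by_year, year, window=5, min_count=2):
    """Mean rally length over the finals available in the `window` years ending at `year`.

    rally_length_by_year: dict mapping year -> average rally length for that final.
    Years with no charted final are simply absent from the dict.
    Returns None if fewer than `min_count` finals are available.
    """
    available = [
        rally_length_by_year[y]
        for y in range(year - window + 1, year + 1)
        if y in rally_length_by_year
    ]
    if len(available) < min_count:
        return None
    return sum(available) / len(available)


# Hypothetical values: three of five finals charted.
sample = {2001: 2.0, 2003: 4.0, 2005: 6.0}
print(rolling_mean(sample, 2005))
```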

Over the last decade and a half, the hard-court and grass-court slams have crept steadily upward, with average rally lengths now similar to those at Roland Garros, traditionally the slowest of the four Grand Slam surfaces. The movement is most dramatic in the Wimbledon grass, which for many years saw an average rally length of a mere two shots.

For all the advantages of rally length and shot-by-shot data, there’s one massive limitation to this analysis: It doesn’t control for player. (My older analysis, with more limited data per match, but for many more matches, was able to control for player.) Pete Sampras contributed to 15 of our data points, but none on clay. Andres Gomez makes an appearance, but only at Roland Garros. Until we have shot-by-shot data on multiple surfaces for more of these players, there’s not much we can do to control for this severe case of selection bias.

So we’re left with something of a chicken-and-egg problem.  Back in the early 90’s, when Roland Garros finals averaged almost six shots per point and Wimbledon finals averaged barely two shots per point, how much of the difference was due to the surface itself, and how much to the fact that certain players reached the final? The surface itself certainly doesn’t account for everything–in 1988, Mats Wilander and Ivan Lendl averaged over seven shots per point at the US Open, and in 2002, David Nalbandian and Lleyton Hewitt topped 5.5 shots per point at Wimbledon.

Still, outliers and selection bias aside, the rally length convergence we see in the graph above reflects a real phenomenon, even if it is amplified by the bias. After all, players who prefer short points win more matches on grass because grass lends itself to short points, and in an earlier era, “short points” meant something more extreme than it does today.

The same graph for women’s Grand Slam finals shows some convergence, though not as much:


Part of the reason that the convergence is more muted is that there’s less selection bias. The all-surface dominance of a few players–Chris Evert, Martina Navratilova, and Steffi Graf–means that, if only by historical accident, there is less bias than in men’s finals.

We still need a lot more data before we can make confident statements about surface speeds in 20th-century tennis. (You can help us get there by charting some matches!) But as we gather more information, we’re able to better illustrate how the surfaces have become less unique over the years.

The Unalarming Rate of Grand Slam Retirements

Yesterday, Vitalia Diatchenko proved to be even less of a match for Serena Williams than expected. She retired down 6-0, 2-0, winning only 5 of 37 points. She also sparked the usual array of questions about how Grand Slam prize money–$39,500 for first-round losers–incentivizes players to show up and collect a check even if they aren’t physically fit to play.

Diatchenko wasn’t the only player to exit yesterday without finishing a match. Of the 32 men’s matches, six ended in retirement. On the other hand, none of those were nearly as bad. All six injured men played at least two sets, and five of them won a set.

The prominence of Serena’s first-round match, combined with the sheer number of Monday retirements, is sure to keep pundits busy for a few days proposing rule changes. As we’ll see, however, there’s little evidence of a trend, and no need to change the rules.

Men’s slam retirements in context

Before yesterday’s bloodbath, there had been only five first-round retirements in the men’s halves of this year’s Grand Slams. The up-to-date total of 11 retirements is exactly equal to the annual average from 1997-2014 and the same as the number of first-round retirements in 1994.

The number of first-round Slam retirements has trended up slightly over the last 20 years. From 1995 to 2004, an average of ten men bowed out of their first-round matches each year. From 2005 to 2014, the average was 12.2–in large part thanks to the total of 19 first-round retirements last season.

That rise represents an increase in injuries and retirements in general, not a jump in unfit players showing up for Slams. From 1995 to 2004, an average of 8.5 players retired or withdrew from Slam matches after the first round, while in the following ten years, that number rose to 10.8.

Retirements at other tour-level events tell the same story. At non-Slams from 1995-2004, the retirement rate was about 1.3%, and in the following ten years, it rose to approximately 1.8%. (There isn’t much of a difference between first-round and later-round retirements at non-Slams.)

Injury rates in general have risen–exactly what we’d expect from a sport that has become increasingly physical. Based on recent results, we shouldn’t be surprised to see more retirements in best-of-five matches, as most of yesterday’s victims would’ve survived to the end of a best-of-three contest.

Women’s slam retirements

In most seasons, the rate of first-round retirements in women’s Grand Slam draws is barely half of the corresponding rate in other tour events.

In the last ten years, just over 1.2% of Slam entrants have quit their first-round match early. The equivalent rate in later Slam rounds is 1.1%, and the first-round rate at non-Slam tournaments is 2.26%. Diatchenko was the fifth woman to retire in a Slam first round this year, and if one more does so today, the total of six retirements will be exactly in line with the 1.2% average.
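The arithmetic behind that last comparison is simple enough to check in Python. It relies on the standard 128-player singles draw at each of the four Slams; the function name is hypothetical.

```python
DRAW_SIZE = 128       # players in a Grand Slam singles main draw
SLAMS_PER_YEAR = 4


def first_round_retirement_rate(retirements):
    """Share of a season's Slam entrants who retired in the first round."""
    return retirements / (DRAW_SIZE * SLAMS_PER_YEAR)


print(f"{first_round_retirement_rate(6):.2%}")  # 6 / 512 -> 1.17%
```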

One painful anecdote isn’t a trend, and the spotlight of a high-profile match shouldn’t give any more weight to a single data point. Even with the giant checks on offer to first-round losers, players are not showing up unfit to play any more often than they do throughout the rest of the season.

Teenagers, Thirty-Somethings, and Americans at Grand Slams

I’ve put together a few reports showing how age distributions and US presence have changed over the years at Grand Slams.  Let’s start with player age.

The average age of players in the Wimbledon men’s singles draw is 27.7 years, which is just short of the all-time record, 27.8, set at Roland Garros last month, and equal to last year’s figure at Wimbledon. There are two teens in the draw (up one from last year), and 34 thirty-somethings, which is tied for third-most since 1982.

This report shows the complete year-by-year breakdown for the last 30 years’ worth of men’s slam draws.

The average age in the Wimbledon women’s draw is also very high by historical standards.  At 25.2 years, it’s tied with this year’s French Open and 2012 Wimbledon for the highest ever.  43-year-old Kimiko Date Krumm moves the needle all by herself; without her, the average would be 25.0, still considerably higher than any other pre-2010 slam.

There are ten teenagers in the draw, which is very low for the WTA, but safely above the all-time low of 7, set at Wimbledon two years ago. The total of 16 players aged 30 or over is good for third-most of all time, behind this year’s and last year’s French Opens.

Here’s the WTA report showing these numbers for each slam in the last 30 years.

(All of the figures above for 2014 Wimbledon could change slightly if more lucky losers are added to the draw.)

I also put together a couple of reports showing the number of Americans in each slam draw, broken down by direct entrants, qualifiers, lucky losers, and wild cards, along with the top seed, the number of seeds (and top 16 seeds), plus the number of Americans in each round:



The Effect of 32 Seeds

In the middle of 2001, the Grand Slams doubled the number of seeds in the draw from 16 to 32, a change “designed to protect star players and satisfy clay and grass specialists.”

The intended beneficiaries of the change were, of course, all seeded players. Those in the top 16 no longer had to worry about facing a fellow top-32 player until the third round. Those ranked from 17-32, who before the change may have faced a top 16 player in the first round, now received the same protection.

The costs of the 32-seed system are borne by two groups: unseeded players, who are now more likely to face a top-ranked player early; and first-week fans, who would like to see more “compelling” early-round matches. While it’s easy to point to shock upsets like Serena Williams’s exit today as a counterpoint, the first two rounds at Slams often feel like warm-up matches for the biggest stars, with fringe players as their hapless foils.

On the other hand, it’s tough to get an intuitive sense of just how much is at stake here. It may not be as much as you think. From 1989 to 2000, men’s seeds were upset 263 times in the first two rounds of slams. Only 51 of those losses were to players in the top 32. In other words, more than 80% of those upsets would have occurred even with a 32-seed format, and presumably, some of the remaining 51 matches would still have resulted in upsets.
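The "more than 80%" figure follows directly from the two counts above, as this short Python check shows (the variable names are mine):

```python
early_upsets = 263        # men's seeds upset in the first two rounds, 1989-2000
upsets_by_top_32 = 51     # of those, losses to players ranked in the top 32

# Upsets that a 32-seed format could not have prevented.
unaffected_share = (early_upsets - upsets_by_top_32) / early_upsets
print(f"{unaffected_share:.1%}")  # 212 / 263 -> 80.6%
```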

From the perspective of the top 16 seeds, there may not be that much difference between opponents ranked in the next 16 and those ranked lower still. To cherry-pick just one example, there are many seeded players Stanislas Wawrinka would have rather faced this week than Guillermo Garcia Lopez.

For top-four women, it hasn’t made a difference at all. In the twelve years before the switch, they reached the third round in 176 of 190 attempts. In the twelve years after the format change, women seeded 1-4 no longer risked facing a top-32 player in the first two rounds, and reached the third round in 178 of 191 attempts.

In fact, for top-16 women’s seeds in general, the 32-seed format has not helped. From 1989-2000, women’s seeds reached the third round 77.6% of the time, the fourth round 63.5% of the time, and the quarterfinals 40.8% of the time. From 2002-13, with lower-ranked early-round opponents, the corresponding numbers were 78.2%, 60.1%, and 37.1%.

It’s likely that some of the differences have to do with the increasing depth of the women’s game, but it’s hardly the case that the 32-seed format has drastically changed the nature of the majors, at least for the players who have been seeded all along. Men’s top-16 seeds have benefited, reaching the third, fourth, and quarterfinal rounds about 10% more often since the switch to 32 seeds, but even here, we’re not seeing radically different second weeks.

The real change, as you might suspect, appears when we consider the balance of power between the new seeds (17-32) and the rest of the field. From 1989-2000, when there were only 16 seeds and those two groups were treated the same way, men’s players ranked 17-32 reached the third round about twice as often (35% to 17%) as their lower-ranked competitors. Women in the 17-32 range held a wider advantage of 39% to 15%.

Now that there are 32 seeds and the 17-32 group is protected, those gaps have substantially grown. From 2002-13, men seeded outside the top 16 have reached the third round 53% of the time, compared to 12% for unseeded players. Seeded women in the 17-32 range have reached the third round 49% of the time, while unseeded women have equaled their male counterparts at 12%.

These differences, big as they are, aren’t going to affect most fans’ enjoyment of the majors. The format change means that Rafael Nadal faces a player ranked 60th in the world in the second round and a player ranked 30th in the third round. He’ll almost always win both matches, so the end result is the same. A surprise run to the quarterfinals isn’t much different if it’s made by world #25 than by #50.

However, the 32-seed format does amplify the gap between tennis’s haves and have-nots. Yes, the Grand Slams have massively increased prize money in the last few years for all main-draw competitors–first-round losers in Paris earn more than $32,000 for their efforts. But players who reach the third round are able to triple that money.

As we’ve seen, the format change has made it much more likely that #32 reaches the third round (and takes home a nearly six-figure purse) at the expense of everyone ranked lower–despite having little effect on the makeup of the field in the fourth round and beyond.  Plus, the ranking points on offer at Slams mean that third-rounders are that much more likely to earn a seed at the next major, starting the next round of the same cycle.

Seeding 32 players instead of 16 doesn’t have much of an effect on the fates of top players, especially on the women’s side. It can, however, lessen interest in the first several days of play, and it certainly supports an arbitrary middle tier of players at the expense of the rest of the field.

If the 32-seed era were to end here, there would be little reason for tennis fans to miss it.