GOAT Races: Forecasting Future Slams With a Monkey

After Novak Djokovic won his 16th career major at Wimbledon this year, more attention than ever focused on the all-time grand slam race. Roger Federer has 20, Rafael Nadal has 18, and Djokovic is–by far–the best player in the world on the surface of the next two slams. This is anybody’s ballgame.

Forecasting tennis is hard, and that’s just if you’re trying to pick the results of tomorrow’s matches. Players improve and regress seemingly at random, making it difficult to predict what the ranking table will look like only a few months from now. Fans love to speculate about which of the big three will, in the end, win the most slams, but there are an awful lot of unknowns to contend with.

One can imagine some way to construct a crystal ball to get these numbers in a rigorous way. Consider each player’s age, his likely career length, his chances of injury, his recent performance at each of the four slams, his current ranking, the quality of the field on each surface, and probably more, and maybe you could come up with some plausible numbers. Or… what if we skip most of that, and build the simplest model possible?

Enter the monkey

Baseball statheads are familiar with the Marcel projection system, named after a fictional monkey because it “uses as little intelligence as possible.” Just three years of results and an age adjustment. It isn’t perfect, and there are plenty of “obvious” improvements that it leaves on the table. But as in tennis, baseball stats are noisy. For most purposes, a “basic” forecasting system is as good as a complicated one, and over the years, Marcel has outperformed a lot of models that are considerably more complex.

Let’s apply primate logic to slam predictions. First, I’m going to slightly re-cast the question to something a bit more straightforward. Instead of forecasting “career” slam results, we’re going to focus on major titles over the next five years. (That should cover the big three, anyway.) And in keeping with Marcel, we’ll use just a few inputs: slam semi-finals, finals, and titles for the last three years, plus age. Actually, we’re going to lop off a bit of the monkey’s brain right away, because slam results from three years ago aren’t that predictive. So our list of inputs is even shorter: two years of slam semi-finals, finals, and titles, plus age.

The resulting model is pretty good! For players who have reached a major semi-final in any of the last eight slams, it predicts 40% of the variation in next-five-years slam titles. Without building the hyper-complex, optimal model, we don’t know exactly how good that is, but for a forecast that extends so far into the future, capturing two-fifths of the player-to-player variation in slam results sounds good to me. Think of all the things we don’t know about the slams in 2022, let alone 2024: who is still playing, who gets hurt, who has improved enough to contend, which prospects have come out of nowhere, and so on. Point being, the best model is going to miss a lot, so we shouldn’t set our standards too high.

Follow the monkey

The two-years-plus-age algorithm is so simple that you can literally do it on the back of an envelope. For any player, count his semi-final appearances (won or lost), final appearances (won or lost), and titles at the last four slams, then do the same for the previous four. Then note his age at the start of the next major. Start with zero points, then follow along:

  • add 15 points for each semi-final appearance in the last four slams
  • add 30 points for each final appearance in the last four slams
  • add 90 points for each title in the last four slams
  • add 6 points for each semi-final appearance in the previous four slams
  • add 12 points for each final appearance in the previous four slams
  • add 36 points for each title in the previous four slams
  • if the player is older than 27, subtract 8 points for each year he is older than 27
  • if the player is younger than 27, add 8 points for each year he is younger than 27
  • divide the sum by 100

That’s it! Let’s try Djokovic. In the last four majors, he’s won three titles and made one more semi-final. In the four before that, he won one title. He’ll enter the US Open at 32 years of age. Here goes:

  • +60 (15 points for each of his four semi-finals in the last four slams)
  • +90 (30 points for each of his three finals in the last four slams)
  • +270 (90 points for each of his three titles in the last four slams)
  • +6 (6 points for his 2017 Wimbledon semi-final)
  • +12 (12 points for his 2017 Wimbledon final)
  • +36 (36 points for his 2017 Wimbledon title)
  • -40 (Novak is 32, so we subtract 8 points for each of the 5 years he is older than 27)

Add it all up, and you get 434. Divide by 100, and we’re predicting 4.34 more slams for Novak.
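
For anyone who would rather not hunt for an envelope, the same arithmetic can be sketched in a few lines of Python. The function name `monkey_forecast` and its argument names are made up for illustration; the point weights and the peak age of 27 come straight from the list above. The list doesn't say what happens when the sum goes negative, so this sketch doesn't clamp it.

```python
def monkey_forecast(recent, previous, age, peak_age=27):
    """Forecast slam titles over the next 20 majors.

    recent and previous are (semi_finals, finals, titles) tuples for the
    last four slams and the four before that. Each count includes deeper
    runs: a title also counts as a final and as a semi-final.
    """
    recent_weights = (15, 30, 90)
    previous_weights = (6, 12, 36)
    points = sum(w * n for w, n in zip(recent_weights, recent))
    points += sum(w * n for w, n in zip(previous_weights, previous))
    # 8 points per year: a penalty past the peak age, a bonus before it
    points -= 8 * (age - peak_age)
    return points / 100


# Djokovic entering the 2019 US Open
djokovic = monkey_forecast(recent=(4, 3, 3), previous=(1, 1, 1), age=32)
print(djokovic)  # 4.34
```

Keeping `peak_age` as a parameter makes it easy to ask later what happens if the aging curve has moved.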

Next-level GOAT trolling

I promise, I went about this project solely as a disinterested analyst. I just wanted to know how accurate a bare-bones long-term slam forecast could be. My goal was not to make you tear your hair out. But hey, you were probably going to lose your hair anyway.

Here is the number of slams that the model predicts for the big three between the 2019 US Open and 2024 Wimbledon:

  • Djokovic: 4.34
  • Nadal: 2.22
  • Federer: 0.26

You probably don’t need me to do the math for the next step, but you know I can’t not do it. Projected career totals:

  • Djokovic: 20.34
  • Federer: 20.26
  • Nadal: 20.22

Or, since we live in a world where you can’t win fractional majors:

  • Djokovic: 20
  • Federer: 20
  • Nadal: 20

Ha.
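
If you want to check the punchline yourself, here is a minimal sketch. The current title counts and the forecasts are the ones quoted above; the variable names are just for illustration.

```python
current_titles = {"Djokovic": 16, "Federer": 20, "Nadal": 18}
forecast = {"Djokovic": 4.34, "Federer": 0.26, "Nadal": 2.22}

# Career total = slams already won + forecast for the next 20 majors
projected = {
    name: round(current_titles[name] + forecast[name], 2)
    for name in current_titles
}
# Nobody wins a fraction of a major, so round to whole titles
whole = {name: round(total) for name, total in projected.items()}

print(projected)
print(whole)
```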

Back to the model

Djokovic’s forecast of 4.34 is quite high, in keeping with a player who has won three of the last four majors. For each year since 1971, I calculated a slam prediction for every player who had made a major semi-final in the previous two years–a total of more than 800 forecasts. Only 14 of those forecasts were higher than 4.34, and several of those belonged to the big three. Here are the top ten:

Year  Player         Age   Predicted  Actual     
2008  Roger Federer   26        6.38       5     
2007  Roger Federer   25        5.86       7     
2016  Novak Djokovic  28        5.20       6  *  
2005  Roger Federer   23        4.91      11     
2011  Rafael Nadal    24        4.89       5     
2006  Roger Federer   24        4.86      10     
2017  Novak Djokovic  29        4.79       4  *  
2012  Novak Djokovic  24        4.68       8     
1989  Mats Wilander   24        4.65       0     
1988  Ivan Lendl      27        4.56       2 

* actual slam counts that could still increase

All of these predictions are based on data available at the beginning of the named year. So the top row, 2008 Federer, is the forecast for Federer’s 2008-12 title count, based on his 2006-07 performance and his age entering the 2008 Australian Open. Had the model existed back then, it would have guessed he’d win a half-dozen slams in that time period. He came close, winning five.

There will be plenty of noise at the extreme ends of any model like this. At the beginning of 2005, the algorithm pegged Federer to win “only” five of the next twenty majors. Instead, he won 11. I can’t imagine any data-based system would have been so optimistic as to guess double digits. On the flip side, the 1989 edition of the monkey would’ve been nearly as hopeful for Mats Wilander, who was coming off a three-slam campaign. Sadly for the Swede, a gang of youngsters overtook him and he never made another major final.

Let’s also take a look at the next 10 rosiest forecasts, plus the current guesstimate for Djokovic:

Year  Player          Age  Predicted  Actual     
2010  Roger Federer    28       4.48       2     
1981  Bjorn Borg       24       4.47       1     
1996  Pete Sampras     24       4.47       6     
1975  Jimmy Connors    22       4.45       2     
Curr  Novak Djokovic   32       4.34       0  *  
1980  Bjorn Borg       23       4.28       3     
2013  Novak Djokovic   25       4.24       7     
2009  Roger Federer    27       4.20       4     
1995  Pete Sampras     23       4.16       7     
2009  Rafael Nadal     22       4.12       8     
1979  Bjorn Borg       22       4.09       5 

Plenty more noise here, with outcomes between 0 and 8 slams. Still, the average result of the 10 other predictions on this list is 4.5 slams, right in line with our forecast for Novak.
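
That comparison is easy to reproduce from the table. A minimal sketch, using only the ten completed rows above (the current Djokovic forecast is left out because its five years haven't been played yet):

```python
from statistics import mean

# (year, player, predicted, actual) copied from the table above
rows = [
    (2010, "Roger Federer", 4.48, 2),
    (1981, "Bjorn Borg", 4.47, 1),
    (1996, "Pete Sampras", 4.47, 6),
    (1975, "Jimmy Connors", 4.45, 2),
    (1980, "Bjorn Borg", 4.28, 3),
    (2013, "Novak Djokovic", 4.24, 7),
    (2009, "Roger Federer", 4.20, 4),
    (1995, "Pete Sampras", 4.16, 7),
    (2009, "Rafael Nadal", 4.12, 8),
    (1979, "Bjorn Borg", 4.09, 5),
]

avg_predicted = mean(row[2] for row in rows)
avg_actual = mean(row[3] for row in rows)

print(f"average predicted: {avg_predicted:.2f}")  # 4.30
print(f"average actual:    {avg_actual:.2f}")     # 4.50
```

Ten forecasts is a tiny sample, so the near-match between the two averages is reassuring rather than conclusive.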

Missing slams…

The model expects that the big three will win around seven of the next twenty slams. You might reasonably wonder: What about the other thirteen?

The monkey only considers players with a slam semi-final in the last eight majors, so the forecasts shouldn’t add up to 20. There’s a chance that the champions in 2023 and 2024 aren’t yet on our radar, and many young names of interest to pundits these days, like Alexander Zverev, Felix Auger-Aliassime, and Daniil Medvedev, haven’t yet reached the final four of a major. Here are the players for whom we can make predictions:

Player                 Predicted Slams  
Novak Djokovic                    4.34  
Rafael Nadal                      2.22  
Dominic Thiem                     0.71  
Stefanos Tsitsipas                0.63  
Hyeon Chung                       0.38  
Lucas Pouille                     0.31  
Kyle Edmund                       0.30  
Roger Federer                     0.26  
Juan Martin del Potro             0.19  
Marco Cecchinato                  0.06  
----------------                  ----  
TOTAL                             9.40 

(The five other players with semi-final appearances since the 2017 US Open are forecast to win zero slams.)

Yeah, I know, Lucas Pouille and Hyeon Chung aren’t really better bets to win a slam than Federer is. But they are (relatively) young, and the model recognizes that many players who reach slam semi-finals early in their careers are able to build on that success.

More to the point, we’re leaving a lot of majors on the table. If the overall forecast is correct, that list of players will win fewer than half of the next 20 slams, leaving at least ten championships to players who have yet to win a major quarter-final.
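
The bookkeeping behind those numbers is just addition. A quick sketch, with the forecasts copied from the table above and the 20-slam window taken from the five-year horizon:

```python
forecasts = {
    "Novak Djokovic": 4.34,
    "Rafael Nadal": 2.22,
    "Dominic Thiem": 0.71,
    "Stefanos Tsitsipas": 0.63,
    "Hyeon Chung": 0.38,
    "Lucas Pouille": 0.31,
    "Kyle Edmund": 0.30,
    "Roger Federer": 0.26,
    "Juan Martin del Potro": 0.19,
    "Marco Cecchinato": 0.06,
}
SLAMS_IN_WINDOW = 20  # four majors a year for five years

known_total = round(sum(forecasts.values()), 2)
unaccounted = round(SLAMS_IN_WINDOW - known_total, 2)
big_three = round(
    sum(forecasts[name] for name in
        ("Novak Djokovic", "Rafael Nadal", "Roger Federer")),
    2,
)

print(known_total)   # 9.4
print(unaccounted)   # 10.6
print(big_three)     # 6.82
```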

…and age

Remember, I retro-forecasted every five-year period back to 1971-75. Over the 44 five-year spans starting each season between 1971 and 2014, the model typically predicted that the players it knew about–the ones who had reached slam semi-finals in the last two years–would win 13 of the next 20 slams. In fact, those on-the-radar players combined to win an average of 12 majors in the ensuing five-year spans.

Only in the last few years has the total number of predicted slams fallen below 10. The culprit is age: Recall that every forecast has an age adjustment, and we subtract 8 points (0.08 slams) for each year a player is older than 27. That’s a 0.4-slam penalty for both Djokovic and Nadal, and it’s 0.8 slams erased from Federer’s future tally. Thus, the model predicts that the big three are fading, and there aren’t many youngsters (like Pouille and Chung) on the list to compensate.

How you interpret these big three forecasts in light of the “missing” slams depends on a couple of factors:

  • Has the aging curve for superstars changed? Is 30 the new 25; 32 the new 27?
  • Will the next few generations of players soon be good enough to topple the big three?

There’s plenty of evidence that the aging curve has changed, that we should expect more from 30-somethings these days than we did in the 1980s and 1990s. That would close much of the gap. Let’s say we set the new peak age at 31, four years later than the men’s Open Era average of 27. That would add 0.32 slams to every player’s forecast, possibly adding one more slam to each of the big three’s forecasted totals. Overall, it would add a bit more than three slams to the total of the previous table, putting that number close to the historical average of 13.

Shifting the age adjustment doesn’t disentangle the big three, though, because it affects them all equally. It just means a three-way tie at 21 is a bit more likely than a three-way tie at 20.
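
Here is that what-if as a rough sketch. It assumes the shift is applied as a flat bump to the ten players listed in the earlier table, and it ignores the five players forecast at zero, since the post doesn't say how their numbers would move. The variable names are illustrative.

```python
PEAK_SHIFT_YEARS = 4   # peak age of 31 instead of 27
POINTS_PER_YEAR = 8    # the age adjustment from the envelope method

# Moving the peak later adds the same amount to every forecast
bump = PEAK_SHIFT_YEARS * POINTS_PER_YEAR / 100

# Projected career totals under the original peak age of 27
career = {"Djokovic": 20.34, "Federer": 20.26, "Nadal": 20.22}
shifted = {name: round(total + bump) for name, total in career.items()}

# Ten listed players, each gaining the same bump
table_total = round(9.40 + 10 * bump, 2)

print(bump)         # 0.32
print(shifted)      # every one of the big three rounds to 21
print(table_total)  # 12.6
```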

The second question is the more important–and less predictable–one. It’s hard enough to know how well a single player will be competing in three, four, or five years. (Or, sometimes, tomorrow.) But even if we could puzzle out that problem, we’d be left with the still more difficult task of predicting the level of competition. Entering the 2003 season, the monkey would have opined that the then-current crop of stars–men who made slam semis in 2001 and 2002–would account for a combined 13 majors between 2003 and 2007. That included 2.5 for Lleyton Hewitt, plus one apiece for Thomas Johansson, Albert Costa, Pete Sampras, Marat Safin, David Nalbandian, and Juan Carlos Ferrero. Those seven men won only two. The entire group of 20 players who merited forecasts entering the 2003 Australian Open won only three.

We’ll probably never establish exactly how strong that group was in comparison with other eras. What we know for sure is that none of those men were as good as Federer in 2003-05, and by the end of the five-year span, they’d been shunted aside by Nadal as well. (Only Nalbandian ranked in the 2007 year-end top ten.) The generation of Zverev/Tsitsipas/Auger-Aliassime/etc. won’t be as good as peak Big Four, but the course of the next 20 slams will depend a lot more on those players than it will on the (relatively) more predictable career trajectories of Djokovic, Federer, and Nadal.

So we’re left with a stack of known unknowns and error bars wider than a shanked Federer backhand. But based on what we do know, the top of the all-time slam leaderboard is going to get even more crowded. At least, that’s what the monkey says.