It’s one thing to predict a winner–it’s another thing to quantify how likely a player is to become that winner.

In most tennis tournaments, it’s not hard to pick a favorite. For most of the last year, it was Novak Djokovic, no matter the surface or who he might face. Before that, it was Federer on hard courts and Nadal on clay. While everyone likes to identify a dark horse, there’s rarely much debate at the top.

Given that agreement, though, what odds would you have placed on Novak Djokovic winning Wimbledon? Or the French? Or an in-form Federer winning the tour finals over an injured Djokovic and a tired Nadal? Usually, my numbers spit out something between 20 and 30 percent–in other words, even the best player in the tournament has a better than two-thirds chance of going home a loser.

Intuitively, this is difficult to believe. Djokovic seemed so dominant for much of the year that his slam victories felt like foregone conclusions. Anyone who watched Novak on a good day found it impossible to imagine anyone outplaying him. When Carl Bialik wrote a column asking whether Djokovic could keep up his dominance for the entire season, most responses were some variation of “What are you, stupid? Numbers are irrelevant when someone is so good.”

But all good things must come to an end, and a combination of injuries and good opponents proved that even Djokovic is human.

That said, Djokovic’s dominance–and Nadal’s before him, and Federer’s before him–raises questions about forecasting tennis matches. The questions are complicated, but rest easy: today’s attempt at an answer will be simple.

**Do the rules apply to the very best?**

My ranking and forecasting system starts by assigning a number to every player, not unlike ATP ranking points. To keep things simple, let’s use ranking points. If we want to predict the outcome of, say, Mardy Fish against Feliciano Lopez, we take their point totals (2965 and 1755) and divide one total by the sum of both: 2965/(2965+1755) = 62.8%. (It’s a little more complicated than that, but not much.) Setting aside concerns like home court advantage and surface, that sounds about right to me.
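That points-ratio arithmetic can be sketched in a couple of lines. To be clear, this is a toy illustration of the simple version described above, not my actual model:

```python
def win_probability(points_a: float, points_b: float) -> float:
    """Naive points-ratio estimate: player A's share of the combined point total."""
    return points_a / (points_a + points_b)

# Fish (2965 points) vs. Lopez (1755 points)
print(round(win_probability(2965, 1755), 3))  # 0.628
```

Swap in any two point totals and the same formula applies, which is exactly the “single formula” appeal discussed below.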

Do the same with Djokovic and Lopez, and you get 88.6%. Work the numbers with Djokovic and world #100 Michael Berrer, and you get 96.0%. That’s pretty dominant, suggesting that Berrer would win only 1 in 25 matchups, but wait a minute–we’re saying Berrer’s going to beat Djokovic, *ever*?

And therein lies the problem. The formulas I use to generate points and predictions are reasonably accurate, tested against years of ATP results. And in the aggregate, individual match percentages pass the smell test. But at the extremes, the numbers seem questionable.

And it is at the extremes where the exact percentages matter the most. Consider my pre-tournament predictions for Wimbledon this year. While Nadal was the top seed, I picked Djokovic as the favorite, giving him a 21.6% chance of winning. But look at those first few rounds: I gave him only an 87% chance of getting past Jeremy Chardy (*Jeremy Chardy!*) in the first round, then only an 88% chance of beating Kevin Anderson or Ilya Marchenko, then only an 85% chance of winning against (probably) Marcos Baghdatis.

Only the last of those three numbers is plausible. And when combined, they meant that I gave Djokovic less than a 65% chance of reaching the round of 16. With all due respect to myself, that was almost as ridiculous then as it sounds now.

It’s those early-round numbers that result in such minute chances that the favorite will win the tournament. Even if we give a player a 90% chance of winning *all* his matches, he’ll still only win the seven consecutive matches required for a grand slam 48% of the time. Lower it to 80%, and we’re down to 21% for the tournament. Since the odds of winning a semifinal match against the likes of Murray, Federer, or Nadal are probably much lower, it seems that early-round odds should be much more favorable.
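For concreteness, here’s the compounding arithmetic behind those numbers. The flat per-match probability is, of course, a back-of-the-envelope simplification:

```python
def title_chance(per_match_prob: float, rounds: int = 7) -> float:
    """Chance of winning `rounds` consecutive matches at a flat per-match probability."""
    return per_match_prob ** rounds

print(round(title_chance(0.90), 2))  # 0.48
print(round(title_chance(0.80), 2))  # 0.21

# The three early-round Wimbledon figures compound quickly, too:
print(round(0.87 * 0.88 * 0.85, 3))  # 0.651
```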

To summarize, one of two things is going on here. Either (1) my numbers underestimate the likelihood that the pre-tournament favorite wins a grand slam; or (2) our intuition overestimates the likelihood that the favorite takes home the trophy.

**Forecasting for dummies**

One way to pick between the two is to look at the recent past. Are pre-tournament favorites winning more or less than expected?

For now, let’s set aside the question of the likelihood that Djokovic beats Chardy or Marchenko, and look only at winning the tournament. We’re going to make two major assumptions here: (1) it’s possible to identify the pre-tournament favorite years later, and (2) favorites are generally created equal–Djokovic towers over his competitors to the same degree that Courier, or Lendl, or Sampras, or Federer towered over his. As usual, neither of these assumptions is strictly true, but they aren’t so hideously wrong that they’ll stop us from reaching some worthwhile conclusions.

There are three easy ways of picking the pre-tournament favorite for a grand slam: using (a) the winner of the last slam; (b) the defending champion; or (c) the top seed–almost always the world #1. The top seed is probably best, while the defending champion might identify a player who is particularly good on the surface, and the winner of the last slam might pick out someone who is riding a hot streak.

The last 21 years (back to 1991, inclusive) give us 84 slams to work with. Our sample is a bit smaller than that, because occasionally the winner of the last slam or the defending champion did not play, and on three occasions, the top seed pulled out before the tournament began. Here is how the favorites did:

- Of the 75 players who had won the previous slam, 18 (24%) won the tournament.
- Of the 76 defending champions, 26 (34%) won the tournament.
- Of the 81 top seeds, 29 (36%) won the tournament. If we exclude the French (where the top seed is often #1 on the basis of hard court performance), we get a more dramatic result here–26 of 60 (43.3%) won the tournament.

All of these measures are much higher than the 21.6% shot I gave Djokovic at Wimbledon. And most are higher than the 27-28% chances I gave him at the French and US Open. The 43.3% likelihood that the top seed wins a hard-court slam (thank you, Pete and Roger!) suggests that a more sophisticated measure of identifying the favorite might allow us to predict slam champions with, say, 40% accuracy.

40% is considerably higher than my models are spitting out right now, but I suspect it is much lower than many fans imagine for their favorite. It suggests that, at the extremes, my predictions aren’t quite one-sided enough. It might take Michael Berrer more than 25 chances before he finally catches Djokovic on a bad day.

Your relatively low probabilities of a particular favourite winning a tournament are just the flip side of our disproportionate surprise when a favourite gets upended by someone much less famous. Yet, on a given day, someone like Baghdatis, Lopez, or Granollers may outplay the best in the world. The human mind likes things simple and predictable, and this bias leads us to exaggerate the dominance of the top players.

Right. What’s interesting is that while each individual human mind likes things simple and predictable (e.g. I love Del Potro, so I imagine him coasting to victory in Melbourne), fan polls rarely look so clear. The poll on stevegtennis asking people to predict the winner of the AO has Fed at 31%, no one else over 20% … a distribution that doesn’t look much different from my slam forecasts.

[ http://www.stevegtennis.com/vote.htm — looks like you have to vote to see the results.]

But as you point out, very few individual fans would say, “I think my guy is a favorite, and he has a 30% chance of winning.”

Yes, I feel the same way about Del Potro, which makes it very hard to swallow when he meekly gets his butt kicked by someone like Blake or Gulbis. But what your odds tell us is that it is always quite possible.

Have you thought about tweaking your ranking system to be more in line with actual probabilities? I know you said that your rankings represent the data in the aggregate, but then you just laid out nicely how they didn’t. I guess I’m a little confused on that point. Is there a better way of predicting actual probabilities than simply using gambling odds (adjusted to remove the vig)? And if your data is a bit off from what the actual probabilities are, then what was the data you were using when coming up with your ranking system?

I guess I’m just thinking that such probabilities would be pretty easy to come up with. You know, how often does the #1-ranked player in the world actually lose to the #97-ranked player? Wouldn’t this be fairly easy to figure out?

Also, I love your blog. I’m so glad I stumbled upon this. Sooo cool.

Ideally, there’d be a single formula. #1 is rated at, say, 10,000 points, and #100 is rated at 500 points. Apply some formula to those two numbers and get a probability. Then, the same formula would work for the point totals for #10 vs #20, or #1 vs. #15, or #50 vs #70.

The problem is that the best single formula (that I’ve discovered) is very good for everything BUT the top few versus guys below, say, #20. For #1 vs #97, in your example, there would be very little *specific* data for that matchup. You could figure out how players have done, but if the historical record is 6-0 for #1, does that mean #1 has a 100% chance? 97%? 80%, with #1 having gotten lucky? To answer those questions you need to look at broader groups and aim for a more general solution.

That said, I have made some adjustments. Part of the problem is that favorites win more often in best-of-5 matches (see http://heavytopspin.com/2012/08/30/the-five-set-advantage-2/ ), and I wasn’t accounting for that. So obviously that affects forecasting slams. I’ve also tweaked the model somewhat to give the top 10 higher probabilities. That ruins the ‘one formula’ ideal, but it does get us closer.
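To illustrate the five-set effect with a toy model: if the favorite wins each set independently with some fixed probability (the 60% below is a made-up number, not one from my system), the longer format amplifies his edge on its own:

```python
from math import comb

def match_win_prob(p_set: float, best_of: int) -> float:
    """Probability of winning a best-of-N match, assuming i.i.d. sets
    each won with probability p_set (a deliberate simplification)."""
    droppable = best_of // 2   # sets the winner may lose: 1 in bo3, 2 in bo5
    need = droppable + 1       # sets required to win the match
    return sum(comb(j + need - 1, j) * p_set**need * (1 - p_set)**j
               for j in range(droppable + 1))

p = 0.60  # hypothetical per-set edge for the favorite
print(round(match_win_prob(p, 3), 3))  # 0.648
print(round(match_win_prob(p, 5), 3))  # 0.683
```

The gap widens as the per-set edge grows, which is why ignoring the best-of-5 format shortchanges favorites at slams in particular.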