The Unbreakable and Record-Setting Cincinnati Finalists

When Roger Federer and Novak Djokovic met in the Cincinnati final on Sunday, they represented a unique event in tennis history: Neither one had been broken.  Four matches each, no breaks of serve.

That’s not just a Masters-level record, it’s a first for the ATP tour, at least since 1991, the time span for which point-level stats are available.    That’s over 1500 tournaments, including nearly 200 Masters events.

It’s very rare to even come close.  Of the 195 Masters tournaments for which data is available, only four pairs of finalists entered the title match with three or fewer breaks.  Djokovic leads the pack: When he met Rafael Nadal in the 2011 Miami final, Nadal had been broken once, Djokovic not at all.  When Djokovic and Federer met in the 2007 Montreal final, each player had only been broken once.  The Miami achievement is particularly notable because each player had won five pre-final matches, compared to only four each in Cincinnati and Montreal.

Federer set some records on his own, as well.  By holding his serve against Djokovic, he made it through an entire Masters tournament without suffering a break.  That’s the first time it has ever happened at this level.  Eight other times the winner has only been broken once–twice that winner was Federer, including Cincinnati two years ago.  Ten additional times, the winner was only broken twice–and Roger is responsible for three of those.

At lower-level tournaments, it’s somewhat more common–the winner of a non-Masters event has made it through without losing serve a total of 17 times.  Surprise, surprise: Two of those are Federer, at Doha in 2005 and Halle in 2008.  Four other men have done it twice: Andy Roddick, Joachim Johansson, Richard Krajicek, and Ivan Ljubicic.  Milos Raonic did it earlier this year in Chennai.

Federer set at least one more record last week, and it might be the most impressive of all.  He only faced three break points all week–the lowest known total at a Masters tournament.  The previous record was four, set by Andre Agassi at the 2002 Madrid Masters.  Fed’s total in Cinci was only the 10th ever in single digits–and Roger is now responsible for four of those top ten results.

At lower-level events, Fed’s mark has been bettered a couple of times.  At the 2007 Memphis tournament, Tommy Haas claimed the trophy without facing a single break point.   At San Jose this year, Raonic faced only two break points, though Tobias Kamke converted one of them.  Two other players–Andy Murray at 2009 Queen’s Club and Roddick at Lyon in 2005–got through an event facing only three break points.

No breaks, and record-settingly few break points. If hard courts are truly becoming slower, it seems that someone forgot to tell Roger.

The Implications of the 10-Point Tiebreak

Italian translation at settesei.it

I’m not sure how we got here, but we now live in a world where a lot of people consider a 10-point tiebreak equivalent to a set.  Apparently it’s more fan-friendly and better for television.  And of course it’s faster.

Whatever its practical uses, it’s obvious that the first-to-10 breaker isn’t the same as a set.  I’ll leave the moral debate to others; let’s take a statistical approach.

In general, the more points (or games, or sets) required to win a match, the more likely it is that the better player wins.  Some commentators have taken to calling the 10-point breakers “shootouts,” and for good reason.  Reduce the number of points required to win, and you increase the role played by luck.

Of course, sometimes a shootout is the best idea.  You’ve got to end a match somehow, and when players end up equal after two sets, four sets, or four sets and twelve games, it’s all the more likely that luck will have to intervene.  But the structure of the match determines just how much luck is permitted to play a part.

To compare a 10-point tiebreak with the set it replaces, we need to know how much more luck it introduces into the game.  For that, we need an example to work with.

Take two players: Player A wins 70% of points on serve, and Player B wins 67% of points on serve.  Playing best of three tiebreak sets, Player A has a 63.9% chance of winning the match.

If A and B split sets, A’s probability of winning falls to 59.3%.  In other words, the shorter time frame makes it more likely that B gets lucky, or is able to put together an unusually good run of play long enough to win the match.

If the match is decided by a 10-point tiebreak, however, A’s probability of winning falls all the way to 56.0%, erasing more than one-third of the favorite’s edge in the third set.  In fact, the 10-point breaker is barely more favorable to A than a typical 7-pointer, in which A would have a 55.1% chance.

(If you like playing around with this stuff, see my python code to calculate tiebreak odds.)
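For readers who want to experiment right away, here is a minimal sketch of one way to compute tiebreak odds.  It is not the code linked above, just an independent illustration.  It assumes every point is independent, that Player A serves the first point, and that both serve percentages are strictly between 0 and 1; the function name tiebreak_win_prob and its parameters are made up for this example.

```python
from functools import lru_cache


def tiebreak_win_prob(p_a, p_b, target=7):
    """Probability that player A wins a first-to-`target`, win-by-two tiebreak.

    p_a: chance A wins a point on A's own serve.
    p_b: chance B wins a point on B's own serve.
    Assumes A serves the first point and all points are independent.
    """
    # From any tied score at target-1 or later, the next two points are
    # served once by each player. A wins by taking both, B wins by taking
    # both, and a split leaves the score tied again.
    a_takes_both = p_a * (1 - p_b)
    b_takes_both = (1 - p_a) * p_b
    from_tie = a_takes_both / (a_takes_both + b_takes_both)

    @lru_cache(maxsize=None)
    def prob(a, b):
        if a == b and a >= target - 1:
            return from_tie
        if a >= target and a - b >= 2:
            return 1.0
        if b >= target and b - a >= 2:
            return 0.0
        # Serve order: A once, then B twice, then A twice, and so on.
        a_serving = ((a + b + 1) // 2) % 2 == 0
        p_point = p_a if a_serving else 1 - p_b
        return p_point * prob(a + 1, b) + (1 - p_point) * prob(a, b + 1)

    return prob(0, 0)


seven = tiebreak_win_prob(0.70, 0.67, target=7)
ten = tiebreak_win_prob(0.70, 0.67, target=10)
print(f"7-point: {seven:.3f}, 10-point: {ten:.3f}")
```

Under these assumptions the two numbers should land close to the 7-point and 10-point figures quoted above, and changing target shows how slowly the favorite's edge grows as the tiebreak gets longer.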

Somehow I don’t think anyone would advocate replacing the deciding set with a 7-point tiebreak.  Yet a 10-point tiebreak is much closer to its 7-point cousin than it is to a full set.

Adding a few more points doesn’t resolve the discrepancy, either.  To maintain Player A’s 59.3% chance of winning, the third set would have to be replaced by a 26-point tiebreak.  But that, I’m sure, wouldn’t attract many new advertisers.

Does Cincinnati Matter in Flushing?

After months of clay and grass tournaments, the best players on tour are finally competing on hard courts.  For many, Cincinnati is the extent of their North American hard court preparation leading up to the US Open.  No matter who wins this week, we’ll be tempted to anoint him the favorite in New York.  Should we?

Traditionally, Cincinnati features one of the strongest draws of the ATP season.  It is the only tournament scheduled two weeks before the US Open, so players preparing for the slam have no alternatives, and it still allows a week off.  This year’s draw, missing three top 10 players due to injury, is an aberration.

It’s no surprise, then, that the list of winners in Cincinnati is particularly impressive.  19 of the last 20 champions have career peak rankings of 1 or 2.  (The black sheep in the group is Thomas Enqvist, who “only” reached #4.)  Not only do the best in the world show up to play, they show up to win.

More than some warmups, Cincinnati seems to tell us who is in form.  Let’s see if it tells us who is going to win the Open.

Since 1991, there have been four seasons when the same man lifted the trophy in Cincinnati and New York: Pat Rafter in 1998, Andy Roddick in 2003, and Roger Federer in 2005 and 2007.  Five more times, the Cincinnati winner reached the US Open final.  Not counting 1999, when Pete Sampras didn’t compete in Flushing, the Cincinnati champion has failed to reach the US Open round of 16 only twice in the last 21 years.

So, the Cincinnati winner has won the US Open about 20% of the time, and reached the final another 25%.  Sounds good, though not as good as we’d expect from the top seed.  On the other hand, Cincinnati winners aren’t always the top seed in New York, so we can’t expect them to perform at that level.

In fact, the Cincinnati winner has been the top seed in Flushing only six times.  On average, the Cinci champion has been seeded 4th in New York.  Compared to the performance we’d expect from a #4 seed, a 20% shot at winning the tournament, along with a nearly 1-in-2 chance of reaching the final, is extremely good.

Since 1991, #4 seeds at the US Open haven’t performed nearly so well during the final weekend as have Cincinnati champions.  Both groups have a roughly 6-in-10 chance of reaching the semis (#4 seeds: 57.1%, Cinci winners: 60%), but the #4 seeds have won only half of their semifinals, for a 28.5% chance of reaching the final, compared to the 45% of Cincinnati titlists.

The biggest difference is where it matters most: the final itself.  Cincinnati winners go on to win almost half of their US Open finals: as we’ve seen, they reached nine finals in 20 attempts and won four of them.  But #4 seeds have won only 2 titles.  It’s not a huge sample, but if we expand our view to consider all four slams since 1991, the performance of #4 seeds stays about the same.

Much to my surprise, it seems that Cincinnati results do have something to say about the final rounds in Flushing.  This week’s winner isn’t exactly a lock to triumph in New York, but his performance in Ohio will tell us to expect that much more from him at the US Open.

How Good is Brian Baker?

In his remarkable comeback this year, Brian Baker has already recorded two top-20 scalps, along with seven other victories against players in the top 100.   In the same span of six months, he’s also lost to a player barely inside the top 400, and suffered another six defeats against guys outside the top 100.

This is inconsistency of historic magnitude.  The list of players he’s beaten may actually be more impressive than the list of those who have beaten him!  Adding to the confusion, we don’t have any other recent results from him.  We can’t just wave our hands and point to his 2011 performance level as an accurate indicator of his current level.

One measurement of player ability, the ATP ranking system, places him at #78, a number that seems just as ridiculous when he’s beating Philipp Kohlschreiber at a Masters event as when he’s losing to Maxime Authom at a challenger.  But overall, the ATP estimate doesn’t seem too far-fetched.  It’s certainly better than what jrank (my rating system) spits out.  That algorithm doesn’t know what to do with such a limited track record, so it places him far outside the top 100.

We can do better.  As we’ll see, Baker’s results suggest he belongs on the cusp of the top 50.

Uniquely limited results

Imagine a completely unknown player is given a wild card into a major event.  We don’t know where he came from or who he might have beaten in the past.  He’s a completely blank slate.  If we wanted to estimate his ability level, we would have to wait until we got some results.

If that player won an opening-round match against the 17th-best player in the world, our best guess would be that he is better than #17, but we wouldn’t know how much.  If he lost that opening round match, we would assume he is worse than #17.  We might use statistics from that match to estimate how much better or worse than #17.

As our unknown kept playing more matches, we would update our estimate, using additional data as it came in.

(You might protest that in the early going, we should regress our estimate to the mean, since if some random guy came out of nowhere, he probably isn’t one of the 16 best tennis players in the world–there was a reason he was nowhere.  And, in such a real-world scenario, you would be right.  But in such a case, what is the mean?  If a baseball player is called up from Triple-A, an intelligent observer, such as a scout or team executive, considers him at least marginally MLB-level, so we would regress our estimate to the level of marginal MLB players.  But if a player receives a wild card into a tennis tournament, what do we know?)

Few tennis players in history have come closer to this unknown than Brian Baker.  Sure, everyone has to start somewhere, but usually “somewhere” is a long string of futures tournaments, followed by an even longer string of challengers.  By the time a player bags his first top-20 scalp, we have lots and lots of data to work with.

When other players were racking up several dozen matches every year, Brian Baker was rehabbing injuries and coaching college tennis.  We can only judge him based on a small number of recent results.  And those results are particularly contradictory.

Working backward

Intuitively, it’s tough to accept that a single player has beaten a bunch of good players and lost to several weaker ones.  No matter how good that guy is, such a set of outcomes is unlikely.

But how unlikely?  That question is the key to estimating Baker’s current level.

Rather than assuming Baker is playing at a certain level (like that of #78) and scratching our heads at his inconsistency, we can work backwards–take his results and determine the likelihood that he is playing at various levels.

For instance, we could assume that Baker is #5 in the world.  If so, some of his results would be very predictable (like the two wins against Blake Strode) and others would be particularly jarring.  We could go further and calculate the probability that the #5 player in the world would amass Baker’s specific match record.  Those odds, of course, are vanishingly small.

If you repeat the process for every possible ranking, you get a probability that #5, or #12, or #77 would win the matches Baker has won and lose the matches he has lost.  One of those probabilities will be higher than the others, and that’s our best guess of how highly we should regard the American.
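The working-backward idea can be sketched in a few lines of Python.  This is only an illustration of the general approach, not the method used for the estimate in this post: the win-probability formula (own points divided by combined points) is a stand-in assumption, the function names are invented, and the sample results are made up rather than taken from Baker's record.

```python
import math


def win_prob(own_points, opp_points):
    # Assumed stand-in model: chance of winning is own points over combined points.
    return own_points / (own_points + opp_points)


def log_likelihood(level, results):
    """Log-probability that a player at `level` produces exactly these results.

    results: list of (opponent_points, won) pairs.
    """
    total = 0.0
    for opp_points, won in results:
        p = win_prob(level, opp_points)
        total += math.log(p if won else 1.0 - p)
    return total


def best_level(results, candidates):
    # Try every candidate level and keep the one that makes the results most likely.
    return max(candidates, key=lambda level: log_likelihood(level, results))


# Made-up results: a win over a 500-point opponent, a loss to a 2000-point one.
sample_results = [(500, True), (2000, False)]
estimate = best_level(sample_results, range(100, 3001, 10))
print(estimate)
```

With a real match list and a better-calibrated win-probability model in place of the stand-in, the same loop yields the single level that best explains a mixed bag of wins and losses.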

(If you’re interested in the methodology, it is covered in the full version of this post.)

Using this method, we discover that Baker has played at the level of someone with about 820 ATP ranking points, putting him around #54, in a tight pack with Grigor Dimitrov, Gilles Muller, Alejandro Falla, and Lukas Lacko.  With every match he plays, we can continue to fine-tune our estimate.

There are many factors we need to ignore to do an analysis like this, largely because of the limited data that led us to the topic in the first place.  Many of Baker’s worst results have come on hard courts; perhaps he will prove over a longer period to be stronger on clay and grass.  If his ability level has changed over the last six months, as seems very likely, this approach fails to take it into consideration.

But because of the unique nature of Baker’s comeback, which makes it difficult to assume anything about his ability level, this approach allows us to make a reasonably good guess.  And with such a strange mix of great wins and rough losses, a good guess is all we can hope for.


Tommy Haas: Old and Winning

For all the talk of 30-somethings at the top of the modern men’s game, tennis players decline quickly.  30 may be the new 20, but 35 is still the same old 35, and 35-year-old tennis players are usually found on the champions tour, the doubles court, or national television.

Yet Tommy Haas, aged 34 years and 5 months, is enjoying a resurgence, having reached three finals in the last two months–on three different surfaces.  He’s one of the hottest players on tour of any age.

34-year-olds don’t do things like that.  In the last ten years, players 34 and older have accounted for fewer than 1% of wins on the ATP tour.  From 2008 to 2011, all 34-year-olds–combined–won a total of 17 tour-level matches.  In the five months since his birthday, Haas has won 22.

To find a point of comparison, we need to go back five years, to the 2007 campaign of Fabrice Santoro, and slightly earlier, to Andre Agassi’s 2004 season.  Agassi at 34 was better than Haas at 34, winning 37 tour-level matches and reaching two grand slam quarterfinals.  Agassi was the best “old” player since Jimmy Connors and the only man in the discussion since the 1970s.

Yet already, Haas is among the best 34-and-overs in ATP history.  His 22 wins since his 34th birthday are good for 28th on the all-time list, ahead of Fred Stolle and just behind Roy Emerson.  But that understates Haas’s accomplishment.  With the exceptions of Santoro, Agassi, and Connors (whose 178 wins-past-34 are good for 2nd on the all-time list, behind Ken Rosewall), everyone on the list retired more than 20 years ago.

Comparisons to Haas’s contemporaries do a better job of illustrating how unusual he is.  The only two older men to have won a match on tour this year are Arnaud Clement and Ruben Ramirez Hidalgo, neither of whom is a factor anywhere but the challenger tour.  The other 34-year-old to win some matches this season is hyper-fit warrior Michael Russell, who took advantage of the weak draws in Atlanta and Los Angeles.

As long as he stays healthy, Haas is far from finished.  According to Jrank, he’s the 11th-best hard court player in the game right now. He may not have another grand slam final ahead of him, as Agassi did at the same age, but he has more wins in his future than most players a decade his junior.

The Hangover Effect of a Marathon Fifth Set

Italian translation at settesei.it

Marathon sets are again the talk of tennis.  We won’t soon forget Roger Federer’s 19-17 third-set win over Juan Martin Del Potro … or Roger’s weak performance in the match that followed.

The unusual Olympic format–best-of-three, no final-set tiebreak–brought several issues to the fore.  Should best of three be enough for slams?  It certainly gave us plenty of dramatics last week.  And is it finally time to end the no-tiebreak madness?  For all of the occasional drama, do we really need to see even more service holds in John Isner matches?

Peter Bodo makes the case for a marathon-free world:

[M]y main reason for embracing the final-set tiebreaker is not the obvious one that would be cited by most time-sensitive television producers. The real problem with deuce sets is that when a match goes as long as Federer v. Delpo or even Jo-Wilfried Tsonga v. Milos Raonic (that one went 25-23, for Tsonga) the reward for the winner’s heroic feat is almost always a quick subsequent loss.

As Bodo goes on to illustrate, this seems anecdotally true.  But who cares about anecdotes?  This is a testable hypothesis.

As we’ll see, there is a noticeable hangover effect when a player has fought through a marathon fifth set.  But the alternative–a fifth-set tiebreak–produces nearly the same hangover.

There have been 146 marathon fifth sets–matches in which the final set reached 6-6–in Grand Slam tennis since the beginning of 2001.  The record of those 146 winners in their next round is dreadful: 43-103, or 29.5%.  It’s even worse than that, actually.  Four times, two marathon men went on to play each other, so four of those wins were inevitable.

However, that isn’t the end of the story.  To prove that fifth-set marathons significantly weaken their winners, we need to establish two things: (1) They had a decent shot at beating their next opponents anyway, and (2) if a fifth-set tiebreak were played, their chances would have been better.

Post-marathon underdogs

The first issue is a bit sneaky.  If a player has to go deep into the fifth set to win in the early rounds, he’s hardly a dominating presence in the draw.  Consider the extreme case of Yen Hsun Lu, who, in 2010, beat Andy Roddick in a 9-7 fifth set, advancing to play Novak Djokovic in the Wimbledon quarters.  Sure, Lu was tired, but what were the odds of an upset even if Roddick lost in three?  Top players rarely need five hours to push through an early-round opponent.

To quantify this, we can turn to jrank-driven predictions.  Using these measures of each player’s ability level at the time of the match, we can estimate the actual chances of our 146 marathon men.

The marathon men would have been underdogs in their next match no matter what.  On average, each one had a 43.4% chance of winning, meaning that of the 146 matches, they should have won 63 of them.  Even adjusting for their underdog status, they seem to have suffered from their marathons–they won 43 of those matches, barely two-thirds of the number that they “should” have won.

Almost-but-not-quite marathons

We’ve established that once a player enters the uncharted territory beyond 6-6, his chances of winning the next match are substantially weakened.  But surely the fatigue didn’t set in right at the moment the chair umpire called “6-6.”  Even if the fifth set is a bagel, simply playing five sets of professional-level tennis is exhausting, and might impact one’s performance a day or two later.

The most relevant set of matches for comparison are US Open five-setters that went to a final-set tiebreak.  Since 2001, we have 40 of those.  In their next matches, the winners of the almost-marathons went a dismal 11-29 (27.5%)–worse than the marathon men!

Compared to their expectations, though, they did a bit better.  Those forty men, on average, had a 38% chance of winning their next matches, meaning we would expect them to win about 15 of the 40.  Relative to the predictions we would have made at the time, this small sample of fifth-set-tiebreak winners outperformed the marathon men, but just barely.

For a bigger sample, we can turn to the slightly shorter–but still epic–matches that end 7-5 in the fifth.  Of the 95 such matches since 2001, the 7-5 winners went on to win 49, or 51.5% of their next matches!  This despite the fact they were collective underdogs, expected to win only 48%, or 46 of those matches.
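The three comparisons so far all rest on the same arithmetic: multiply the number of matches by the average pre-match win probability to get expected wins, then compare with the actual total.  A short sketch using only the figures quoted above (the helper name expected_vs_actual is made up for illustration):

```python
# (matches, average pre-match win probability, actual wins), as quoted in the text
groups = {
    "marathon (past 6-6)": (146, 0.434, 43),
    "fifth-set tiebreak": (40, 0.38, 11),
    "7-5 in the fifth": (95, 0.48, 49),
}


def expected_vs_actual(matches, avg_win_prob, actual_wins):
    """Return expected wins and the ratio of actual wins to expected wins."""
    expected = matches * avg_win_prob
    return expected, actual_wins / expected


for label, (matches, avg_prob, wins) in groups.items():
    expected, ratio = expected_vs_actual(matches, avg_prob, wins)
    print(f"{label}: expected {expected:.1f}, actual {wins}, ratio {ratio:.2f}")
```

A ratio below 1 means the group won fewer matches than the forecasts suggested; a ratio above 1 means it beat expectations.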

What now?

Since the 7-5 group performed so differently in their next matches, it’s tempting to speculate why they did so.  My best guess: If a player manages a break before the set goes 6-6, he’s relatively fresh, physically and mentally.  The sort of player who can break at 5-5 or 6-5 is one who can come back a day or two later and plow through another three or four hard-fought sets.

By contrast, matches that get to 6-6–whether they end in a tiebreak or not–are usually battles of attrition.  Think Isner-Mahut: The longer it lasted, the less likely either player could challenge the other’s serve.  That brand of tennis had set in before 6-6 in the fifth: If one of the players pulled out a 7-4 tiebreak, it wouldn’t say much about his fitness or mental stamina, simply that someone is bound to get lucky for a point or two.

Based on the limited data we have, there just isn’t much difference between the after-effects of fifth-set marathons and fifth-set tiebreaks.  In both cases, the marathon men weren’t going to be favored anyway, and their fatigue hurts them even more.  Changing format to fifth-set tiebreaks would have little effect on future outcomes–it would just make those matches a bit more dependent on a lucky bounce.

The Tournament Simulation Reference

Italian translation at settesei.it

Among the more popular features of Heavy Topspin are my tournament forecasts, based on draw simulations.  It’s about time that I summarize how these work.

Monte Carlo simulations

To generate tournament predictions, we first need a way to predict the outcome of individual matches.  For that, I use jrank, which I’ve written about elsewhere.  With numerical estimates of a player’s skill–not unlike ATP ranking points–we can calculate the probability that each player wins the match.

Once those matchup probabilities are calculated, it’s a matter of “playing” the tournament thousands upon thousands of times.  Here, computers come in awfully handy.

My code (a version of which is publicly available) uses a random-number generator (RNG) to determine the winner of each match.  For instance, at the top of the Rogers Cup draw this week, Novak Djokovic gets a bye, after which he’ll play the winner of Bernard Tomic’s match with Michael Berrer.  My numbers give Tomic a 64% chance of beating Berrer.  To “play” that match in a simulated tournament, the RNG spits out a number between 0 and 1.  If the result is below .64, Tomic is the winner; if not, Berrer wins.

The winner advances to “play” Djokovic.  The code determines Djokovic’s probability of beating whoever advances to play him, then generates a new random number to pick the winner.  Repeat the process 47 times–once for each match–and you’ve simulated the entire tournament.

Each simulation, then, gives us a set of results.  Perhaps Tomic reaches the second round, losing to Djokovic, who then loses in the quarters to Juan Martin Del Potro, who goes on to win the tournament.   That’s one possibility–and it’s more likely than many alternatives–but it doesn’t tell the whole story.

That’s why we do it thousands (or even millions) of times.  Over that many simulations, Delpo occasionally wins, but somewhat more often, Djokovic wins that quarterfinal showdown.  Tomic usually reaches the second round, but sometimes it’s Berrer into the second round.  All of these “usually’s” and “sometimes’s” are converted into percentages based on just how often they occur.
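To make the mechanics concrete, here is a stripped-down sketch of the simulation loop.  It is not the publicly available code mentioned above; the function names are invented, the bracket is a toy four-slot section of the draw, and the two Djokovic probabilities are hypothetical placeholders (only the 64% Tomic figure comes from the text).  A bye is represented by None.

```python
import random
from collections import Counter


def play_match(p1, p2, win_prob, rng):
    # A bye (None) sends the other player through without a match.
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    return p1 if rng.random() < win_prob(p1, p2) else p2


def simulate_tournament(draw, win_prob, rng):
    """Play one single-elimination bracket; `draw` lists slots top to bottom."""
    field = list(draw)
    while len(field) > 1:
        field = [play_match(field[i], field[i + 1], win_prob, rng)
                 for i in range(0, len(field), 2)]
    return field[0]


def title_odds(draw, win_prob, n_sims=10000, seed=0):
    """Share of simulations in which each player wins the bracket."""
    rng = random.Random(seed)
    wins = Counter(simulate_tournament(draw, win_prob, rng) for _ in range(n_sims))
    return {player: wins[player] / n_sims for player in draw if player is not None}


PAIR_PROBS = {
    ("Tomic", "Berrer"): 0.64,     # from the text
    ("Djokovic", "Tomic"): 0.85,   # hypothetical
    ("Djokovic", "Berrer"): 0.90,  # hypothetical
}


def pair_win_prob(p1, p2):
    # Probability that p1 beats p2, looked up in either order.
    if (p1, p2) in PAIR_PROBS:
        return PAIR_PROBS[(p1, p2)]
    return 1.0 - PAIR_PROBS[(p2, p1)]


odds = title_odds(["Djokovic", None, "Tomic", "Berrer"], pair_win_prob)
print(odds)
```

A full forecast would swap the lookup table for a rating-based probability function and the toy bracket for the complete draw; the loop itself stays the same.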

Probability adjustments

For any given pair of players, we don’t always expect the same outcome.  Pablo Andujar is almost always the underdog on hardcourts, but we expect him to beat most mid-packers on clay.  Players perform (a bit) better in their home country.  Qualifiers do worse than equivalent players who didn’t have to qualify.

Thus, if we take last week’s Washington field and transplant it to the clay courts of Vina Del Mar, the numbers would change a great deal.  Americans and hard-court specialists would see their chances decrease, while Chileans and clay-courters would see theirs increase–just as conventional wisdom suggests would happen.

Simulation variations: Draw-independence

Some of the more interesting results come from messing around with the draw.  Every time a field is arranged into a bracket, there are winners and losers.  Whoever is drawn to face the top seed in the first round (or second, as Berrer and Tomic can attest) is probably unlucky, while somewhere else in the draw, a couple of lucky qualifiers get to play each other for a spot in the second round.

That’s one of the reasons I sometimes run draw-independent simulations (DIS).  If we want to know how much the draw helped or hurt a player, we need to know how successful he was likely to be before he was placed in the draw.  (DISs are also handy if you know the likely field, but the draw isn’t yet set.)

To run a draw-independent sim, we have to start one step earlier.  Instead of taking the draw as a given, we take the field as a given, including the seedings if we know them.  Then we use the same logic as tournament officials will use in constructing the draw.  The #1 seed goes at the top, #2 at the bottom.  #3 and #4 are randomly placed in the remaining quarters.  #5 through #8 are randomly placed in the remaining eighths, and so on.

(Update: I’ve published a python function, reseeder(), which generates random draws for any combination of number of seeds and field size that occurs on the ATP tour.)
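The placement logic described above can be sketched as follows.  This is an independent illustration, not the reseeder() function itself.  It assumes the draw size and the number of seeds are both powers of two, that there are no byes, and that each seed sits at the outer edge of its section; the function name and arguments are made up for the example.

```python
import random


def draw_independent_bracket(seeds, unseeded, rng):
    """Build one random bracket that respects the seeding tiers.

    seeds: players in seed order (#1 first); unseeded: everyone else.
    """
    size = len(seeds) + len(unseeded)
    bracket = [None] * size
    bracket[0], bracket[-1] = seeds[0], seeds[1]  # #1 at the top, #2 at the bottom

    parts, start = 4, 2  # next tier: seeds 3-4 into quarters, then 5-8 into eighths
    while start < len(seeds):
        tier = seeds[start:start + parts // 2]
        width = size // parts
        # Sections of the draw that do not yet contain a seed.
        empty = [k for k in range(parts)
                 if all(slot is None for slot in bracket[k * width:(k + 1) * width])]
        rng.shuffle(tier)
        for seed, k in zip(tier, empty):
            # Top slot of even sections, bottom slot of odd sections.
            slot = k * width if k % 2 == 0 else (k + 1) * width - 1
            bracket[slot] = seed
        start += parts // 2
        parts *= 2

    # Everyone else fills the remaining slots at random.
    rest = list(unseeded)
    rng.shuffle(rest)
    fill = iter(rest)
    return [p if p is not None else next(fill) for p in bracket]


seeds = [f"S{i}" for i in range(1, 9)]
others = [f"U{i}" for i in range(1, 25)]
bracket = draw_independent_bracket(seeds, others, random.Random(0))
print(bracket)
```

Generating a fresh bracket like this for every simulated tournament, instead of reusing the actual draw, is what makes the resulting forecast draw-independent.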

Simulation variations: Seed-independence

We can take this even further to measure the beneficial effect of seeding.  Most of the time we take seeding for granted–we want the top two players in the world to clash only in the final, and so on.  But it can have a serious effect on a player’s chances of winning a tournament.  In Toronto this week, the top 16 seeds (along with, in all likelihood, a very lucky loser or two) get a bye straight into the second round.  That helps!

Even when there are no byes, seedings guarantee relatively easy matches for the first couple of rounds.  That may not make a huge difference for someone like Djokovic–he’ll cruise whether he draws a seeded Florian Mayer or an unseeded Jeremy Chardy.  But if you are Mayer, consider the benefits.  You’re barely better than some unseeded players, but you’re guaranteed to miss the big guns until the third round.

This is why we talk so much about getting into the top 32 in time for slams.  When the big points and big money are on the line, you want those easy opening matches even more than usual.  There isn’t much separating Kevin Anderson from Sam Querrey, but if the US Open draw were held today, Anderson would get a seed and Querrey wouldn’t.  Guess who we’d be more likely to see in the third round!

To run a seed-independent simulation: Instead of generating a logical draw, as we do with a DIS, generate a random draw, in which anyone can face anyone in the first round.

Measuring variations

If we compare forecasts based on the actual draw to draw-independent or seed-independent forecasts, we want to quantify the difference.  To do so, I’ve used two metrics: Expected Ranking Points (ERP) and Expected Prize Money (EPM).

Both reduce an entire tournament’s worth of forecasts to one number per player.  If Djokovic has a 30% chance of winning this week in Toronto, that’s the probability he’ll take home 1,000 points.  If those were the only points on offer, his ERP would be 30% of 1,000, or 300.

Of course, if Djokovic loses, he’ll still get some points.  To come up with his overall ERP, we consider his probability of losing the finals and the number of points awarded to the losing finalist, his probability of losing in the semis and the number of points awarded to semifinalists, and so on.  To calculate EPM, we use the same process, but with–you guessed it–prize money instead of ranking points.
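As a worked example of that calculation, the sketch below turns a list of "reached at least this stage" probabilities into a single expected value.  The function name is made up, and the probabilities and payouts in the example are illustrative placeholders rather than real forecasts or an official points table.

```python
def expected_value(at_least_probs, payouts):
    """Expected points (or prize money) from round-by-round forecasts.

    at_least_probs[i]: probability of finishing with result i or better.
    payouts[i]: points or money awarded for finishing with exactly result i.
    Results are ordered from worst to best, with the champion last.
    """
    total = 0.0
    for i, payout in enumerate(payouts):
        better = at_least_probs[i + 1] if i + 1 < len(at_least_probs) else 0.0
        # Probability of finishing with exactly this result.
        exactly = at_least_probs[i] - better
        total += exactly * payout
    return total


# Illustrative only: a player already in the semis, 50% to reach the final,
# 30% to win the title; payouts for semifinalist, finalist, champion.
erp = expected_value([1.0, 0.5, 0.3], [360, 600, 1000])
print(round(erp, 1))
```

Feeding in prize money instead of ranking points gives EPM with no other changes.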

Both numbers allow us to see how much the draw helps or hurts a player.  For instance, before the French Open, I calculated that Richard Gasquet’s EPM rose by approximately 25% thanks to a very lucky draw.

These numbers also help us analyze a player’s scheduling choices.  The very strong Olympics field and the much weaker Washington field last week created an odd situation: Lesser players were able to rack up far more points than their more accomplished colleagues. Even before the tournament, we could use the ERP/EPM approach to see that Mardy Fish could expect 177 points in Washington while the far superior David Ferrer could expect only 159 in London.

If you’ve read this far, you will probably enjoy the newest feature on TennisAbstract.com–live-ish forecast updates for all ATP events.  Find links on the TA.com homepage, or click straight to the Rogers Cup page.

Serving First in Marathon Sets

Italian translation at settesei.it

Last night, when Jo Wilfried Tsonga finally defeated Milos Raonic, it was on a match-ending break of serve.  Conventional wisdom suggests that’s often how it goes.  Whoever serves first in a long set seems to have the advantage.  There’s less pressure to hold serve at 7-7 (or 47-47) than there is at 7-8.

Tsonga won his contest with a match-ending break point; Isner finished off his 70-68 set on Mahut’s serve; and when Federer and Roddick went to 14-14 in the 2009 Wimbledon final, Roger held for 15-14 before breaking the American.  Is it a trend?

As it turns out, those three high-profile matches have misled us.  Based on the limited data available, the first server in fifth-set epics has little or no advantage.

(Third-set epics are so rare that we might as well ignore them–the Olympics is the only tournament where men play best-of-three with no tiebreak in the final set.)

We don’t know who served first for every marathon fifth set in tennis history, but we can figure it out for some.  The ATP has limited stats for most matches back to 1991, and those stats include numbers of service games.  When the number of service games is equal for both players, we’re stuck at square one.  When one player has more than the other, that guy must have served the first game of the match–and the last.  Since marathon sets must contain an even number of games, we know who served first in the final set.
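Spelled out as code, the inference looks something like this.  It is a sketch of the reasoning in the paragraph above, under the assumptions that service alternates strictly from the first game of the match to the last and that the final set has an even number of games; the function name and the "A"/"B" labels are invented for the example.

```python
def first_server_in_final_set(service_games_a, service_games_b):
    """Infer who served first in the final set from service-game totals.

    Returns "A", "B", or None when the totals are equal and nothing can be inferred.
    """
    if service_games_a == service_games_b:
        return None
    # The player with more service games served both the first and the last
    # game of the match.
    served_last = "A" if service_games_a > service_games_b else "B"
    # An even-length final set starts and ends with different servers, so the
    # first server of that set is whoever did not serve the last game.
    return "B" if served_last == "A" else "A"


print(first_server_in_final_set(40, 39))
```

If the source stats leave tiebreaks out of the service-game counts, the totals would need adjusting before this rule could be applied.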

The result is a pool of 138 matches in which the fifth set ended at 8-6 or higher and we know who served first.  Of those, the guy who served first–at 0-0, 1-1, 6-6, and so on–won the match 67 times (48.6%).  It’s a coin toss.

If we take pressure out of the equation, this makes perfect sense.  If two guys have gotten to 6-6 in the fifth set, they’re playing as equally as two tennis players can play.  It’s only when we consider the stress of serving to stay in the match that we start to suspect that one player–but not the other–won’t be able to hold up his end.

For a bigger dataset, we can look to similar situations.  Consider 5-setters that end 7-5 in the fifth.  Those don’t have the cachet of matches that go farther, but they are quite epic in their own right.  We know who served first in 86 such matches, and of those, the man who served first won only 38 (44.2%).  It’s not exactly proof that the first server has a disadvantage, but it does cast more doubt on the conventional wisdom.

If we want more than 200 or so matches, we need to weaken our definition of “epic.”  Tiebreaks aren’t relevant here, since we’re looking for instances where one player was broken under pressure.  But we can use best-of-three contests that ended 7-5.

With so many more best-of-three matches on the schedule, our dataset is now much bigger.  We know who served first for 753 tour-level matches that ended 7-5 in the third.  Of these, the player who served first went 412-341, winning nearly 55% of matches.

If you want evidence that the conventional wisdom is correct, there you go.  Pooling all three samples, if a match reaches 5-5 in the deciding set and ends with a break, there is a 53% chance that the first server wins.
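For anyone who wants to check the arithmetic, here is a short Python sketch that recomputes the three rates and the pooled figure from the counts quoted above.  The standard error is my own rough addition (a normal approximation to the binomial), included only to show how much wiggle room the smaller samples leave.

```python
from math import sqrt

# (first-server wins, matches) as quoted in the text
samples = {
    "fifth set, 8-6 or longer": (67, 138),
    "fifth set, 7-5": (38, 86),
    "third set, 7-5": (412, 753),
}

def rate_and_error(wins, matches):
    """Win rate plus a rough binomial standard error (normal approximation)."""
    rate = wins / matches
    return rate, sqrt(rate * (1 - rate) / matches)

pooled_wins = sum(w for w, _ in samples.values())
pooled_matches = sum(n for _, n in samples.values())

for label, (wins, matches) in samples.items():
    rate, err = rate_and_error(wins, matches)
    print(f"{label}: {rate:.1%} (+/- {err:.1%})")

pooled_rate, pooled_err = rate_and_error(pooled_wins, pooled_matches)
print(f"pooled: {pooled_wins}/{pooled_matches} = {pooled_rate:.1%}")
```

With only 138 matches, the standard error on the marathon-set rate is roughly four percentage points, so 48.6% can't really be told apart from 50%.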

But with our more limited data, it’s impossible to draw the same conclusion about five-setters once they head into the barely-charted territory beyond 6-6.

2012 Olympics Round of 16 Forecasts

Here are my forecasts for the remaining 16 players in both Olympics singles draws.  Note that Djokovic has opened up a bigger gap over Federer.  Novak is aided by Berdych’s upset, while Federer is still likely to play the top seeds in his half.

On the women’s side, the third quarter is a crowded one, with Clijsters, Sharapova, and two dangerous floaters in Ivanovic and Lisicki.

For more background, you can see my initial forecasts, (almost) current rankings, and methodology.

Men:

Player                       QF     SF      F      W  
(1)Roger Federer          85.3%  64.5%  45.1%  25.7%  
Denis Istomin             14.7%   5.0%   1.5%   0.3%  
(10)John Isner            53.5%  16.9%   7.5%   2.4%  
(7)Janko Tipsarevic       46.5%  13.5%   5.6%   1.7%  
(4)David Ferrer           63.3%  36.3%  16.2%   6.7%  
(15)Kei Nishikori         36.7%  16.0%   5.2%   1.6%  
(12)Gilles Simon          32.3%  11.7%   3.3%   0.8%  
(8)Juan Martin Del Potro  67.7%  36.0%  15.5%   6.2%  

Player                       QF     SF      F      W  
Steve Darcis              39.5%   8.9%   1.5%   0.3%  
(11)Nicolas Almagro       60.5%  18.1%   4.2%   1.3%  
Marcos Baghdatis          22.7%  11.9%   2.7%   0.7%  
(3)Andy Murray            77.3%  61.1%  29.8%  16.4%  
(5)Jo-Wilfried Tsonga     67.5%  23.3%  12.0%   5.4%  
Feliciano Lopez           32.5%   6.9%   2.4%   0.7%  
(WC)Lleyton Hewitt         4.6%   0.6%   0.1%   0.0%  
(2)Novak Djokovic         95.4%  69.3%  47.3%  29.7%

Women:

Player                 QF     SF      F      W  
Victoria Azarenka   78.9%  53.3%  28.2%  18.0%  
Nadia Petrova       21.1%   7.9%   1.9%   0.6%  
Venus Williams      16.8%   2.5%   0.3%   0.1%  
Angelique Kerber    83.2%  36.3%  14.8%   7.6%  
Serena Williams     75.9%  56.2%  36.9%  26.2%  
Vera Zvonareva      24.1%  11.5%   4.4%   1.9%  
Daniela Hantuchova  36.2%   9.1%   2.9%   1.1%  
Caroline Wozniacki  63.8%  23.2%  10.6%   5.3%  

Player                 QF     SF      F      W  
Kim Clijsters       62.5%  33.2%  20.3%   8.9%  
Ana Ivanovic        37.5%  15.4%   7.4%   2.5%  
Sabine Lisicki      36.8%  15.7%   7.7%   2.5%  
Maria Sharapova     63.2%  35.6%  22.2%  10.0%  
Petra Kvitova       65.5%  45.7%  23.9%  10.2%  
Flavia Pennetta     34.5%  18.9%   7.0%   1.9%  
Maria Kirilenko     47.5%  16.2%   5.0%   1.2%  
Julia Goerges       52.5%  19.3%   6.6%   1.8%

Who Benefits From Byes?

Italian translation at settesei.it

Roughly two-thirds of ATP tour-level tournaments have byes in the draw.  31 events–including the two this week, in Kitzbuhel and Los Angeles–have 28-man fields, with first-round byes for the top four seeds.

The obvious beneficiaries are the top four seeds.  They get free passes into the second round, eliminating the chance they’ll be handed a first-round exit.  It’s also a guarantee of greater prize money and more ranking points.  First-round byes are such a feature of the ATP tour, at least in part, because they help smaller tournaments convince big-name players to sign up.

Of course, you can’t simply hand an advantage to the top four seeds without affecting others.  In this most common format, a 28-man field with eight seeds and four byes, there are three important groups: The top four seeds, the bottom four seeds, and the rest of the field.

The top four seeds: The main effect of byes on the top four seeds is that, as noted, they don’t have to play first-round matches.  The extent of that effect depends on how much of a threat the first-rounder would’ve been.

To quantify these effects, I ran simulations for the 2012 Estoril tournament.  First, I simulated the draw as the tournament was played, with 28 players and top seeds of Juan Martin Del Potro, Richard Gasquet, Stanislas Wawrinka, and Albert Ramos.  Second, I added the next four players on the alternate list to the draw in place of the byes.  To eliminate any bias stemming from the specific arrangement of the draw, I re-generated the brackets for each simulation.
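For readers who want to try something similar, here is a rough Python sketch of this kind of simulation.  It is not the code behind the numbers in this post.  The Elo-style rating model, the points scale, the simplified seed placement, and the made-up field (S1 to S8, U1 to U24) are all assumptions for illustration, and any special handling of points for a seed who loses his first match after a bye is ignored.

```python
import random

# Assumed points by round reached: R32 loss, R16, QF, SF, F, title.
POINTS = [0, 20, 45, 90, 150, 250]

def win_prob(rating_a, rating_b):
    """Elo-style logistic chance that A beats B (a stand-in rating model)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def make_draw(seeds, unseeded, byes, rng):
    """Fill 32 slots, one seed per four-slot section; None marks a bye."""
    draw = [None] * 32
    middle, low = [3, 4], [1, 2, 5, 6]
    rng.shuffle(middle)
    rng.shuffle(low)
    sections = [0, 7] + middle + low  # sections for seeds 1 through 8
    for player, section in zip(seeds, sections):
        draw[4 * section] = player
    # With byes, the slot beside each top-four seed stays empty.
    blocked = {4 * s + 1 for s in sections[:4]} if byes else set()
    open_slots = [i for i in range(32) if draw[i] is None and i not in blocked]
    fill = list(unseeded)
    rng.shuffle(fill)
    for slot, player in zip(open_slots, fill):
        draw[slot] = player
    return draw

def play(draw, ratings, rng):
    """Play one bracket; return {player: index into POINTS}."""
    reached = {p: 0 for p in draw if p is not None}
    field, round_no = list(draw), 0
    while len(field) > 1:
        round_no += 1
        winners = []
        for a, b in zip(field[0::2], field[1::2]):
            if a is None or b is None:
                winner = a if b is None else b  # bye: advance unopposed
            elif rng.random() < win_prob(ratings[a], ratings[b]):
                winner = a
            else:
                winner = b
            reached[winner] = round_no
            winners.append(winner)
        field = winners
    return reached

def expected_points(seeds, unseeded, ratings, byes, runs=2000, seed=1):
    """Average points per player over `runs` freshly generated draws."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in list(seeds) + list(unseeded)}
    for _ in range(runs):
        draw = make_draw(seeds, unseeded, byes, rng)
        for player, rnd in play(draw, ratings, rng).items():
            totals[player] += POINTS[rnd]
    return {p: t / runs for p, t in totals.items()}

# Made-up field: eight seeds and 24 others with invented ratings.
seeds = [f"S{i}" for i in range(1, 9)]
others = [f"U{i}" for i in range(1, 25)]
ratings = {p: 2100 - 40 * i for i, p in enumerate(seeds)}
ratings.update({p: 1750 - 10 * i for i, p in enumerate(others)})

with_byes = expected_points(seeds, others[:20], ratings, byes=True)
without_byes = expected_points(seeds, others, ratings, byes=False)
for p in ("S1", "S4", "S5", "U1"):
    print(p, round(with_byes[p], 1), round(without_byes[p], 1))
```

Each run builds a fresh bracket, so no single lucky or unlucky draw drives the averages.  Under this points scale, the total handed out per tournament is the same with or without byes, and only the way it is split among players changes.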

In the 32-man field, Delpo won his first-round match about 90% of the time, Gasquet and Wawrinka about 80%, and Ramos just under 60%.  Accordingly, Delpo didn’t benefit too much from the bye, but Ramos gained enormously.

However, when measured by expected ranking points, none of these four men gained as much as skipping the first round would suggest.  For instance, if Delpo would win only 90% of his first-round matches, removing that impediment would be expected to raise his other outcomes by (1/0.9 – 1), or 11%.  In fact, in the 28-man simulation, he gained only 9.5% over his 32-man expectation.

The slight difference is due to the other top seeds.  If Delpo is more likely to reach, say, the semifinals, then the same effect applies to Gasquet and Wawrinka, the two men who would be most likely to knock him out of the draw.  So while the bye itself increases Delpo’s expected ranking points by 11%, the increased probability of facing the other top seeds reduces it a bit.

Still, the net effect on the top four seeds is overwhelmingly positive.  For Gasquet and Wawrinka, the bye itself increases their expectations by 27% each, for a net effect of 24%, while for Ramos, the bye is a 74% increase, resulting in a net effect of 70%.
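The (1/p – 1) figure follows if a first-round loss is worth nothing: a seed’s 32-man expectation is p times whatever he expects once he is in the second round, and the bye simply removes the factor of p.  Here is a small Python sketch of that arithmetic, using only the percentages quoted above.  The function names are mine, and the zero-points-for-a-first-round-loss premise is an assumption.

```python
def gross_bye_gain(p_first_round):
    """Relative gain in expected points from skipping a match won with probability p."""
    return 1.0 / p_first_round - 1.0

def implied_first_round_win(gross_gain):
    """Invert the formula: the first-round win probability behind a quoted gross gain."""
    return 1.0 / (1.0 + gross_gain)

# (gross gain from the bye, net gain in the 28-man simulation), as quoted in the text
quoted = {
    "Del Potro": (0.11, 0.095),
    "Gasquet / Wawrinka": (0.27, 0.24),
    "Ramos": (0.74, 0.70),
}

print(f"bye after a 90% first round: +{gross_bye_gain(0.90):.1%}")
for name, (gross, net) in quoted.items():
    p = implied_first_round_win(gross)
    drag = gross - net  # percentage points lost to tougher later-round opposition
    print(f"{name}: implied first-round win {p:.1%}, drag {drag:.1%}")
```

Working backward, a 27% gross gain implies a first-round win rate a shade under 79%, and a 74% gain implies a bit over 57%, in line with the “about 80%” and “just under 60%” figures.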

The next four seeds: The men seeded five through eight are the losers.  They must play a first-round match–which, in the Estoril example, they each have about a 60% chance of winning–and they are more likely to face one of the top four seeds later on.

The average effect of adding byes to the draw is a 5% decrease in expected ranking points for these lower four seeds.  They aren’t guaranteed to reach the quarterfinal, but in the 28-man version, if they do reach the quarters, they are at least 10% more likely to face a higher-ranked opponent.

The rest of the pack: Nearly everyone else benefits.  The effect of byes touches unseeded players in two ways, which work in opposite directions.  First, and most significantly, no one has to play a top-four seed in the first round.  In Estoril, the toughest first-round opponent was 5th-seed Denis Istomin, not exactly a fearsome name in the locker room.  Because of the byes, nearly every player has at least a 40% chance of reaching the second round.

The countervailing force is a minor one–not enough to neutralize the advantage of missing top seeds in the first round.  When the field shrinks from 32 to 28, the average opponent is a bit better.  If four additional players were added to the Estoril field, they wouldn’t be automatically placed in the positions of the byes.  They would be randomly placed in the draw like everyone else.  Having those four lower-ranked players would give some players even easier first-round matches.

But on balance, for unseeded players, the goal is simply to win a match or two.  The best way to increase their chances of doing so is to keep the best players out of their path for as long as possible.  Byes take care of that.  The net benefit to unseeded players is an addition of 1% to 3% of their expected ranking points.  Generally speaking, the worse the player, the bigger the benefit.

The one exception to this rule is if an unseeded player is actually better than some of the seeds.  According to jrank, Igor Andreev was a better player than 8th-seed Flavio Cipolla going into Estoril.  Thus, the logic that applies to the bottom seeds applies to him.  He was likely to advance to the quarterfinals, so the effect of the byes was mainly to give him a tougher quarterfinal opponent.  In each tournament, this might affect one or two players–in Estoril, Andreev was the only one.

One more consideration: As we’ve seen, 23 of the 28 players benefited from the byes.  And the five players who were negatively affected didn’t lose too much.  How is that possible?

There’s one more group we haven’t talked about: The four players who would have made a 32-man draw but are left out of the 28-man field.  They don’t have much of a chance of reaching the final rounds, but they wouldn’t be much worse than the rest of the unseeded pack.

One of the players I used for this example, Igor Sijsling, just missed the cut, but in a 32-man draw, he would have been expected to take home 23 ranking points and about $9,000.  By adding four byes, the tournament is essentially taking what it would have given to Sijsling and three other players and divvying it up among the remaining 28.  The pie is the same size, but fewer players can claim a slice.

In the end, those four “missing” players are the only real losers, and they always have the option to head to a challenger for the chance of picking up just as many points, even if they probably don’t come with as many dollars.

The winners, beyond the top seeds and the tournament organizers, are ultimately the fans.  When top players have more reason to play small tournaments, we get to watch more high-profile matchups, and ATP 250s look a bit less like Kitzbuhel and a bit more like Doha.