The Myth of the Tricky First Meeting

Today, both Roger Federer and Stan Wawrinka will play opponents they’ve never faced before. In Federer’s case, the challenger is Steve Darcis, a 31-year-old serve-and-volleyer playing in his 22nd Grand Slam event. Wawrinka will face Hyeon Chung, a 19-year-old baseliner in only his second Slam draw.

For all those differences, both Federer and Wawrinka will need to contend with a new opponent–slightly different spins, angles, and playing styles than they’ve seen before.  In the broadcast introduction to each match, we can expect to hear about this from the commentators. Something along the lines of, “No matter what the ranking, it’s never easy to play someone for the first time. He’s probably watched some video, but it’s different being out there on the court.”

All true, as even rec players can attest. But does it matter? After all, both players are facing a new opponent. While Darcis, for example, has surely watched a lot more video of Federer than Roger has of him, isn’t it just as different being out on the court facing Federer for the first time?

Attempting to apply common sense to the cliche will only get us so far. Let’s turn to the numbers.

Math is tricky; these matches aren’t

Usually, when we talk about “tricky first meetings,” we’re referring to these sorts of star-versus-newcomer or star-versus-journeyman battles. When two newcomers or two journeymen face off for the first time, it isn’t so notable. So, looking at data from the last fifteen years, I limited the view to matches between top-ten players and unseeded opponents.

This gives us a pretty hefty sample of nearly 7,000 matches. About 2,000 of those were first meetings. Even though the sample is limited to matches since 2000, I checked 1990s data–including Challengers–to ensure that these “first meetings” really were firsts.

Let’s start with the basics. Top-tenners have won 86.4% of these first meetings. The details of who they’re facing don’t matter too much. Their record when the new opponent is a wild card is almost identical, as is the success rate when the new opponent came through qualifying.

The first-meeting winning percentage is influenced a bit by age. When a top-tenner faces a player under the age of 24 for the first time, he wins 84.6% of matches. Against 24-year-olds and up, the equivalent rate is 88.0%. That jibes with what we’d expect: a newcomer like Chung or Borna Coric is more likely to cause problems for a top player than someone like Darcis or Joao Souza, Novak Djokovic’s first-round victim.

The overall rate of 86.4% doesn’t do justice to guys like Federer. As a top-tenner, Roger has won 95% of his matches against first-time opponents, losing just 8 of 167 meetings. Djokovic, Rafael Nadal, and Andy Murray are all close behind, each within rounding distance of 93%.

By every comparison I could devise, the first-time meeting is the easiest type of match for top players.

The broadest (though approximate) control group consists of matches between top-tenners and unseeded players they have faced before. Favorites won 76.9% of those matches. Federer and Djokovic win 91% of those matches, while Nadal wins 89% and Murray 86%. In all of these comparisons, first-time meetings are more favorable to the high-ranked player.

A more tailored control group involves first-time meetings that had at least one rematch. In those cases, we can look at the winning percentage in the first match and the corresponding rate in the second match, having removed much of the bias from the larger sample.

Against opponents they would face again, top-tenners won their first meetings 85.1% of the time. In their second meeting, that success rate fell to 80.2%. It’s tough to say exactly why that rate went down–in part, it can be explained by underdogs improving their games, or learning something in the first match–but to make a weak version of the argument, it certainly doesn’t provide any evidence that first matches are the tough ones.
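As a rough illustration of that paired comparison, here is a minimal Python sketch. The record layout (`date`, `favorite`, `underdog`, `favorite_won`) and the sample rows are hypothetical, not my actual dataset, and the sketch assumes the same player is the favorite in both meetings, which is a simplification.

```python
from collections import defaultdict


def paired_win_rates(matches):
    """Return (first_meeting_rate, second_meeting_rate) for the favorite,
    counting only pairings that met at least twice.

    Each match is a dict with hypothetical keys: 'date' (ISO string),
    'favorite', 'underdog', and 'favorite_won' (bool).
    """
    by_pair = defaultdict(list)
    # ISO date strings sort chronologically, so each pair's list is in order.
    for m in sorted(matches, key=lambda m: m["date"]):
        by_pair[(m["favorite"], m["underdog"])].append(m["favorite_won"])

    # Keep only pairings with a rematch, so both rates cover the same pairs.
    rematched = [results for results in by_pair.values() if len(results) >= 2]
    if not rematched:
        return None

    n = len(rematched)
    first = sum(results[0] for results in rematched) / n
    second = sum(results[1] for results in rematched) / n
    return first, second


# Made-up sample: two pairings with a rematch, one without.
sample_matches = [
    {"date": "2010-01-01", "favorite": "A", "underdog": "X", "favorite_won": True},
    {"date": "2011-01-01", "favorite": "A", "underdog": "X", "favorite_won": False},
    {"date": "2010-02-01", "favorite": "B", "underdog": "Y", "favorite_won": True},
    {"date": "2012-02-01", "favorite": "B", "underdog": "Y", "favorite_won": True},
    {"date": "2010-03-01", "favorite": "A", "underdog": "Z", "favorite_won": False},
]
print(paired_win_rates(sample_matches))
```

Restricting both rates to the same set of pairings is the point of the exercise: it removes the opponents who were never good enough, or never around long enough, to earn a second meeting.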

It may be true that first matches–no matter the quality of the opponent–feel tricky. It’s possible it takes more time to get used to first-time opponents, and that those underdogs are more likely to take a first set, or at least push it to a tiebreak. That’s a natural thing to think when such a match turns out closer than expected.

Whether or not any of that is true, the end result is the same. Top players appear to be generally immune to whatever trickiness first meetings hold, and they win such contests at a rate higher than any comparable set of matches.

Certainly, Fed fans have little to worry about. Most of his first-meeting losses were against players who would go on to have excellent careers: Mario Ancic, Guillermo Canas, Gilles Simon, Tomas Berdych, and Richard Gasquet.

His last loss facing a new opponent was his three-tiebreak heartbreaker to Nick Kyrgios in Madrid, only his third first-meeting defeat in a decade. As a rising star, Kyrgios fits the pattern of Fed’s previous first-meeting conquerors. Darcis, however, looks like yet another opponent that Federer will find distinctly not tricky.

Will the US Open First-Round Bloodbath Benefit Serena Williams?

After only two days of play, the US Open women’s draw is a shell of its former self.

Ten seeds have been eliminated, only the fifth time in the 32-seed era that the number of first-round upsets has reached double digits. Four of the top ten seeds were among the victims, marking the first time since 1994 that so many top-tenners failed to reach the second round of a Grand Slam.

Things are particularly dramatic in the top half of the draw, where Serena Williams can now reach the final without playing a single top-ten opponent. In a single day of play, my (conservative) forecast of her chances of winning the tournament rose from 42% to 47%, only a small fraction of which owed to her defeat of Vitalia Diatchenko.

However, plenty of obstacles remain. Serena could face Agnieszka Radwanska or Madison Keys in the fourth round, and then Belinda Bencic–the last player to beat her–in the quarters. A possible semifinal opponent is Elina Svitolina, a rising star who took a set from Serena at this year’s Australian Open.

The first-round carnage didn’t include most of the players who have demonstrated they can challenge the top seed. Five of the last six players to beat Serena–Bencic, Petra Kvitova, Simona Halep, Venus Williams, and Garbine Muguruza–are still alive. Only Alize Cornet, the 27th seed who holds an improbable .500 career record against Serena, is out of the picture.

What’s more, early-round bloodbaths haven’t, in the past, cleared the way for favorites. In the 59 majors since 2001, when the number of seeds increased to 32, the number of first-round upsets has had little to do with the likelihood that the top seed goes on to win the tournament.

In 18 of those 59 Slams, four or fewer seeds were upset in the first round. The top seed went on to win five times. In 22 of the 59, five or six seeds were upset in the first round, and the top seed won eight times.

In the remaining 19 Slams, in which seven or more seeds were upset in the first round, the top seed won only five times. Serena has “lost” four of those events, most recently last year’s Wimbledon, when nine seeds fell in their opening matches and Cornet defeated her in the third round.
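To put those three groups on the same scale, the counts can be converted to title rates. The short Python sketch below uses only the figures quoted in this post; the bucket labels and function name are my own.

```python
# (number of Slams, titles won by the top seed), as quoted in the text.
buckets = {
    "four or fewer seeds upset": (18, 5),
    "five or six seeds upset": (22, 8),
    "seven or more seeds upset": (19, 5),
}


def top_seed_title_rates(buckets):
    """Map each bucket label to the share of Slams won by the top seed."""
    return {label: wins / slams for label, (slams, wins) in buckets.items()}


rates = top_seed_title_rates(buckets)
for label, rate in rates.items():
    print(f"{label}: {rate:.1%}")
```

The rates come out to roughly 28%, 36%, and 26%, with no steady rise as the number of upsets increases.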

This is necessarily a small sample, and even setting aside statistical qualms, it doesn’t tell the whole story. While Serena has failed to win four of these carnage-ridden majors, she has won three more of them when she wasn’t the top seed, including the 2012 US Open, when ten seeds lost in the first round and Williams went on to beat Victoria Azarenka in the final.

Taken together, the evidence is decidedly mixed. With the exception of Cornet, the ten defeated seeds aren’t the ones Serena would’ve chosen to remove from her path. While her odds have improved a bit on paper, the path through Keys, Bencic, Svitolina, and Halep or Kvitova in the final is as difficult as any she was likely to face.

The Unalarming Rate of Grand Slam Retirements

Yesterday, Vitalia Diatchenko proved to be even less of a match for Serena Williams than expected. She retired down 6-0, 2-0, winning only 5 of 37 points. She also sparked the usual array of questions about how Grand Slam prize money–$39,500 for first-round losers–incentivizes players to show up and collect a check even if they aren’t physically fit to play.

Diatchenko wasn’t the only player to exit yesterday without finishing a match. Of the 32 men’s matches, six ended in retirement. On the other hand, none of those were nearly as bad. All six injured men played at least two sets, and five of them won a set.

The prominence of Serena’s first-round match, combined with the sheer number of Monday retirements, is sure to keep pundits busy for a few days proposing rule changes. As we’ll see, however, there’s little evidence of a trend, and no need to change the rules.

Men’s slam retirements in context

Before yesterday’s bloodbath, there had been only five first-round retirements in the men’s halves of this year’s Grand Slams. The up-to-date total of 11 retirements is exactly equal to the annual average from 1997-2014 and the same as the number of first-round retirements in 1994.

The number of first-round Slam retirements has trended up slightly over the last 20 years. From 1995 to 2004, an average of ten men bowed out of their first-round matches each year. From 2005 to 2014, the average was 12.2–in large part thanks to the total of 19 first-round retirements last season.

That rise represents an increase in injuries and retirements in general, not a jump in unfit players showing up for Slams. From 1995 to 2004, an average of 8.5 players retired or withdrew from Slam matches after the first round, while in the following ten years, that number rose to 10.8.

Retirements at other tour-level events tell the same story. At non-Slams from 1995-2004, the retirement rate was about 1.3%, and in the following ten years, it rose to approximately 1.8%. (There isn’t much of a difference between first-round and later-round retirements at non-Slams.)

Injury rates in general have risen–exactly what we’d expect from a sport that has become increasingly physical. Based on recent results, we shouldn’t be surprised to see more retirements in best-of-five matches, as most of yesterday’s victims would’ve survived to the end of a best-of-three contest.

Women’s slam retirements

In most seasons, the rate of first-round retirements in women’s Grand Slam draws is barely half of the corresponding rate in other tour events.

In the last ten years, just over 1.2% of Slam entrants have quit their first-round match early. The equivalent rate in later Slam rounds is 1.1%, and the first-round rate at non-Slam tournaments is 2.26%. Diatchenko was the fifth woman to retire in a Slam first round this year, and if one more does so today, the total of six retirements will be exactly in line with the 1.2% average.
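The arithmetic behind that comparison is simple enough to sketch in Python. It assumes the standard 128-player singles draw at each of the four Slams, or 512 entrants per season; the function name is just for illustration.

```python
DRAW_SIZE = 128       # players in a Grand Slam singles draw
SLAMS_PER_YEAR = 4


def first_round_retirement_rate(retirements,
                                draw_size=DRAW_SIZE,
                                slams=SLAMS_PER_YEAR):
    """Share of a season's Slam entrants who retired in the first round."""
    return retirements / (draw_size * slams)


for count in (5, 6):
    rate = first_round_retirement_rate(count)
    print(f"{count} retirements: {rate:.2%}")
```

Six retirements out of 512 entrants works out to about 1.17%, which is how a sixth retirement would put this season in line with the ten-year average of roughly 1.2%.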

One painful anecdote isn’t a trend, and the spotlight of a high-profile match shouldn’t give any more weight to a single data point. Even with the giant checks on offer to first-round losers, players are not showing up unfit to play any more often than they do throughout the rest of the season.

US Open Point-by-Point Stats Recap

As regular readers know, I’m working on a system to track every shot in a tennis match and then generate meaningful data based on the results.  Once I hammer out a few final bugs, I’ll introduce that system publicly.  Then, with my interactive Excel doc–and at least a little bit of practice–you can chart matches as well.

In the meantime, I’ve added another set of tables to each one of the point-by-point recaps.  My system allows (but does not require) the tracking of each shot’s direction, which seems particularly valuable in the case of a tactical baseline matchup like Monday’s final.  Follow the link to the men’s final stats, and then click either of the “shot direction” links.  I’ve broken down each player’s shots into crosscourt, down the middle, down the line, inside-out, and inside-in, then broken down each specific shot type (e.g. “forehand inside-out”) and shown the results of that shot.

At this point, the numbers are little more than a basis for conversation and speculation.  Except for Serena Williams and Victoria Azarenka, I don’t have stats on more than two matches for any individual player.  In time, however, I expect to amass a fair amount of raw data on the top-ranked men and women, and from there, we might really be able to learn something.

In the meantime, here is a list of all the point-by-point stat summaries available from the US Open.




Simpler, Better Keys to the Match

If you watched the US Open or visited its website at any point in the last two weeks, you surely noticed the involvement of IBM.  Logos and banner ads were everywhere, and even usually-reliable news sites made a point of telling us about the company’s cutting-edge analytics.

Particularly difficult to miss were the IBM “Keys to the Match,” three indicators per player per match.  The name and nature of the “keys” strongly imply some kind of predictive power: IBM refers to its tennis offerings as “predictive analytics” and endlessly trumpets its database of 41 million data points.

Yet, as Carl Bialik wrote for the Wall Street Journal, these analytics aren’t so predictive.

It’s common to find that the losing player met more “keys” than the winner did, as was the case in the Djokovic-Wawrinka semifinal.  Even when the winner captured more keys, some of these indicators sound particularly irrelevant, such as “average less than 6.5 points per game serving,” the one key that Rafael Nadal failed to meet in yesterday’s victory.

According to one IBM rep, their team is looking for “unusual” statistics, and in that they succeeded.  But tennis is a simple game, and unless you drill down to components and do insightful work that no one has ever done in tennis analytics, there are only a few stats that matter.  In their quest for the unusual, IBM’s team missed out on the predictive.

IBM vs generic

IBM offered keys for 86 of the 127 men’s matches at the US Open this year.  In 20 of those matches, the loser met as many or more of the keys as the winner did.  On average, the winner of each match met 1.13 more IBM keys than the loser did.

This is IBM’s best performance of the year so far.  At Wimbledon, winners averaged 1.02 more keys than losers, and in 24 matches, the loser met as many or more keys as the winner.  At Roland Garros, the numbers were 0.98 and 21, and at the Australian Open, the numbers were 1.08 and 21.

Without some kind of reference point, it’s tough to know how good or bad these numbers are.  As Carl noted: “Maybe tennis is so difficult to analyze that these keys do better than anyone else could without IBM’s reams of data and complex computer models.”

It’s not that difficult.  In fact, IBM’s millions of data points and scores of “unusual” statistics are complicating what could be very simple.

I tested some basic stats to discover whether there were more straightforward indicators that might outperform IBM’s. (Carl calls them “Sackmann Keys;” I’m going to call them “generic keys.”)  It is remarkable just how easy it was to create a set of generic keys that matched, or even slightly outperformed, IBM’s numbers.

Unsurprisingly, two of the most effective stats are winning percentage on first serves, and winning percentage on second serves.  As I’ll discuss in future posts, these stats–and others–show surprising discontinuities.  That is to say, there is a clear level at which another percentage point or two makes a huge difference in a player’s chances of winning a match.  These measurements are tailor-made for keys.

For a third key, I tried first-serve percentage.  It doesn’t have nearly the same predictive power as the other two statistics, but it has the benefit of no clear correlation with them.  You can have a high first-serve percentage but a low rate of first-serve or second-serve points won, and vice versa.  And contrary to some received wisdom, there does not seem to be some high level of first-serve percentage where more first serves is a bad thing.  It’s not linear, but the more first serves you put in the box, the better your odds of winning.

Put it all together, and we have three generic keys:

  • Winning percentage on first-serve points better than 74%
  • Winning percentage on second-serve points better than 52%
  • First-serve percentage better than 62%
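Counting how many of these keys a player met in a given match takes only a few lines. Here is a minimal Python sketch; the stat names and the sample stat line are hypothetical, while the thresholds are the three listed above.

```python
# Thresholds from the three generic keys; the dictionary keys are
# illustrative names, not fields from any particular data source.
GENERIC_KEYS = {
    "first_serve_points_won": 0.74,
    "second_serve_points_won": 0.52,
    "first_serve_in": 0.62,
}


def keys_met(stats, thresholds=GENERIC_KEYS):
    """Count the keys a player met; a key is met only when the stat is
    strictly better than its threshold."""
    return sum(stats[name] > cutoff for name, cutoff in thresholds.items())


# Made-up stat line: strong first serve, weak second serve.
example_stats = {
    "first_serve_points_won": 0.78,
    "second_serve_points_won": 0.50,
    "first_serve_in": 0.65,
}
print(keys_met(example_stats))
```

Because the thresholds live in one dictionary, swapping in a different set (for clay, say) means passing another `thresholds` argument rather than changing the function.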

These numbers are based on the last few years of ATP results on every surface except for clay.  For simplicity’s sake, I grouped together grass, hard, and indoor hard, even though separating those surfaces might yield slightly more predictive indicators.

For those 86 men’s matches at the Open this year with IBM keys, the generic keys did a little bit better.  Using my indicators–the same three for every player–the loser met as many or more keys 16 times (compared to IBM’s 20) and the winner averaged 1.15 more keys (compared to IBM’s 1.13) than the loser.  Results for other slams (with slightly different thresholds for the different surface at Roland Garros) netted similar numbers.
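Both yardsticks used in that comparison, how often the loser met as many or more keys and the winner's average margin in keys, can be computed from a list of per-match key counts. The Python sketch below uses made-up counts purely to show the calculation; it is not the US Open data.

```python
def evaluate_keys(results):
    """Score a set of keys against match outcomes.

    results: list of (winner_keys, loser_keys) tuples, one per match.
    Returns (matches where the loser met as many or more keys,
             average number of extra keys met by the winner).
    """
    if not results:
        raise ValueError("need at least one match")
    loser_ties_or_leads = sum(loser >= winner for winner, loser in results)
    average_margin = sum(winner - loser for winner, loser in results) / len(results)
    return loser_ties_or_leads, average_margin


# Hypothetical key counts for four matches.
sample_results = [(3, 1), (2, 2), (1, 2), (3, 0)]
print(evaluate_keys(sample_results))
```

A better set of keys pushes the first number down and the second number up, which is the sense in which the generic keys edged out IBM's.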

A smarter planet

It’s no accident that the simplest, most generic possible approach to keys provided better results than IBM’s focus on the complex and unusual.  It also helps that the generic keys are grounded in domain-specific knowledge (however rudimentary), while many of the IBM keys, such as average first serve speeds below a given number of miles per hour, or set lengths measured in minutes, reek of domain ignorance.

Indeed, comments from IBM’s reps suggest that marketing is more important than accuracy.  In Carl’s post, a rep was quoted as saying, “It’s not predictive,” despite the large and brightly-colored announcements to the contrary plastered all over the IBM-powered US Open site.  “Engagement” keeps coming up, even though engaging (and unusual) numbers may have nothing to do with match outcomes, and much of the fan engagement I’ve seen is negative.

Then again, maybe the old saw is correct: It’s all good publicity as long as they spell your name right.  And it’s not hard to spell “IBM.”

Better keys, more insight

Amid such a marketing effort, it’s easy to lose sight of the fact that the idea of match keys is a good one.  Commentators often talk about hitting certain targets, like 70% of first serves in.  Yet to my knowledge, no one had done the research.

With my generic keys as a first step, this path could get a lot more interesting.  While these single numbers are good guides to performance on hard courts, several extensions spring to mind.

Mainly, these numbers could be improved by making player-specific adjustments.  74% of first-serve points is adequate for an average returner, but what about a poor returner like John Isner?  His average first-serve winning percentage this year is nearly 79%, suggesting that he needs to come closer to that number to beat most players.  For other players, perhaps a higher rate of first serves in is crucial for victory.  Or their thresholds vary particularly dramatically based on surface.

In future posts, I’ll delve into more detail regarding these generic keys and investigate ways in which they might be improved.  Outperforming IBM is gratifying, but if our goal is really a “smarter planet,” there is a lot more research to pursue.

Rafael Nadal d. Novak Djokovic: Recap and Detailed Stats

There are a lot of words that can be used to describe Novak Djokovic, but “sloppy” usually isn’t one of them.  Despite plenty of brilliance from the Serbian, he made far too many mistakes to win today.  Of course, the man on the other side of the net, Rafael Nadal, may be the best in the game at forcing his opponent to attempt low-percentage shots out of pure desperation.

This morning, I predicted that, in order to win the match, Nadal would need to serve well, piling up more quick service points than usual, as Djokovic is a master of neutralizing the server’s advantage.  Give him a few shots, and it doesn’t matter who delivered the serve or how well they hit it.

That isn’t what happened.  Nadal won fewer than one in five service points on or before his second shot.  (Djokovic did a little better by that metric, but at 21%, not by much.)  Instead, Rafa won the way Novak usually does: by neutralizing his opponent’s serve.

Rafa won 45% of return points today, a mark he has never before reached against Djokovic on hard courts.  Even more importantly, he won return points at the same rate when Djokovic was serving at 30-30 or later.  Djokovic won what would normally be an impressive number of return points: 38%.  In recent years on hard courts, that was always enough to beat the Spaniard.

It was a different kind of hard-court match today, one that was decided in grueling rallies.  20% of points played today reached at least ten strokes, and Rafa won 59% of them.  Of points that finished more quickly, Djokovic simply gave away too many.  By my unofficial (and rather strict) count, he hit over 60 unforced errors, more than double Nadal’s total.

Too many of those sloppy shots came at crucial moments.  A bad forehand miss on a mid-court sitter gave Nadal set point in the third set, which Rafa converted on the first try.  Serving down a break in the fourth at 1-4, Djokovic quickly went up 30-0, then missed his second shot on three straight points to give Nadal another break point.  At 30-0 in that game, it was possible to imagine Novak clawing his way back.  Once the double break was sealed, the match was over.

Djokovic showed plenty of brilliance, especially in the second and third sets, and contributed to some incredible tennis moments, including ten rallies that exceeded 20 shots.  Indeed, Djokovic converted a break chance by claiming the best of those, a 54-stroke slugfest in the second set (video here).  He didn’t go quietly until that dreadful game at 1-4.

By beating Djokovic at his own game, Nadal solidified his status as the most dominant player on hard courts.  His undefeated record on the surface this year didn’t leave that in much doubt, but it had been three years since he won a hard-court Grand Slam.  Assuming he stays healthy, even Rafa might agree that he heads to Australia as the player to beat.

Here are the complete point-by-point stats from the match.

Here is a complete win-probability graph, as well.

Djokovic-Nadal XXXVII: The (Actual) Keys to the Match

Both Rafael Nadal and Novak Djokovic have had easy routes to the US Open final.  Neither was tested before the semifinals, and neither has yet played a top-eight opponent.  Yet both were pushed further than expected in their last matches.  Djokovic nearly lost in another tough five-setter against Stanislas Wawrinka, and Nadal looked almost human at times, spraying errors in his match with Richard Gasquet.

For all that, the field is down to the final two.  They’ve played 36 times before, with Nadal leading the career matchup 21-15. On hard courts, it is the 18th meeting, with Djokovic leading 11-6.  It is their eleventh encounter in a Grand Slam; Rafa has won seven of the previous ten, and they’ve split their two previous US Open finals.

Based on the most relevant pieces of this head-to-head–the last seven Djokovic-Nadal matches on hard courts, dating back to the 2010 US Open–we can identify some clear trends that tell us what to watch for, and what each player must do to seal the US Open title.

The key: Rafa’s service games

Of these last seven hard-court matches, Nadal has won three and Djokovic has won four.  If we could find some statistical indicators that each player reached when they won and failed to accomplish when they lost, we might be on to something.  Think of it like IBM’s Keys to the Match, but with actual predictive value.

Sure enough, there are plenty of indicators that fit the bill, and they almost all center on Nadal’s serve:

  • In four of the matches, Nadal served fewer than 5% aces.  In the other three, he served at least 7% aces.  He lost all four of the former, and won all three of the latter.
  • In four of the matches, Nadal won fewer than 70% of his first-serve points.  In the other three, he won at least 71%.  He lost all four of the former, and won all three of the latter.
  • In three of the matches, Nadal won fewer than 47% of his second-serve points.  In the other four, he won at least 56%.  He lost all of the former, and won all but one (the 2011 Indian Wells final) of the latter.

We can sum up the importance of Nadal’s service games from a more Djokovic-centered perspective:

  • In three of the matches, Djokovic won no more than 33% of return points.  In the other four, he won at least 37% of return points.  Care to guess which matches he won?
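Each of the indicators above amounts to asking whether a single cutoff cleanly splits one player's wins from his losses. Here is a minimal Python sketch of that check. The ace rates in the example are invented for illustration, not the actual match figures, and the function assumes higher values favor the player in question (for a stat like Djokovic's return points won, viewed from Nadal's side, the comparison would flip).

```python
def separating_threshold(observations):
    """Look for a cutoff that splits wins from losses perfectly.

    observations: list of (stat_value, player_won) pairs, where a higher
    stat_value is assumed to be better for the player.
    Returns a cutoff midway between the best losing value and the worst
    winning value, or None if the wins and losses overlap.
    """
    wins = [value for value, won in observations if won]
    losses = [value for value, won in observations if not won]
    if not wins or not losses:
        return None
    if max(losses) < min(wins):
        return (max(losses) + min(wins)) / 2
    return None


# Invented ace rates for seven matches: four losses, three wins.
ace_rates = [
    (0.030, False), (0.040, False), (0.045, False), (0.020, False),
    (0.070, True), (0.080, True), (0.090, True),
]
print(separating_threshold(ace_rates))

# One loss despite a high value is enough to break the clean split.
print(separating_threshold(ace_rates + [(0.085, False)]))
```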

Djokovic’s service non-indicators

The numbers are not nearly so clear for Djokovic’s service games.  In the two meetings when Novak hit the most aces, Rafa won.  In three of the only four matches when Djokovic made 62% or more of his first serves, Rafa won.  (These are starting to sound like some of the more inane of the IBM keys.)

Generally, winning 65% of first serves is good enough for Novak to beat Nadal, except for last month’s match in Canada, when he won 71% of first serves and lost in a third-set tiebreak.  In Djokovic’s worst second-serve performance of the seven matches, the 2011 US Open final, he barely won 44% of those points, yet won the match.

Of course, this doesn’t mean that Djokovic’s service stats don’t matter.  It’s no accident that Novak’s first-serve percentages were much higher in the three sets he won against Wawrinka than in the two sets he lost.  Rather, Djokovic’s serve just isn’t as potentially dominant as Nadal’s is.

For example, in Saturday’s semifinals, Nadal won 36% of his service points on or before his second shot, while Djokovic won only 24% of his service points that way.  Nadal’s number isn’t staggeringly high (for example, both Kevin Anderson and Marcos Baghdatis topped 40% in that category in their second-round match) but it’s a number he can earn only when serving well.  When he isn’t earning those cheap, quick points against Djokovic, Novak takes away the server’s advantage, threatening to break in almost every service game.

By contrast, Djokovic–like Victoria Azarenka–doesn’t consistently earn that type of advantage on serve.  Sure, he gets some free points that way, but in general, he takes the slight advantage that serving confers and uses that as an edge in a longer rally.  In the semifinal against Wawrinka, his average service point–including aces and unreturnables–lasted more than five shots.

Getting one number for Novak

Individually, Djokovic’s service stats don’t tell us much.  But if we consolidate them into one number–Nadal’s return points won–we get a little better clue of what beating Novak requires.  In the three matches where Nadal failed to win 34% of return points, he lost.  In the two matches where he won at least 42% of return points, he won.

But if you’re counting, you’ve surely noted that I left out two matches.  In Montreal last month, Nadal won only 34.7% of return points, and won.  In the 2011 US Open final, he won 41.7% of return points, yet lost.  Djokovic can be so effective in his own return games–or simply unbeatable when given break point opportunities, like he was that day–that even a masterful return performance like Nadal displayed in that final isn’t always good enough.

So Novak’s numbers just aren’t as indicative as his opponent’s.  Instead, keep your eyes on Rafa’s serve statistics.  Despite the many long, gut-busting rallies we can expect this afternoon, Nadal has this match–like his previous hard-court meetings with the world #1–on his own racquet.