The Post-Covid WTA is Drifting Back to Normal

In the two latest WTA events, we saw a mix of the expected and the unusual. Simona Halep, the heavy favorite in Prague, wound up with the title despite a couple of demanding three-setters in her first two rounds. The week’s other tournament, in Lexington, failed to follow the script. Serena Williams and Aryna Sabalenka, the big hitters at the top and bottom of the bracket, combined for three wins, with four unseeded players making up the semi-final field.

Last week I pointed out that Palermo–the tour’s initial comeback event–was so unpredictable that you would’ve been better off to treat each match as a coin flip than to use pre-layoff player strength ratings (such as Elo) to forecast outcomes. Such an upset-ridden event isn’t unheard of, even in pandemic-free times, but it is suggestive that the WTA rank-and-file haven’t quite returned to their usual form.

Prague and Lexington give us three times as much data to work with. Plus, we might theorize that Prague would be a little more predictable because so many players in that field also took part in the Palermo event, meaning that they have a little more recent match experience. While our sample of 93 main draw matches is still flimsy, it brings us a little closer to understanding how well traditional forecasts will handle this unusual time.

A thorny Brier patch

The metric I’m using to quantify predictability–or to put it another way, the validity of pre-layoff player ratings–is Brier Score, which takes into account both raw accuracy (did the forecast pick the right player to win?) and confidence level (was the forecast too strong, too weak, or just right?). Tour-level Brier Scores are usually in the range of 0.21, while a score of 0.25 means the predictions were no better than coin flips. A lower score represents more accurate predictions.
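For readers who want to see the arithmetic, here is a minimal sketch of a Brier Score calculation. The four forecasts and results are invented for illustration and are not taken from any of the tournaments discussed here.

```python
def brier_score(forecasts, outcomes):
    """Mean squared gap between forecast probabilities and 0/1 results."""
    if len(forecasts) != len(outcomes):
        raise ValueError("need one outcome per forecast")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical sample: the favorite's win probability and whether she won (1) or lost (0).
favorite_probs = [0.80, 0.65, 0.55, 0.70]
favorite_won = [1, 1, 0, 1]

model_score = brier_score(favorite_probs, favorite_won)
coin_flip_score = brier_score([0.5] * 4, favorite_won)  # always exactly 0.25
print(round(model_score, 3), coin_flip_score)
```

Any set of forecasts that scores above the coin-flip line of 0.25 did worse than calling every match 50/50.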

Here are the Brier Scores for Palermo, Lexington, and Prague, along with the average of the three, and the average of all WTA International events (on all surfaces) since 2017. (The scores are based on forecasts generated from my Elo ratings.) We might expect the first round to be different, since players are particularly rusty at that stage, so I’ve also broken out first round (“R32 Brier”) matches for each of the tournaments and averages in the table.

Tournament    Brier  R32 Brier  
Palermo       0.268      0.295  
Lexington     0.226      0.170  
Prague        0.212      0.247  
Comeback Avg  0.235      0.237  
Intl Avg      0.217      0.213

As we saw last week, the Palermo results truly defied expectations. More than half of the matches were upsets (according to my Elo ratings), with a particularly unpredictable first round.

That didn’t last. The Prague first round rated 0.247–just barely better than coin flips–but the messiness didn’t last beyond the first couple of days. The event’s overall Brier Score was 0.212, slightly better than the average WTA International. In other words, this group of 32 women, only recently returned from a months-long break, delivered results that were roughly as predictable as we would expect in the middle of a normal season.

The Lexington numbers are a bit more difficult to make sense of, but like Prague’s, they point to a post-coronavirus world that isn’t all that weird. The opening round closely followed the script, with a Brier Score of 0.170. Of the last 115 WTA International events, only 22 were more predictable. The forecast accuracy didn’t last, in large part because of Serena’s loss at the hands of Shelby Rogers. The rating for the entire tournament was 0.226, less predictable than usual, but much better than random guessing and closer to tour average than to the assumption-questioning Palermo numbers.

Revised estimates

We’re still early in the process of evaluating what to expect from players after the COVID-19 layoff. As more tournaments take place, we can identify whether players become more predictable with more matches under their belts. (Perhaps the Prague participants who skipped Palermo were more difficult to forecast, although Halep is an obvious counterexample.)

At this point, anything is possible. It could be that we will steadily drift back to business as usual. On the other hand, the new social-distancing-oriented rules–with few or no fans on site, nightlife limited to Netflix, players fetching their own towels, and new variations of on-court coaching–might work to the advantage of some women and the disadvantage of others. If that’s the case, Elo ratings will go through a novel period of adjustment as they shift to reflect which players thrive on the post-corona tour.

It’s too early to do much more than speculate about something as significant as that. But in the last week, we’ve seen forecasts go from wildly wrong (in Palermo) to not half bad (in Lexington and Prague). We’ve gained some confidence that for all the things that have obviously changed since March, our approach to player ratings may be one thing that largely remains the same.

Are Tournament Draws Giving Us Suspiciously Many Venus-Serena Clashes?

This week in Lexington, top seed Serena Williams faces her sister, Venus Williams, in the second round. They are both among the all-time greats, and they have played each other nine times in grand slam finals, so it’s always jarring to see them turn up in the same section of a draw and play on a Thursday.

Lately, their encounters seem to always happen long before the business end of a tournament. Their three matches between the 2017 Australian Open final and this week in Lexington all happened in the round of 32, including a planned 2019 Rome meeting from which Serena withdrew. Venus is usually unseeded, no longer the world-beater she once was, so it is at least possible that the Williams sisters would be bracket neighbors in any given week.

But should it happen quite so often? It is an understatement to say that Serena and Venus were not universally embraced upon arrival in the tennis world. If you’re conspiracy minded, every tournament draw is an opportunity to commit dastardly deeds. Perhaps early in the Williams era, it was the work of racist or otherwise misguided tournament officials who wanted to avoid all-Williams finals. Or nowadays, event honchos recognize that Venus is unlikely to reach the final, so they tinker with the bracket to make a headline-grabbing Williams-versus-Williams clash more likely.

I’m sure that most draws are conducted on the up-and-up, but the process is sufficiently opaque that it’s easy to get suspicious. It’s also easy to make mistaken generalizations from insufficient data. Let’s see what the numbers can tell us.

150 tournaments!

Lexington is the 150th tour event with both Serena and Venus in the field.*

* I think. My WTA data isn’t perfect for the early years of their careers, and there was an uncomfortable amount of manual tabulation involved in this post. Their TennisAbstract player pages are missing the 1999 Grand Slam Cup, but I’ve included it in all the numbers here. For the purposes of doing analytics, it doesn’t matter much if the total is 148 or 151, but if you’re printing a banner or making a cake, you should double-check.

Thursday’s match in Lexington will be their 31st, plus one withdrawal apiece. In 13 of the 150 events, the Williams sisters were either the top two seeds or the 3rd and 4th seeds, meaning that draw shenanigans were out of the question–they could not face each other until the final. 4 of those 13 times, that’s exactly what they did.

What are the odds?*

* Of me being able to use this sub-heading in any given blog post?

I went through the remaining 137 tournaments and identified the round in which they either did meet or could have met. For the purposes of analyzing draws, there isn’t really a difference. For instance, Serena and Venus have landed in the same half 73 out of a possible 137 times, a bit more than the 68 or 69 times that we would expect.

Because of their seeds, they had the chance of ending up in the same quarter 116 times, and that’s how it worked out 28 times, just under the 29 times that an exact one-in-four rate would’ve given them. The smaller the draw section, the fewer the tournaments in which Serena’s and Venus’s seeds made it possible for them to land in it together.

I counted the number of tournaments with a possible meeting on or before a certain round, and then the number of events in which the draw delivered that meeting, regardless of whether both Williamses got that far. Here are the results, along with the probability of that many or more actual meetings:

Section  Possible  Actual  Chance  
Half          137      73     25%  
Quarter       116      28     62%  
Eighth         85      17      3%  
16th           64       5     37%  
32nd           42       1     74%

There’s a one-in-four chance that Serena and Venus would’ve landed in the same half as many times as they have throughout their entire careers. That’s a bit of bad luck, but it’s hardly a smoking gun. The same is true for the same quarters, as well as very early meetings that would pit them against each other in the round of 32 or 64.
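The "Chance" column can be approximated with a plain binomial tail. The sketch below assumes every eligible draw is an independent trial with probability 1/2, 1/4, 1/8, and so on of putting the sisters in the same section. That is a simplification that ignores how seeding constrains real brackets, and it may not be exactly how the table was built, but it lands on the same percentages.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# (section, possible draws, actual same-section draws, assumed per-draw chance)
sections = [
    ("Half", 137, 73, 1 / 2),
    ("Quarter", 116, 28, 1 / 4),
    ("Eighth", 85, 17, 1 / 8),
    ("16th", 64, 5, 1 / 16),
    ("32nd", 42, 1, 1 / 32),
]

for name, possible, actual, p in sections:
    print(f"{name:8s} {possible:4d} {actual:3d} {prob_at_least(actual, possible, p):6.1%}")
```

The same function handles small subsets too: `prob_at_least(5, 14, 1 / 8)` gives the chance of five or more shared eighths in fourteen eligible draws.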

That leaves one eyebrow-raising number to discuss. On 85 occasions, at least one of the two women was seeded outside the top eight, making possible a meeting in the round of 16 or earlier. Given random draws, we’d expect 10 or 11 brackets in which they could face each other so early. Instead, we got 17.

A 3% chance of so many early encounters isn’t quite as bad as it sounds. I’ve tried to walk you through this process in the way I approached it. While I wondered if Serena and Venus have met more often than random draws would normally deliver, I didn’t have a particular round in mind. As you’ve seen, I generated a bunch of numbers, and one of the five looked suspicious. You might be able to construct a story that explains why the round of 16 is different from the others (such as my theory that tournament directors want mid-week headlines), but because we generated so many numbers, we were that much more likely to end up with an extreme percentage simply by chance.
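To put a rough number on that caveat, here is a back-of-the-envelope sketch. It assumes the five tests are independent, which they are not (the draw sections are nested inside one another), so treat the result as a ballpark figure rather than a proper correction.

```python
def chance_of_any_extreme(threshold, n_tests):
    """Chance that at least one of n independent tests comes in at or below the threshold."""
    return 1 - (1 - threshold) ** n_tests

# Five draw sections checked, one of which came in at about 3%.
print(f"{chance_of_any_extreme(0.03, 5):.1%}")
```

Under that independence assumption, seeing at least one result this extreme among five looks is closer to a one-in-seven event than a one-in-thirty one.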

The smoking (nerf) gun

Thus, we’re able to raise the possibility that some draws weren’t random, but we can hardly prove it. One problem–one that we could’ve foreseen from the get-go–is that some draws are definitely not tampered with. Probably most draws. And even if they were, most tournaments wouldn’t have any reason to mess with Serena’s or Venus’s placement in the bracket. Or if they did, they might prefer an all-Williams final, and thus alter the bracket in the opposite direction of what we’re hunting for.

If you like conspiracy hunting, I’ve got a tiny sample for you. Since the beginning of 2018, Venus and Serena have played in the same tournament 15 times, and their seedings (or lack thereof) made it possible for them to be drawn in the same eighth 14 of those times. Of the 14, they were placed in position for a round-of-16 or earlier meeting 5 times. There’s only a 2% chance of that … if you set aside the fact that I’m checking all sorts of subsets of matches looking for (probably spurious) patterns. If nothing else, the 5-of-14 figure explains why it seems like Serena and Venus keep landing in the same draw sections lately. They do!

Broadly speaking, then, this is all much ado about nothing. (I don’t even know if these conspiracy theorists exist, so maybe I just invented a conspiracy and spent my evening debunking it. Hooray?) It’s possible that a few tournament directors are producing non-random draws … but it would take a very different kind of investigative work to prove it. Worst case scenario, we get a few more Serena-Venus matches. It may not be fair to the older sister, but it’s a pretty good deal for tennis fans.

Did Palermo Show the Signs of a Five-Month Pandemic Layoff?

Are tennis players tougher to predict when they haven’t played an official match for almost half a year? Last week’s WTA return-to-(sort-of)-normal in Palermo gave us a glimpse into that question. In a post last week I speculated that results would be tougher than usual to forecast for a while, necessitating some tweaks to my Elo algorithm. The 31 main draw matches from Sicily allow us to run some preliminary tests.

At first glance, the results look a bit surprising. Only two of the eight seeds reached the semifinals, and the ultimate champion was the unseeded Fiona Ferro. Two wild cards reached the quarters. Is that notably weird for a WTA International-level event? It doesn’t seem that strange, so let’s establish a baseline.

Palermo the unpredictable

My go-to metric for “predictability” is Brier Score, which measures the accuracy of percentage forecasts. It’s nice to pick the winner, but it’s more important to assign the right level of probability. If you say that 100 matches are all 60/40 propositions, your favorites should win 60 of the 100 matches. If they win 90, you weren’t nearly confident enough; if they win 50, you would’ve been better off flipping a coin. Brier Score encapsulates those notions into a single number, the lower the better. Roughly speaking, my Elo forecasts for ATP and WTA matches hover a bit above 0.2.

From 2017 through March 2020, the 975 completed matches at clay-court WTA International events had a collective Brier Score of 0.223. First round matches were a tiny bit more predictable, with R32’s scoring 0.219.

Palermo was a roller-coaster by comparison. The 31 main-draw matches combined for a Brier Score of 0.268. Of the 32 other events I considered, only last year’s Prague tourney was higher, generating a 0.277 mark.

The first round was more unpredictable still, at 0.295. On the other hand, the combination of a smaller per-event sample and the wide variety of first-round fields means that several tournaments were wilder for the first few days. 9 of the 32 others had a first-round Brier Score above 0.250, with four of them scoring higher–that is, worse–than Palermo did.

The Brier Score of shame

I mentioned the 0.250 mark because it is a sort of Brier Score of shame. Let’s say you’re predicting the outcome of a series of coin flips. The smart pick is 50/50 every time. It’s boring, but forecasting something more extreme just means you’re even more wrong half the time. If you set your forecast at 50% for a series of random events with a 50/50 chance of occurring, your Brier Score will be … 0.250.

Another way to put it is this: If your Brier Score is higher than 0.250, you would’ve been better off predicting that every match was 50/50. All the fancy forecasting went to waste.
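The coin-flip arithmetic is short enough to check directly. This sketch computes the expected Brier Score of a fixed forecast when the true chance of the event is known. The function name and the example forecasts are mine, chosen only for illustration.

```python
def expected_brier(forecast, true_prob):
    """Expected squared error of a fixed forecast for an event with the given true probability."""
    return true_prob * (forecast - 1) ** 2 + (1 - true_prob) * forecast**2

# For a true coin flip, any forecast other than 50% scores worse than 0.250.
for forecast in (0.5, 0.6, 0.7, 0.9):
    print(forecast, round(expected_brier(forecast, 0.5), 3))
```

For a genuine 50/50 event the expression simplifies to 0.25 + (forecast - 0.5)^2, which is why 0.250 is the floor for coin flips and why bolder forecasts only add to the score.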

In Palermo, 17 of the 31 matches went the way of the underdog, at least according to my Elo formula. The Brier Scores were on the shameful side of the line. My earlier post–which advocated moderating all forecasts, at least a bit–didn’t go far enough. At least so far, the best course would’ve been to scrap the algorithm entirely and start flipping that coin.

Moderating the moderation

All that said, I’m not quite ready to throw away my Elo ratings. (At the moment, they pick Simona Halep and Aryna Sabalenka, my two favorite players, to win in Prague and Lexington. So there’s that.) 31 matches is a small sample, far from adequate to judge the accuracy of a system designed to predict the outcome of thousands of matches each year. As I mentioned above, Elo failed even worse at Prague last year, but because that tournament didn’t follow several months of global shutdowns, it wouldn’t have even occurred to me to treat it as more than a blip.

This time, a week full of forecast-busting surprises could well be more than a blip. Treating players as if they have exactly the abilities they had in March is probably the wrong way to do things, and it could be a very wrong way of doing things. We’ll triple the size of our sample in the next week, and expand it even more over the next month. It won’t help us pick winners right now, but soon we’ll have a better idea of just how unpredictable the post-COVID-19 tennis world really is.

Did Jimmy Connors Choke in the 1975 Wimbledon Final?

From our vantage point almost a half-century later, it’s easy to forget just how big an upset Arthur Ashe scored with his 1975 Wimbledon victory over Jimmy Connors. Connors was the top seed and defending champion, still riding high from a 1974 campaign that ranks among the best ever. Ashe was a few days short of his 32nd birthday, had a reputation of coming up short in finals, and had lost to Connors in their three previous meetings.

(For what it’s worth, my Elo algorithm thinks it was a much closer match than the bookies did at the time. It rated Ashe the second-best player in the tournament on grass courts, and gave the underdog a 39% chance of winning.)

Ashe ran away with the first two sets and held on to win in four, 6-1 6-1 5-7 6-4. Perhaps because the two men didn’t get along–apart from striking personality differences, Connors and his manager targeted Ashe with one of many lawsuits–the veteran was uncharacteristically critical of his opponent after the match. Ashe claimed that Connors missed many of his shots into the net (rather than long), a sign of choking.

Connors denied it, of course. It later came out that Jimmy was dealing with a foot problem which probably affected his play that day. In any case, fans and pundits surely had their fun debating whether Connors was a choker. I don’t know of anyone who took the question beyond simple speculation. No amount of statistical analysis can settle whether a player choked, but we can often answer adjacent questions to shed more light on the issue.

Counting errors

A couple of years ago I charted the Wimbledon final for the Match Charting Project, so we have a full count of errors–forced and unforced, serves and rallying shots, net and deep–for the entire match. We also have similar shot-by-shot stats for 25 other Connors matches for comparison. (Unfortunately, 24 of the 25 are chronologically later than the Ashe match, because there’s not much full-match footage from the early 70s.)

Here’s the tally: Excluding serves, Connors committed 13 unforced errors, 10 of them into the net. I recorded the type of error for 65 more forced errors: 32 into the net, 33 other. (Ashe was a netrusher, so many of Jimbo’s mistakes were failed passing shots.) On serve, he missed 29 first deliveries: 16 into the net, 13 otherwise. And his two second serve faults were split between one into the net and one elsewhere.

The unforced error split of 10-to-3 means that 77% of his UFEs were netted. That’s the most extreme of any of his charted matches; on average, his unforced errors were half nets, half others. While suggestive, that’s an awfully small sample from which to draw any conclusions.

Using larger samples that include forced errors and serves, the Wimbledon final doesn’t particularly stand out among other charted Connors matches. 54% of his non-serve errors (forced or unforced) in that match were netted, compared to 52% over the whole sample. 55% of his service faults against Ashe were hit into the net, versus 49% across the 26 matches. Altogether, Connors made 54% of his total errors and faults into the net in the Wimbledon final, compared to 51% in the broader sample.
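The percentages for the final follow directly from the tallies listed earlier. Here is a small sketch that recomputes them from those counts. The dictionary layout and the helper name are my own.

```python
# Connors's charted misses in the 1975 final, as (into the net, elsewhere).
tallies = {
    "unforced": (10, 3),
    "forced": (32, 33),
    "first serve faults": (16, 13),
    "second serve faults": (1, 1),
}

def net_share(*kinds):
    """Share of the selected error types that went into the net."""
    net = sum(tallies[kind][0] for kind in kinds)
    total = sum(sum(tallies[kind]) for kind in kinds)
    return net / total

unforced = net_share("unforced")
rally = net_share("unforced", "forced")
serve = net_share("first serve faults", "second serve faults")
overall = net_share(*tallies)
print(f"{unforced:.0%} {rally:.0%} {serve:.0%} {overall:.0%}")
```

Only the unforced-error share rests on a tiny denominator (13 shots); the other three draw on 31 to 109 misses each.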

Does it matter?

You’ve probably heard the tennis coaching conventional wisdom that it’s better to hit long than to hit into the net. Like most tennis shibboleths, this one has been around for a very long time. Ashe had surely heard it, which partly explains why he made the comment he did. Arthur didn’t have a printout with match stats generated by a consulting company with a gargantuan marketing budget, so he probably recalled a few key points and generalized from there.

If error types matter, we’d expect to see at least a mild correlation between results (say, percentage of points won) and error types. Let’s stay focused on the 26 charted Connors matches for today’s purposes. Here’s a version of the Ashe hypothesis, stripped of emotional content:

When Connors hits more errors than usual into the net, it’s a sign that he’s playing below his standard level.

It turns out that this theory is wrong–or, at best, possibly correct if narrowly defined. I considered five main stats as indicators of errors and faults going into the net:

  • Unforced errors (excluding double faults) into the net as a percentage of total unforced errors
  • Total rally errors (forced and unforced) into the net as a percentage of total errors
  • First serve faults into the net as a percentage of total first serve faults
  • All serve faults into the net as a percentage of all serve faults
  • All errors and faults into the net as a percentage of all errors and faults

The second (total rally errors) and last (all errors and faults) seem like the most valid of the five, because they give us a decent sample of error types for each match. There is almost exactly zero correlation between the last stat and total points won. And there is a very weak negative correlation (r^2 = 0.05) between the second stat and total points won.

In other words, the Ashe hypothesis might be on to something very minor if our focus is on rally shots. But the correlation is so weak that no human observer would ever notice it, unless they lucked into it by watching a few confirming key moments after being primed by the conventional wisdom.

He didn’t choke like that

I said above that statistical analysis couldn’t settle issues like whether a player choked. We can study what happened, but without machines hooked up to a player’s brain, we can’t tell what was going on inside their heads that might have caused it.

So we can’t say that Connors didn’t choke in the 1975 Wimbledon final. But we have seen that his percentage of into-the-net errors wasn’t that unusual for him (except for the small sample of unforced errors), and we’ve recognized that the number of mistakes he made into the net didn’t have much to say about his level of play that day. If Connors choked, then, it didn’t have anything to do with the low trajectory of his missed shots.

I learned of Ashe’s post-match comment in Raymond Arsenault’s excellent biography, Arthur Ashe: A Life.

Elo, Meet COVID-19

Tennis is back, and no one knows quite what to expect. Unpredictability is the new normal at both the macro level–will the US Open be a virus-ridden disaster?–and the micro level–which players will come back stronger or weaker? While I plead ignorance on the macro issues, estimating player abilities is more in my line.

Thanks to global shutdowns, every professional player has spent almost five months away from ATP, WTA, and ITF events–“official” tournaments. Some pros, such as those who didn’t play in the few weeks before the shutdowns began, or who are opting not to compete at the first possible opportunity, will have sat out seven or eight months by the time they return to court. Exhibition matches have filled some of the gap, but not for every player.

Half a year is a long time without any official matches. Or, from the analyst’s perspective: It’s tough to predict a player’s performance without any data from the last six months.

Increased uncertainty

Let’s start with the obvious. All this time off means that we know less about each player’s current ability level than we did before the shutdown, back when most pros were competing every week or two. Back in March, my Elo ratings put Dominic Thiem in 5th place, with a rating of ~2050, and David Goffin in 15th, with a rating of ~1900. Those numbers gave Thiem a 70% chance of winning a head-to-head.
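The 70% figure is what the textbook Elo expectation formula produces for a 150-point gap. The sketch below uses that standard logistic formula with the usual 400-point scale. I am assuming, rather than quoting, that the ratings here follow the same convention.

```python
def elo_win_prob(rating_a, rating_b, scale=400):
    """Standard Elo expected score for player A against player B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / scale))

thiem, goffin = 2050, 1900  # approximate March ratings quoted above
print(f"{elo_win_prob(thiem, goffin):.0%}")
```

Shrinking the rating gap is one way to pull a forecast toward 50%: with the same formula, a gap of zero gives exactly a coin flip.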

What about now? Both men have played in exhibitions, but can we be confident that their levels are the same as they were in March? Or that they’ve risen or fallen roughly the same amount? To me, it’s obvious that we can’t be as sure. Whenever our confidence drops, our predictions should move toward the “naive” prediction of a 50/50 coin flip. A six-month coronavirus layoff isn’t that severe, so it doesn’t mean that Thiem is no longer the favorite against Goffin, but it does mean our prediction should be closer to 50% than it was before.

So, 60%? Maybe 65%? Or 69%? I can’t answer that–yet, anyway.

The (injury) layoff penalty

My Elo ratings already incorporate a layoff penalty, which I introduced here. The idea is that if a player misses a substantial amount of time (usually due to injury, but possibly because of suspension, pregnancy, or other reasons), they usually play worse when they come back. But it’s tough to predict how much worse, and players regain their form at different rates.

Thus, the tweak to the rating formula has two components:

  • A one-time penalty based on the amount of time missed (more time off = bigger penalty)
  • A temporarily increased k-factor (the part of the formula that determines how much each match increases or decreases a player’s rating) to account for the initial uncertainty. After an injury, the k-factor increases by a bit more than 50%, and steadily declines back to the typical k-factor over the next 20 matches.
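The second component can be sketched as a simple schedule. Everything specific below is an assumption for illustration: the base k-factor of 32 is a placeholder, the boost is rounded to exactly 50%, and the decline is taken to be linear, since "steadily" does not pin down the real shape.

```python
def post_layoff_k(base_k, matches_since_return, boost=1.5, decay_matches=20):
    """k-factor that starts boosted after a layoff and eases back to base_k.

    Assumes a linear decline over decay_matches matches (hypothetical).
    """
    if matches_since_return >= decay_matches:
        return base_k
    remaining = 1 - matches_since_return / decay_matches
    return base_k * (1 + (boost - 1) * remaining)

BASE_K = 32  # placeholder value, not the one used in the real ratings
for n in (0, 5, 10, 20):
    print(n, post_layoff_k(BASE_K, n))
```

A larger k-factor lets the first few results after a return move a rating further than usual, which is the point of the adjustment.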

Not an injury

A six-month coronavirus layoff is not an injury. (At least, not for players who haven’t lost practice time due to contracting COVID-19 or picking up other maladies.) So the injury-penalty algorithm can’t be applied as-is. But we can take away two ideas from the injury penalty:

  • If we generate those closer-to-50% forecasts by shifting certain players’ ratings downward, the penalty should be less than the injury penalty. (The minimum injury penalty is 100 Elo points for a non-offseason layoff of eight or nine weeks.)
  • The temporarily increased k-factor is a useful tool to handle the type of uncertainty that surrounds a player’s ability level after a layoff.

The injury-penalty framework is useful because it has been validated by data. We can look at hundreds of injury (and other) layoffs in modern tennis history and see how players fared upon return. And the numbers I use in the Elo formula are based on exactly that. We don’t have the same luxury with the last six months, because it is so unprecedented.

Not an offseason, but…

The closest thing we have to a half-year shutdown in existing tennis data is the offseason. The sport’s winter break is much shorter, and it isn’t the same for every player. Yet some of the dynamics are the same: Many players fill their time with exhibitions, others sit on the beach, some let injuries recover, others work particularly hard to improve their games, and so on.

Here’s a theory, then: The first few weeks of each season should be less predictable than average.

Fact check: False! For the years 2010-19, I labeled each match according to how many previous matches the two players had contested that year. If it was both players’ first match, the label was 1. If it was one player’s 15th match and the other’s 21st, the label was the average, 18. Then, I calculated the Brier Score–a measure of prediction accuracy–of the Elo-generated predictions for the matches with each label.
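Here is a rough sketch of that labeling and grouping step. The three sample matches are invented, the function names are mine, and leaving a non-integer average label unrounded is an assumption.

```python
from collections import defaultdict

def match_label(count_a, count_b):
    """Average of the two players' season match counts, this match included."""
    return (count_a + count_b) / 2

def brier_by_label(matches):
    """matches: iterable of (label, forecast, outcome) -> {label: Brier Score}."""
    squared_errors = defaultdict(list)
    for label, forecast, outcome in matches:
        squared_errors[label].append((forecast - outcome) ** 2)
    return {label: sum(errs) / len(errs) for label, errs in squared_errors.items()}

# Hypothetical matches: (label, favorite's forecast, 1 if the favorite won else 0).
sample = [
    (match_label(1, 1), 0.6, 1),
    (match_label(1, 1), 0.7, 0),
    (match_label(15, 21), 0.8, 1),
]
scores = brier_by_label(sample)
print(scores)
```

With real data, each label would pool hundreds of matches, and a moving average across neighboring labels would smooth out the match-to-match noise.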

The lower the Brier Score, the better. If my theory were right, we would see the highest Brier Scores for the first few matches of the season, followed by a decrease. Not exactly!

The jagged blue line shows the Brier Scores for each individual label (match 1, match 2, match 23, etc), while the orange line is a 5-match moving average that aims to represent the overall trend.

There’s not a huge difference throughout the season (which is reassuring), but the early-season trend is the opposite of what I predicted. Maybe the women, with their slightly longer offseason, will make me feel better?

No such luck. Again, the match-to-match variation in prediction accuracy is very small, and there’s no sign of early-season uncertainty.

I will not be denied

Despite disproving my own theory, I still expect to see an unpredictable couple of post-pandemic months. The regular offseason is something that players are accustomed to, and there is conventional wisdom in the game surrounding how to best use that time. And it’s two months, not five to seven. In addition, there are many other things that will make tour life more challenging–or different, at the very least–in 2020, such as limited crowds, social distancing protocols, and scheduling uncertainty. Some players will better handle those challenges than others, but it won’t necessarily be the strongest players who respond the best.

So my Elo ratings will, for the time being, incorporate a small penalty and a temporarily increased k-factor. (Something more like 69% for Thiem-Goffin, not 60%.) I haven’t finished the code yet, in large part because handling the two different types of layoffs–coronavirus and the usual injuries, etc–makes things very complicated. If you’re watching closely, you’ll see some minor tweaks to the numbers before the “Cincinnati” tournament in a few weeks.

There is a right answer

It’s clear from what I’ve written so far that any attempt to adjust Elo ratings for the COVID-19 layoff is a bit of a guessing game. But it won’t always be that way!

By the end of the year, we’ll know the right answer: just how unpredictable results turned out to be in the early going. Just as I’ve calculated penalties and k-factor adjustments for injury layoffs based on historical data, we will be able to do the same with match results from the second half of 2020. To be more precise, we’ll be able to work out a class of right answers, because one adjustment to the Elo formula will give us the best Brier Score, while another will best represent the gap between Novak Djokovic and Rafael Nadal, while others could target different goals.

The ultimate after-the-fact COVID-19 Elo-formula adjustment won’t help you win more money betting on tennis, but it will give us more insight into how the coronavirus layoff affected players after so much time off, and how quickly they returned to pre-layoff form. We’ll understand a little bit more about the game, even if we desperately hope never to have reason to apply the newly-won knowledge.

Tanking: A Model

The logic behind tanking a part of a tennis match–deliberately playing with less than maximum effort–is simple. If you have fallen behind early in the first set, you could choose to take it easy for the rest of the set. You probably would’ve lost the set anyway, and having semi-rested for several games, you’ll have more mental and physical energy to draw upon for the rest of the match.

By the end of this post, we’ll have some idea how useful that extra energy must be to make tanking worthwhile. It will take a few steps to get there.

The scenario

Consider some sample numbers to make this more concrete. Take two evenly matched men, each of whom wins 70% of his service points. Maybe they are powerful–though not one-dimensional–servers on a reasonably fast surface. Winning seven out of ten service points means that nine out of ten games are holds of serve, so in our hypothetical match, breaks are at a premium.
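The jump from 70% of service points to 90% of service games comes from the usual assumption that every point is an independent trial with the same probability. A minimal sketch of that conversion:

```python
def hold_prob(p):
    """Chance the server wins a game if he wins each point with probability p."""
    q = 1 - p
    # Games won to love, to 15, or to 30.
    before_deuce = p**4 * (1 + 4 * q + 10 * q**2)
    # Reach deuce at three points apiece, then win two in a row before the opponent does.
    from_deuce = 20 * p**3 * q**3 * (p**2 / (1 - 2 * p * q))
    return before_deuce + from_deuce

print(f"{hold_prob(0.70):.1%}")
```

Real players are not this tidy, since form and pressure vary from point to point, but the independence assumption is the standard starting place for this kind of model.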

Imagine that the match opens with one of those rare breaks. Given the 90% hold rate for both players, the man who got his nose in front has improved to an 83% chance of winning the set. In the simplest formulation, the player who has fallen behind faces two options for the balance of the set:

  • Continue playing at his usual level despite the low chance of winning, or
  • Take it easy, as the set is probably lost.

The tank

In the continue-as-usual scenario, our early front-runner has an 83% chance of winning the set. If both players continue playing at the same level for the duration of a best-of-five-sets match, that translates to a 62% chance of winning the match, leaving our player who decided not to tank with a 38% chance. (I’m using best-of-five because in a longer match, it’s more likely that a player can recover from losing the first set. That makes tanking a more plausible strategy.)

To evaluate the take-it-easy scenario, we need to pile on more assumptions. How much worse does a tanking player play? You will probably disagree with my estimates of the point-level costs and benefits of tanking, which is fine. I don’t have strong opinions about them, and they don’t matter much to the conclusions below. Consider these numbers just one illustration of the model. As soon as the trailing player decides to take it easy, let’s say his numbers fall to the following:

  • 20% return points won (instead of 30%)
  • 65% serve points won (instead of 70%)

That’s not a very good player–picture an unmotivated Nick Kyrgios. Down a break after the first game and playing a newly lackadaisical brand of tennis, he has a mere 1.3% chance of coming back to win the set. We’re simplifying quite a bit here, in large part because a player could always decide midway through the set to pause the tank, perhaps raising his game if he reaches 15-30 or better on his opponent’s serve. But again, this is just a model, and one I’m trying to keep from getting too complex.

The trade-off

The tanking player has, according to these assumptions, chosen to decrease his chance of winning the first set from 17% to a tick above 1%. If he received no benefit from conserving energy and both players returned to their 90% hold rate at the beginning of the next set, the tanking player’s chances of winning the match have fallen from 38% to 32%.
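The match-level figures quoted so far (62%, 38%, and 32%) can be checked with a few lines of arithmetic. This sketch takes the first-set probabilities from the text as given and assumes every later set is a 50/50 proposition; the function names are hypothetical.

```python
def win_from(sets_needed, opp_sets_needed, p_set):
    """Chance of winning `sets_needed` sets before the opponent wins `opp_sets_needed`."""
    if sets_needed == 0:
        return 1.0
    if opp_sets_needed == 0:
        return 0.0
    return (p_set * win_from(sets_needed - 1, opp_sets_needed, p_set)
            + (1 - p_set) * win_from(sets_needed, opp_sets_needed - 1, p_set))

def match_probability(p_first_set, p_later_sets=0.5):
    """Best-of-five win chance, given the chance of taking the first set."""
    up_a_set = win_from(2, 3, p_later_sets)    # needs two more sets before losing three
    down_a_set = win_from(3, 2, p_later_sets)  # needs three more sets before losing two
    return p_first_set * up_a_set + (1 - p_first_set) * down_a_set

print(round(match_probability(0.83), 3))   # front-runner: 0.624
print(round(match_probability(0.17), 3))   # trailing player, no tank: 0.376
print(round(match_probability(0.013), 3))  # trailing player, tanking: 0.317
```

In other words, with nothing gained from the rest, the tank costs about six percentage points of match-win probability.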

Clearly that’s not the whole story. A player who chooses to conserve energy at the expense of their immediate fortunes must assume that there are benefits coming later.

To further simplify, let’s assume that the tanking player loses the first set. Here are his chances of winning the match based on a few possible post-tank levels he could sustain:

  • 70% serve points won (SPW), 30% return points won (RPW): 31.3% (no benefit from tanking)
  • 71% SPW, 32% RPW: 46.3%
  • 72% SPW, 34% RPW: 61.9%
  • 73% SPW, 36% RPW: 75.8%
  • 74% SPW, 38% RPW: 86.3%
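For readers who want to experiment with their own post-tank levels, here is a rough sketch of a point-to-match model of the kind behind the list above. It is not the code linked at the end of the post. It assumes independent points, a tiebreak at 6-6 in every set (including the fifth), and the same level in every remaining set. All function names are hypothetical.

```python
from functools import lru_cache

def hold_probability(p):
    """Chance of winning a service game when each point is won with probability p."""
    q = 1.0 - p
    return p**4 * (1 + 4 * q + 10 * q**2) + 20 * p**3 * q**3 * p**2 / (p**2 + q**2)

def tiebreak_probability(spw, rpw):
    """Chance of winning a first-to-seven, win-by-two tiebreak."""
    # From 6-6 on, points come in pairs with one serve each; the tiebreak
    # ends when one player takes both points of a pair.
    from_six_all = spw * rpw / (spw * rpw + (1 - spw) * (1 - rpw))

    @lru_cache(maxsize=None)
    def state(a, b):
        if a == 6 and b == 6:
            return from_six_all
        if a == 7:
            return 1.0
        if b == 7:
            return 0.0
        # Serving order: one point, then two each (our player serves first here).
        serving = ((a + b + 1) // 2) % 2 == 0
        p = spw if serving else rpw
        return p * state(a + 1, b) + (1 - p) * state(a, b + 1)

    return state(0, 0)

def set_probability(spw, rpw, games=0, opp_games=0, serving=True):
    """Chance of winning a tiebreak set from the given game score."""
    hold = hold_probability(spw)
    brk = 1.0 - hold_probability(1.0 - rpw)
    tiebreak = tiebreak_probability(spw, rpw)

    @lru_cache(maxsize=None)
    def state(a, b, on_serve):
        if a == 6 and b == 6:
            return tiebreak
        if a >= 6 and a - b >= 2:
            return 1.0
        if b >= 6 and b - a >= 2:
            return 0.0
        p = hold if on_serve else brk
        return p * state(a + 1, b, not on_serve) + (1 - p) * state(a, b + 1, not on_serve)

    return state(games, opp_games, serving)

def match_after_losing_first_set(spw, rpw):
    """Best-of-five win chance from a set down: win three sets before losing two."""
    s = set_probability(spw, rpw)
    return s**3 + 3 * s**3 * (1 - s)

for spw, rpw in [(0.70, 0.30), (0.71, 0.32), (0.72, 0.34), (0.73, 0.36), (0.74, 0.38)]:
    print(f"{spw:.0%} SPW, {rpw:.0%} RPW: {match_after_losing_first_set(spw, rpw):.1%}")
```

The first line should reproduce the 31.3% no-benefit baseline. The others depend on modelling details such as the tiebreak rule, so they may not match the figures above to the decimal. The same `set_probability` function can start mid-set: `set_probability(0.70, 0.30, 0, 1, False)` describes the player who was broken in the opening game.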

Remember that our tanking player had only a 38% chance of winning the match after sustaining the opening-game break, so the second scenario, in which his level improves to 71% SPW and 32% RPW, represents an improvement. A gain that small would be hardly noticeable over the course of three or four more sets. If the remainder of the match spanned 200 more points, split evenly between serve and return, it would mean winning 103 of them instead of 100. If conserving energy early on confers even bigger benefits, it starts to look like a no-brainer.

Complications

Of course, it’s never this simple. The leading player might realize that his opponent was tanking and conserve some energy himself. The tanking player could have a hard time resuming his usual level (or better) at the right moment. Some points are more important than others, so the difference between 100 and 103 might not matter. Most matches are best-of-three, and giving up on the opening set in a shorter match is much more dangerous.

Those qualifications shouldn’t stop us from considering what tanking has to offer. While players don’t tank sets as often as they used to, there’s surely some energy-conservation benefit, and extra energy must have some value for the remainder of the match, right? I have no idea whether that value is equivalent to one point per hundred or something much higher or lower, but surely it’s possible that in some situations, it’s worth it.

The illustration I’ve used shows that the value of the extra energy doesn’t have to be that substantial to make tanking a plausible tactic. The small margins that determine the outcome of tennis matches mean that we’ll rarely recognize when a player is taking advantage of a tank, but those margins also mean that a small edge could be enough to make it worthwhile.

All calculations of game, set, and match probabilities are based on my publicly-available code.

Podcast Episode 83: Is the Practice Court Broken?

Episode 83 of the Tennis Abstract Podcast features co-host Carl Bialik, of the Thirty Love podcast, and guest Jeff McFarland of Hidden Game of Tennis. This week we dip our collective toe into a debate in the tennis coaching world.

With rallies short and aggressive, should players be using practice time differently? What types of skills can still be improved, once a player has reached the top? What tactics can a coach teach their charges, and which ones are too deeply ingrained in the physical nature of hitting the shots? The line between technique and tactics may not be as clear-cut as we think.

Is a 3- or 4-shot rally qualitatively different from a 5- or more-shot rally? How would you teach Madison Keys to retain the positives of her aggressive style while dialing back the aggression a bit? We offer more questions than answers, which seems appropriate for a topic that is far from settled, and is likely to remain controversial for years to come.

Thanks for listening!

(Note: this week’s episode is about 67 minutes long; in some browsers the audio player may display a different length. Sorry about that!)

Click to listen, subscribe on iTunes, or use our feed to get updates on your favorite podcast software.

Who’s the GOAT? Balancing Career and Peak Greatness With Elo Ratings

On this week’s podcast, Carl, Jeff and I briefly discussed where Caroline Wozniacki ranks among Open-era greats. She’s among the top ten measured by weeks at the top of the rankings, but she has won only a single major. By Jeff’s Championship Shares metric, she’s barely in the top 30.

I posed the same question on Twitter, and the hive mind cautiously placed her outside the top 20:

https://twitter.com/tennisabstract/status/1214491560026484737

It’s difficult to compare different sorts of accomplishments–such as weeks at number one, majors won, and other titles–even without trying to adjust for different eras. It’s also challenging to measure different types of careers against each other. For more than a decade, Wozniacki has been a consistent threat near the top of the game, while other players who won more slams did so in a much shorter burst of elite-level play.

Elo to the rescue

How good must a player be before she is considered “great?” I don’t expect everyone to agree on this question, and as we’ll see, a precise consensus isn’t necessary. If we take a look at the current Elo ratings, a very convenient round number presents itself. Seven players rate higher than 2000: Ashleigh Barty, Naomi Osaka, Bianca Andreescu, Simona Halep, Karolina Pliskova, Elina Svitolina, and Petra Kvitova. Aryna Sabalenka just misses.

Another 25 active players have reached an Elo rating of at least 2000 at their peak, from all-time greats such as Serena Williams and Venus Williams down to others who had brief, great-ish spells, such as Alize Cornet and Anastasia Pavlyuchenkova. Since 1977, 88 women finished at least one season with an Elo rating of 2000 or higher, and 60 of them did so at least twice.

(I’m using 1977 because of limitations in the data. I don’t have complete match results–or anything close!–for the early and mid 1970s. Unfortunately, that means we’ll underrate some players who began their careers before 1977, such as Chris Evert, and we’ll severely undervalue the greats of the prior decade, such as Billie Jean King and Margaret Court.)

The resulting list of 60 includes anyone you might consider an elite player from the last 45 years, along with the usual dose of surprises. (Remember Irina Spirlea?) I’ll trot out the full list in a bit.

Measuring magnitude

A year-end Elo rating of 2000 is an impressive achievement. But among greats, that number is a mere qualifying standard. Serena has had years above 2400, and Steffi Graf once cleared the 2500 mark. For each season, we’ll convert the year-end Elo into a “greatness quotient” that is simply the difference between the year-end Elo and our threshold of 2000. Barty finished her 2019 season with a rating of 2123, so her greatness quotient (GQ) is 123.

(Yes, I know it isn’t a quotient. “Greatness difference” doesn’t quite have the same ring.)

To measure a player’s greatness over the course of her career, we simply find the greatness quotient for each season in which she finished at 2000 or higher, and add them together. For Serena, that means a whopping 20 single-season quotients. Wozniacki had nine such seasons, and so far, Barty has two. I’ll have more to say shortly about why I like this approach and what the numbers are telling us.
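The calculation is simple enough to sketch in a few lines. The function name and the input format (a mapping from season to year-end Elo) are assumptions for illustration. Only the 2019 rating of 2123 comes from the text; the other two seasons are made up.

```python
def greatness_quotient(year_end_elos, threshold=2000):
    """Return (qualifying seasons, peak year-end Elo, GQ) for one player."""
    qualifying = [elo for elo in year_end_elos.values() if elo >= threshold]
    gq = sum(elo - threshold for elo in qualifying)
    peak = max(year_end_elos.values())
    return len(qualifying), peak, gq

# Only the 2019 figure (2123) comes from the text; the other seasons are hypothetical.
example = {2017: 1980, 2018: 2050, 2019: 2123}
print(greatness_quotient(example))  # (2, 2123, 173)
```

Seasons below the threshold contribute nothing rather than a negative amount, so a long tail of mediocre years never drags a career total down.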

First, let’s look at the rankings. I’ve shown every player with at least two qualifying seasons. “Seasons” is the number of years with year-end Elos of 2000 or better, and “Peak” is the highest year-end Elo the player achieved:

Rank  Player                     Seasons  Peak    GQ  
1     Steffi Graf                     14  2505  4784  
2     Serena Williams                 20  2448  4569  
3     Martina Navratilova             17  2442  4285  
4     Venus Williams                  14  2394  2888  
5     Chris Evert                     14  2293  2878  
6     Lindsay Davenport               12  2353  2744  
7     Monica Seles                    11  2462  2396  
8     Maria Sharapova                 13  2287  2280  
9     Justine Henin                    9  2411  2237  
10    Martina Hingis                   8  2366  1932  
11    Kim Clijsters                    9  2366  1754  
12    Gabriela Sabatini                9  2271  1560  
13    Arantxa Sanchez Vicario         12  2314  1556  
14    Amelie Mauresmo                  6  2279  1113  
15    Victoria Azarenka                9  2261  1082  
16    Jennifer Capriati                8  2214   929  
17    Jana Novotna                     9  2189   848  
18    Conchita Martinez               11  2191   836  
19    Caroline Wozniacki               9  2189   674  
20    Tracy Austin                     5  2214   647  
                                                      
Rank  Player                     Seasons  Peak    GQ  
21    Mary Pierce                      8  2161   637  
22    Elena Dementieva                 9  2140   629  
23    Simona Halep                     7  2108   562  
24    Svetlana Kuznetsova              6  2136   543  
25    Hana Mandlikova                  6  2160   516  
26    Jelena Jankovic                  4  2178   450  
27    Pam Shriver                      5  2160   431  
28    Vera Zvonareva                   5  2117   414  
29    Agnieszka Radwanska              8  2106   399  
30    Ana Ivanovic                     5  2133   393  
31    Petra Kvitova                    6  2132   346  
32    Na Li                            4  2095   310  
33    Anastasia Myskina                4  2164   290  
34    Anke Huber                       6  2072   277  
35    Mary Joe Fernandez               4  2110   274  
36    Nadia Petrova                    6  2094   265  
37    Dinara Safina                    3  2132   240  
38    Andrea Jaeger                    4  2087   237  
39    Angelique Kerber                 4  2109   224  
40    Nicole Vaidisova                 3  2121   222  
                                                      
Rank  Player                     Seasons  Peak    GQ  
41    Manuela Maleeva Fragniere        6  2059   194  
42    Anna Chakvetadze                 2  2107   174  
43    Ashleigh Barty                   2  2123   162  
44    Helena Sukova                    3  2078   150  
45    Jelena Dokic                     2  2110   142  
46    Iva Majoli                       2  2067   119  
47    Elina Svitolina                  3  2052   108  
48    Garbine Muguruza                 2  2061    98  
49    Zina Garrison                    2  2065    96  
50    Samantha Stosur                  3  2061    92  
51    Daniela Hantuchova               2  2050    80  
52    Irina Spirlea                    2  2064    76  
53    Nathalie Tauziat                 3  2041    73  
54    Patty Schnyder                   2  2057    70  
55    Chanda Rubin                     3  2034    68  
56    Marion Bartoli                   2  2033    66  
57    Sandrine Testud                  2  2041    62  
58    Magdalena Maleeva                2  2024    41  
59    Karolina Pliskova                2  2028    37  
60    Dominika Cibulkova               2  2007     7

You’ll probably find fault with some of the ordering here. While it isn’t the exact list I’d construct, either, my first reaction is that this is an extremely solid result for such a simple algorithm. In general, the players with long peaks are near the top–but only because they were so good for much of that time. A long peak, like that of Conchita Martinez, isn’t an automatic ticket into the top ten.

From the opposite perspective, this method gives plenty of respect to women who were extremely good for shorter periods of time. Both Amelie Mauresmo and Tracy Austin crack the top 20 with six or fewer qualifying seasons, while others with as many years with an Elo of 2000 or higher, such as Manuela Maleeva Fragniere, find themselves much lower on the list.

Steffi, Serena, and the threshold

It’s worth thinking about what exactly the Elo rating threshold of 2000 means. At the simplest level, we’re drawing a line, below which we don’t consider a player at all. (Sorry, Aryna, your time will come!) Less obviously, we’re defining how great seasons compare to one another.

For instance, we’ve seen that Barty’s 2019 GQ was 123. Graf’s 1989 season, with a year-end Elo rating of 2505, gave her a GQ of 505. Our threshold choice of 2000 implies that Graf’s peak season has approximately four times the value of Barty’s. That’s not a natural law. If we changed the threshold to 1900, Barty’s GQ would be 223, compared to Graf’s best of 605. As a result, Steffi’s season would be worth only about 2.7 times as much.

The lower the threshold, the more value we give to longevity and the less value we give to truly outstanding seasons. If we lower the threshold to 1950, Steffi and Serena swap places at the top of the list. (Either way, it’s close.) Even though Williams had one of the highest peaks in tennis history, it’s her longevity that truly sets her apart.
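A quick way to see both effects is to redo the arithmetic at different thresholds. The sketch below uses only numbers quoted in this post. The career-level check is a rough lower bound, since it ignores any seasons between 1950 and 2000 that would newly qualify. The helper name `shifted_gq` is hypothetical.

```python
GRAF_1989, BARTY_2019 = 2505, 2123  # year-end Elo ratings quoted in the text

for threshold in (2000, 1900):
    ratio = (GRAF_1989 - threshold) / (BARTY_2019 - threshold)
    print(threshold, round(ratio, 1))  # 2000 -> 4.1, then 1900 -> 2.7

def shifted_gq(gq, seasons, drop=50):
    """Career GQ after lowering the threshold by `drop`, counting only the
    seasons that already qualified (newly qualifying seasons are ignored)."""
    return gq + seasons * drop

print(shifted_gq(4784, 14))  # Graf: 5484
print(shifted_gq(4569, 20))  # Serena: 5569
```

Because Serena has six more qualifying seasons, each 50-point drop in the threshold is worth at least 300 more points to her than to Graf, which is consistent with the swap at 1950.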

I don’t want to get hung up on whether Serena or Steffi should be at the top of this list–it’s not a precise measurement, so as far as I’m concerned, it’s basically a tie. (And that’s without even raising the issue of era differences.) I also don’t want to tweak the parameters just to get a result or two to look different.

Ranking Woz

I began this post with a question about Caroline Wozniacki. As we’ve seen, greatness quotient places her 19th among players since 1977–almost exactly halfway between her position on the weeks-at-number-one list and her standing on the title-oriented Championship Shares table.

If we had better data for the first decade of the Open era, Wozniacki and many others would see their rankings fall by at least a few spots. King, Court, and Evonne Goolagong Cawley would knock her into the 20s. Virginia Wade might claim a slot in the top 20 as well. We can quibble about the exact result, but we’ve nailed down a plausible range for the 2018 Australian Open champion.

One-number solutions like this aren’t perfect, in part because they depend on assumptions like the Elo threshold discussed above. Just because they give us authoritative-looking lists doesn’t mean they are the final word.

On the other hand, they offer an enormous benefit, allowing us to get around the unresolvable minor debates about the level of competition when she reached number one, the luck of the draw at grand slams she won and lost, the impact of her scheduling on ranking, and so on. By building a rating based on every opponent and match result, Elo incorporates all this data. When ranking all-time greats, many fans already rely too much on one single number: the career slam count. Greatness quotient is a whole lot better than that.

Are American Players Screwed Once You Drag Them Into a Rally?

Long after retiring from tennis, Marat Safin remains quotable. The Russian captain at the ATP Cup had this to say to his charge, Karen Khachanov, during a match against Taylor Fritz:

This isn’t exactly testable. I don’t know how you’d quantify “shock-and-awe,” or how to identify–let alone measure–attempts to scare one’s opponent. Or screwed-ness, for that matter. But if we take “screwed” to mean the same as “not very likely to win,” we’ve got something we can check.

Many fans would agree with the general claim that American men tend to have big serves, aggressive game styles, and not a whole lot of subtlety. Certainly John Isner fits that mold, and Sam Querrey doesn’t deviate much from it. While Fritz is a big hitter who racks up his share of aces and second-shot putaways, his style isn’t so one-dimensional.

Taylor Fritz: not screwed

Using data from the Match Charting Project, I calculated some rally-length stats for the 70 men with at least 20 charted matches in the last decade. That includes five Americans (Fritz, Isner, Querrey, Steve Johnson, and Jack Sock) and most of the other guys we think of as ATP tour regulars.

Safin’s implied definition is that rallies of four shots or fewer are “shock-and-awe” territory, points that are won or lost within either player’s first two shots. Longer rallies are, supposedly, the points where the Americans lose the edge.

That is certainly the case for Isner. He wins only 40% of points when the rally reaches a fifth shot, by far the worst of these tour regulars. Compared to Isner, even Nick Kyrgios (44%) and Ivo Karlovic (45%) look respectable. The range of winning percentages extends as high as 56%, the mark held by Nikoloz Basilashvili. Rafael Nadal is, unsurprisingly, right behind him in second place at 54%, a whisker ahead of Novak Djokovic.

Fritz, at 50.2%, ranks 28th out of 70, roughly equal to the likes of Gael Monfils, Roberto Bautista Agut, and Dominic Thiem. Best of all–if you’re a contrarian like me, anyway–is that Fritz is 16 places higher on the list than Khachanov, who wins 48.5% of points that last five shots or more.

More data

Here are 20 of the 70 players, including some from the top and bottom of the list, along with all the Americans and some other characters of interest. I’ve calculated each player’s percentage of points won for 1- or 2-shot rallies (serve and return winners), 3- or 4-shot rallies (serve- and return-plus-one points), and 5- or more-shot rallies. They are ranked by the 5- or more-shot column:

Rank  Player                 1-2 W%  3-4 W%  5+ W%  
1     Nikoloz Basilashvili    43.7%   54.1%  55.8%  
2     Rafael Nadal            52.7%   51.6%  54.3%  
3     Novak Djokovic          51.8%   54.6%  54.0%  
4     Kei Nishikori           45.5%   51.2%  53.9%  
11    Roger Federer           52.9%   54.9%  52.1%  
22    Philipp Kohlschreiber   50.1%   50.1%  50.7%  
28    Taylor Fritz            51.1%   47.2%  50.2%  
30    Jack Sock               49.0%   46.5%  50.2%  
31    Alexander Zverev        52.8%   50.3%  50.0%  
32    Juan Martin del Potro   53.8%   49.1%  50.0%  
34    Andy Murray             54.3%   49.5%  49.4%  
39    Daniil Medvedev         53.9%   50.4%  49.0%  
43    Stefanos Tsitsipas      51.4%   50.5%  48.6%  
44    Karen Khachanov         53.7%   48.1%  48.5%  
48    Steve Johnson           49.2%   48.8%  48.3%  
61    Sam Querrey             53.5%   48.0%  46.2%  
62    Matteo Berrettini       53.6%   49.3%  46.1%  
66    Ivo Karlovic            51.8%   43.9%  44.9%  
68    Nick Kyrgios            54.6%   47.4%  44.2%  
70    John Isner              52.3%   48.3%  40.2%
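The three percentage columns come from grouping each player’s points by rally length. Here is a minimal sketch of that grouping, assuming each point has already been reduced to a (rally length, won-or-lost) pair. The Match Charting Project’s raw files are shot-by-shot and would need parsing first, and the function names here are hypothetical.

```python
from collections import defaultdict

def bucket(rally_length):
    """Map a rally length (in shots) to one of the three table columns."""
    if rally_length <= 2:
        return "1-2"
    if rally_length <= 4:
        return "3-4"
    return "5+"

def win_rates(points):
    """points: iterable of (rally_length, won) pairs for one player."""
    won = defaultdict(int)
    total = defaultdict(int)
    for rally_length, player_won in points:
        key = bucket(rally_length)
        total[key] += 1
        won[key] += int(player_won)
    return {key: won[key] / total[key] for key in total}

# Made-up points, purely to show the shape of the input.
sample = [(1, True), (2, False), (3, True), (4, True), (7, False), (9, True)]
print(win_rates(sample))  # {'1-2': 0.5, '3-4': 1.0, '5+': 0.5}
```

How to treat zero-shot points such as double faults is a judgment call; this sketch simply lumps them into the shortest bucket.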

Fritz is one of the few players who win more than half of the shortest rallies and more than half of the longest ones. The first category can be the result of a strong serve, as is probably the case with Fritz, and is definitely the case with Isner. But you don’t have to have a big serve to win more than half of the 1- or 2-shot points. Nadal and Djokovic do well in that category (like they do in virtually all categories) in large part because they negate the advantage of their opponents’ serves.

Shifting focus from the Americans for a moment, you might be surprised by the players with winning percentages above 50% in all three categories. Nadal, Djokovic, and Roger Federer all make the cut, each with plenty of room to spare. The remaining two are the unexpected ones. Philipp Kohlschreiber is just barely better than neutral in both classes of short points, and a bit better than that (50.7%) on long ones. And Alexander Zverev qualifies by the skin of his teeth, winning very slightly more than half of his long rallies. (Yes, that 50.0% is rounded down, not up.) Match Charting Project data is far from complete, so it’s possible that with a different sample, one or both of the Germans would fall below the 50% mark, but the numbers for both are based on sizable datasets.

Back to Fritz, Isner, and company. Safin may be right that the Americans want to scare you with a couple of big shots. Isner has certainly intimidated his share of opponents with the serve alone. Yet Fritz, the player who prompted the comment, is more well-rounded than the Russian captain gave him credit for. Khachanov won the match on Sunday, and at least at this stage in their careers, the Russian is the better player. But not on longer rallies. Based on our broader look at the data, it’s Khachanov who should try to avoid getting dragged into long exchanges, not Fritz.

Podcast Episode 82: ATP Cup and WTA Season Preview

Episode 82 of the Tennis Abstract Podcast tests out a new format for the new year, featuring co-host Carl Bialik, of the Thirty Love podcast, and guest Jeff McFarland of Hidden Game of Tennis.

The three of us dig into the new ATP Cup, considering whether the format is appealing to players and fans, how we should feel about odd matchups between players hundreds of ranking places apart, and–most importantly–what captains should be doing with the stats available to them.

We also look at the top of the WTA ranking table, considering whether Ashleigh Barty will continue her reign for another twelve months, or if Bianca Andreescu–or Karolina Pliskova–will topple her. We also debate where Caroline Wozniacki stands among Open-era greats, as one of the few women to hang on to the number one ranking for more than a full year.

Thanks for listening!

(Note: this week’s episode is about 66 minutes long; in some browsers the audio player may display a different length. Sorry about that!)

Click to listen, subscribe on iTunes, or use our feed to get updates on your favorite podcast software.