The fledgling field of tennis analytics can seem rather chaotic, with scores of mini-studies that don’t fit together in any obvious way. Some seem important but unfinished while others are entertaining but trivial.
Let me try to impose some structure on this project by classifying research topics into what I’ll call the Five Big Questions, each of which is really just an umbrella for hundreds more (like these). As we’ll see, there are really six categories, not five, which just goes to show: analytics is about more than just counting.
1. What’s the long-term forecast?
Beyond the realm of the next few tournaments, what does the evidence tell us about the future? This question encompasses everything from seasons to entire careers. What are the odds that Roger Federer reclaims the No. 1 ranking? How many Grand Slams will Nick Kyrgios win? How soon will Catherine Bellis crack the top ten?
The most important questions in this category are the hardest ones to answer: Given the limited data we have on junior players, what can we predict–and with what level of confidence–about their future? These are questions that national federations would love to answer, but they are far from the only stakeholders. Everyone from sponsors to tournaments to the players’ families themselves has an interest in picking future stars. Further, the better we can answer these questions, the more prepared we can be for the natural follow-ups. What can we (as families, coaches, federations, etc.) do to improve the odds that a player succeeds?
2. Who will win the next match?
The second question is also concerned with forecasting, and it is the subject that has received–by far–the most analytical attention. Not only is it fun and engaging to try to pick winners, there’s an enormous global industry with billions of dollars at stake trying to make more accurate forecasts.
As an analyst, I’m not terribly interested in picking winners for the sake of picking winners. More valuable is the quest to identify all of the factors that influence match outcomes, like the role of fatigue, or a player’s preference for certain conditions, or the specifics of a given matchup. Player rating systems fall into this category, and it’s important to remember they are only a tool for forecasting, not an end in themselves.
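To make the "tool for forecasting" point concrete, here is a minimal sketch of how a rating gap turns into a match forecast. It assumes the standard Elo logistic curve with a 400-point scale; the function name and the ratings in the usage note are purely illustrative, and a real tennis rating system may well use a different formula.

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Chance that player A beats player B under the standard Elo curve.

    Assumption: the conventional 400-point logistic scale. Only the
    rating difference matters, not the absolute level of either player.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
```

With this curve, two equally rated players are each given a 50% chance, and a 100-point edge is worth roughly a 64% chance of winning. The rating itself says nothing about fatigue, conditions, or matchups; those would have to be layered on top.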
As a meta-question in this category, one might ask how accurate a set of forecasts could possibly become. Or, posed differently, how big a role does chance play in match outcomes?
3. When and why does the i.i.d. model break down?
A lot of sports analysis depends on the assumption that events are “independent and identically distributed”–i.e., that factors like streakiness, momentum, and clutch are either nonexistent or impossible to measure. In tennis terms, the i.i.d. model might assume that a player converts break points at the same rate that she wins all ad-court points, or that a player holds serve while serving for the set just as often as he holds serve in general.
The conventional wisdom strongly disagrees, but it is rarely consistent. (“It’s hard to serve for the set” but “this player is particularly good when leading.”) This boils down to yet another set of forecasting questions. We might know that a player wins 65% of service points, but what are her chances of winning this point, given the context?
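As a baseline for that kind of comparison, the i.i.d. model lets us turn a single service-point win rate into a hold-of-serve probability. The sketch below assumes every point in a standard advantage game is won by the server with the same probability `p`, independent of the score; the function name is my own, and the formula is the usual textbook result rather than anything fitted to real match data.

```python
def hold_probability(p: float) -> float:
    """Chance the server wins a standard (advantage) game under the i.i.d. model.

    p is the probability of winning any single service point, assumed
    constant and independent of the score.
    """
    q = 1.0 - p
    # Win before deuce: to love (1 path), to 15 (4 paths), or to 30 (10 paths).
    before_deuce = p**4 * (1 + 4 * q + 10 * q**2)
    # Reach deuce: 20 orderings of three points each in the first six.
    reach_deuce = 20 * p**3 * q**3
    # From deuce, the server must win two straight points before the returner does.
    win_from_deuce = p**2 / (p**2 + q**2)
    return before_deuce + reach_deuce * win_from_deuce
```

For a server who wins 65% of service points, this works out to a hold rate of roughly 83%. If the same player holds noticeably less often when serving for the set, that gap is exactly the kind of departure from i.i.d. that this question is about.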
I suspect that thorough analysis will reveal plenty of small discrepancies between reality and the i.i.d. model, especially at the level of individual players. More than with the first two topics, the limited sample sizes for many specific contexts mean we must always be careful to distinguish actual effects from noise and look for long-term trends.
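One rough way to separate an effect from noise is to ask how surprising an observed count would be if the baseline rate still applied. The sketch below is an exact two-sided binomial test written with only the standard library; the function name is hypothetical, and it assumes the points in the sample are themselves independent, which is the very thing under investigation, so treat the result as a sanity check rather than a verdict.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, baseline: float) -> float:
    """Exact two-sided p-value for `successes` out of `trials`.

    Sums the probability of every outcome that is no more likely than the
    observed one, assuming each trial succeeds at the `baseline` rate.
    """
    pmf = [
        comb(trials, k) * baseline**k * (1 - baseline) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small tolerance so that ties are not lost to floating-point rounding.
    tail = sum(prob for prob in pmf if prob <= observed * (1 + 1e-9))
    return min(1.0, tail)
```

A call such as `binomial_two_sided_p(12, 40, 0.40)` would ask whether converting 12 of 40 break points is unusual for a player who wins 40% of comparable points (made-up numbers). With samples that small, even fairly large gaps often turn out to be compatible with chance.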
4. How good is that shot?
As more tennis data becomes available in a variety of formats, the focus of tennis analytics will become more granular. The Match Charting Project offers more than 3,000 matches’ worth of shot-by-shot logs. Even without the details of each shot–like court position, speed, and spin–we can start measuring the effectiveness of specific players’ shots, such as Federer’s backhand.
With more granular data on every shot, analysts will be able to be even more precise. Eventually we may know the effect of adding five miles per hour to your average forehand speed, or the value of hitting a shot from just inside the baseline instead of just behind. Some academics–notably Stephanie Kovalchik–have begun digging into this sort of data, and the future of this subfield will depend a great deal on whether these datasets ever become available to the public.
5. How effective is that tactic?
Analyzing a single shot has its limits. Aside from the serve, every shot in tennis has a context–and even serves usually form part of the backdrop for other shots. Many of the most basic tactical questions have yet to be quantified, such as the success rate of approaching to the backhand instead of the forehand.
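The most basic version of that approach-shot question is just a tally. Here is a minimal sketch, assuming each net approach has already been reduced to a (direction, point won) pair; that record format and the sample data are invented for illustration and are not the Match Charting Project's actual format.

```python
from collections import defaultdict


def approach_success_rates(points):
    """Return {direction: share of points won} from (direction, won) records."""
    tallies = defaultdict(lambda: [0, 0])  # direction -> [points won, points played]
    for direction, won in points:
        tallies[direction][0] += int(won)
        tallies[direction][1] += 1
    return {d: won / played for d, (won, played) in tallies.items()}


# Made-up records, for illustration only.
sample = [
    ("backhand", True), ("backhand", True), ("backhand", False),
    ("forehand", True), ("forehand", False),
]
rates = approach_success_rates(sample)
```

A raw comparison like this ignores context, such as who the opponent is and how good the approach shot was, which is why the tactical questions get complicated so quickly.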
As with the previous topic, the questions about tactics get a lot more interesting–and immensely more complicated–as soon as Hawk-Eye-type data is available. With enough location, speed, and spin data, we’ll be able to measure the positions from which approach shots are most successful, and the type (and direction) that is most effective from each position. We could quantify the costs and benefits of running around a forehand: How good does the forehand have to be to counteract the weaker court position that results?
We can scratch the surface of this subject with the Match Charting Project, but ultimately, this territory belongs to those with camera tracking data.
6. What is the ideal structure of the sport?
Like I said, there are really just five questions. Forecasting careers, matches, and points, and quantifying shots and tactics encompass, for me, the entire range of “tennis analytics.”
However, there are plenty of tennis-related questions that we might assign to the larger field of “business of sports.” How should prize money be distributed? What is the best way to structure the tour to balance the interests of veterans and newcomers? Are there too many top-level tournaments, or too few? What the hell should we do with Davis Cup, anyway?
Many of these issues are–for now–philosophical questions that boil down to preferences and gut instincts. Controlled experiments will always be difficult if only because of the time frames involved: If we change the Davis Cup format and it loses popularity, is it causation or just correlation? We can’t replicate the experiment. But despite the challenges, these are major questions, and analysts may be able to offer valuable insights.
Now … let’s get to work.