blog record — Mar 2026
AI Tennis Coach
Match film in, rally trace + coaching report out. Catches what I miss watching live.
I forget what happened
I lose to the same patterns over and over and never catch it live. By the time I sit down, the next server is bouncing the ball. Three points later I couldn't tell you if I lost the last one on a forehand into the net or a backhand long.
Film review doesn't fix it either. Recall bias toward winners: I remember the one clean down-the-line winner, not the seven forehands I shanked into the fence at 4-4. No counting: even watching carefully, I can't tell you my average rally length or what fraction of my UEs came on the 5th-plus shot. The numbers are in the footage. My brain just doesn't do math.
So I built a thing that does the counting, then hands the numbers to Claude to turn into a coaching writeup. That's the whole project.
How it works
The pipeline is intentionally boring. Each stage writes a JSON artifact the next one reads, so I can swap any piece without retraining anything else.
When the coaching report said something weird I could open the rally trace and find the bad frame in seconds.
Player and ball detection
Started with off-the-shelf YOLOv8n[1] so I could run CPU inference on my MacBook. Out of the box it found players fine (people are a COCO class) but hallucinated a tennis ball ~40% of the time and missed it the rest. A tennis ball at 1080p from baseline is a 6x6 pixel smear, and at 80 mph it ghosts across 12-15 pixels per frame.
Labeled ~1,200 frames from my own matches (CVAT, two evenings, one beer) and fine-tuned YOLOv8n with ball as a new class. Two things mattered more than model choice.
Frame rate. Shooting at 60 fps roughly halved track fragmentation vs 30 fps. At 30 the ball jumps far enough between frames that the tracker drops association on hard groundstrokes.
Kalman on the ball track. YOLO still drops the ball on contact frames and against dark backgrounds. A constant-velocity Kalman filter on the centroid coasts through up to 8 missing frames (~130 ms). Anything longer and I declare the track dead and re-acquire.
For players I run ByteTrack[5] on top of YOLO. Two players, fixed camera, baseline view, it basically never swaps IDs. One edge case: a player chasing a wide ball partway out of frame. I hold the last known position for 0.5 s before re-initializing.
End-to-end detection runs at ~84 fps on an M2 Pro, batch size 1. A 90-minute match processes in ~32 minutes.
Court homography
To turn pixel coords into court coords I need a planar homography from the tripod view to a top-down mini-court. A tennis court is a known rigid rectangle (23.77 m by 10.97 m for doubles), so I get ground-truth correspondences as soon as I find four court points in the image.
I detect the four service-box corners with a thin Hough-line pass on the green channel, then solve for H with OpenCV's findHomography[2] under RANSAC. RANSAC matters because players' feet sit on top of the lines I'm trying to detect, and naive least-squares gets dragged around by those outliers.
With H in hand, every foot-position and every ball-bounce projects to court meters via (x, y) = (u'/w', v'/w'). Calibration error vs manually annotated frames is ±18 cm at the near baseline and ±35 cm at the far baseline from foreshortening. Good enough for heatmaps, not good enough for line calls.
I re-fit H once per game (between odd games, when players change ends) to absorb tripod drift. First version skipped this and the coverage heatmap slowly slid off-court over an hour.
Shot detection and speed
A shot event is the moment a racket strikes the ball. I don't try to detect rackets (too small, too motion-blurred). I infer the shot from the ball trajectory.
def is_shot_event(ball_track, i, fps=60):
# Look at ball velocity vectors before and after frame i
v_before = ball_track[i].pos - ball_track[i-3].pos
v_after = ball_track[i+3].pos - ball_track[i].pos
# Direction change in court-plane radians
cos_theta = (v_before @ v_after) / (norm(v_before) * norm(v_after) + 1e-6)
angle = math.acos(clip(cos_theta, -1, 1))
# A shot reverses ball direction sharply AND happens near a player
near_player = min_dist_to_player(ball_track[i].pos) < 1.2 # meters
return angle > math.radians(110) and near_playerShot detection via ball direction reversal.
Bounces use the same trick with the constraint that vertical pixel velocity flips sign while horizontal velocity continues. A small state machine ( (serve) → bounce → shot → bounce → shot → ... ) and any deviation ends the rally.
Speed is pixel_velocity × meters_per_pixel at the contact location, where meters_per_pixel comes from the local Jacobian of H. Honest error bars: my buddy stood behind the fence with a radar gun for three rallies. Flat groundstrokes came in within ±6 mph, but I underestimate heavy topspin by 8-12 mph because the real 3D arc is taller than my 2D projection assumes. I show ±10% confidence bands in the dashboard instead of pretending to be Hawk-Eye.
Segmenting rallies
A rally starts at the serve toss (vertical ball trajectory near a player behind the baseline) and ends on one of: two bounces on one side with no shot in between (out / not returned), the trajectory dying in the net region, or no ball detection for > 1.0 s after the last shot.
Edge cases. Lets: second-serve toss within 5 s of the previous serve, merged. Challenges and pauses: long detection gaps with both players stationary, close the rally and drop the dead time. Practice serves between games: filtered by requiring a rally start to follow a score-pause longer than 8 s. Still get a few false length-1 rallies I drop in post.
Each rally emits one record: shot count, duration, max shot speed, player positions at 10 Hz, and outcome (winner / unforced / forced). The outcome classifier is rule-based and I'm not happy with it. Want to swap for a small learned head eventually.
The coaching report
This is the part I expected to be a "throw the video at Claude" call, and the part where I aggressively did the opposite.
Claude[3] never sees pixels. It gets a compact JSON summary of the match (typically 5-7 KBregardless of match length) plus a system prompt defining what a coaching report should look like.
{
"match_id": "2026-03-08_vs_kevin",
"duration_min": 78,
"rallies": [
{
"id": 42,
"server": "me",
"shots": 9,
"duration_s": 14.2,
"max_speed_mph": 71,
"outcome": "unforced_error",
"ending_shot": {"player": "me", "type": "forehand", "court_zone": "deuce_deep"},
"my_court_coverage_m": 31.4,
"opp_court_coverage_m": 22.1
}
],
"aggregates": {
"avg_rally_len": 4.6,
"ue_rate_by_rally_len": {"1-3": 0.12, "4-6": 0.19, "7+": 0.41},
"fh_vs_bh_ue_ratio": 1.8,
"coverage_heatmap_grid": "[base64 9x6 floats]"
}
}The rally trace Claude actually sees.
Two reasons this beats "describe this video." Tokens: a 90-minute match at 1 fps is ~5,400 frames for a vision model. A 6 KB JSON blob is ~1,800 tokens. Two orders of magnitude cheaper. Grounding: when Claude says "your UE rate jumps to 41% on long rallies," that number is literally in the input. It's narrating a statistic, not inferring one from blurry pixels. Hallucination on quantitative claims drops to near zero.
The prompt is basically: "You're a tennis coach. Here's a structured match summary. Find the 3 most actionable patterns, ground each in a specific number from the input, suggest one drill per pattern." Claude is great at narration and would be much worse at counting. Let each model do what it's good at.
What the pipeline found
Ran it on 14 of my own matches over two months, ~220 rallies. The dashboard shows rally-length distribution (peaks at 3 shots, long tail to 18), a 9x6 coverage heatmap, UE rate by rally length, and shot-speed by stroke.
The big surprise: my UE rate is 2.2x higher on rallies of 7+ shots than on 3-6. Watching live I'd have told you my long-rally game was a strength. Counted properly, the opposite is true. I get impatient and pull the trigger on a forehand that isn't there. Claude flagged this on the first batch and prescribed cross-court forehand consistency, 20-ball sets, no winners allowed. Three weeks in, the 7+ UE rate is down to 0.31. Small sample, right direction.
Second pattern I'd missed: my coverage is heavily biased to the deuce side, with a 1.4 m gap on the ad-side baseline opponents have been exploiting. I had a vague sense of this. Seeing the grid made it impossible to ignore.
What this thing isn't
Occlusion is the biggest source of error. When the ball crosses a dark-shirted player, YOLO drops it and Kalman coasts. On ~4% of shots I lose the contact frame entirely and have to interpolate, which corrupts the speed estimate.
No height. The top-down homography assumes the ball is on the court plane, which is only true at the bounce. Topspin is systematically underestimated. A second camera would fix it. So would a learned 3D ball-trajectory model[4]. Haven't wanted either badly enough yet.
Ball blur at 1080p. At higher speeds the ball is a 3-pixel streak and YOLO confidence drops below threshold. 4K would help. My phone storage would not.
Rule-based outcome classifier. Want to train a small head on rally features to call winner vs forced vs unforced. Needs a few hundred labeled rallies, which is a Saturday I haven't spent.
Doubles is unsupported. Four-player tracking is fine, but the rally state machine assumes one player per side and the shot heuristics break with poaching.
/ footnotes
- [1]Ultralytics YOLOv8 — docs.ultralytics.com/models/yolov8. ↩
- [2]OpenCV
findHomographywith the RANSAC variant — docs.opencv.org/tutorial_homography. ↩ - [3]Anthropic Claude API, used for the coaching report generation — anthropic.com. ↩
- [4]Huang et al., TrackNet: A Deep Learning Network for Tracking High-speed and Tiny Objects in Sports Applications — arxiv.org/abs/1907.03698. ↩
- [5]ByteTrack (Zhang et al., 2022), used for player ID tracking — github.com/ifzhang/ByteTrack. ↩