
Fourth Down Is Still Football’s Biggest Coaching Problem
I analyzed 107,000 fourth-down decisions since 1999. The findings are troubling for NFL fans.

The Play That Changed Everything

November 15, 2009. The New England Patriots lead the Indianapolis Colts 34-28. Two minutes and eight seconds remain. Peyton Manning — arguably the best quarterback in football at that moment — is standing on the sideline. The Patriots have the ball on their own 28-yard line. It is fourth and two.

Bill Belichick waves off the punter.

The decision is so unexpected that some Patriots players jog onto the field before being waved back. Tom Brady takes the snap and swings a short pass to Kevin Faulk, who catches it, gets driven backward, and is spotted one inch short of the first down marker. The Colts take over at the New England 29. Four plays later, Manning hits Reggie Wayne for a one-yard touchdown with 13 seconds left. Indianapolis wins 35-34.

The backlash was immediate and overwhelming. Rodney Harrison, a former Patriot turned NBC analyst, called it “the worst coaching decision I’ve ever seen Bill Belichick make.” ESPN ran columns. Talk radio spent days on it. The consensus was clear: Belichick had gambled, lost, and cost his team the game.

There was just one problem. He was right.

Win probability models — the same framework I used to build this analysis — show that going for it gave New England a 79% chance of winning. Punting would have dropped that to roughly 70%. Belichick took the higher-percentage path. Brady’s pass was caught. The spot was bad. The play failed. The decision didn’t.

This gap between outcome and decision value is central to this article. Its implications extend beyond one night in November.

What Win Probability Actually Measures

Before going further, it’s worth explaining the metric that drives this entire analysis: Win Probability Added (WPA).

At any moment in an NFL game, a win probability model calculates how likely a team is to win based on score, field position, down, distance, and time remaining. These models are trained on thousands of historical games, so when you’re up 7 with 4 minutes left on your own 40-yard line, the model can say — based on how similar situations have resolved historically — that you win about 82% of the time.

WPA simply measures how much a single play moves that needle:

WPA = Win Probability (after play) − Win Probability (before play)

A great conversion on 4th & 1 might add +0.08 WPA. A turnover on 4th & goal might cost −0.15 WPA. The number is always relative to what you had before.

For grading decisions specifically, I compute the gap between the optimal decision’s expected WPA and the actual decision’s expected WPA in a given game state. That gap — averaged across all of a coach’s decisions — is what I call the Decision Quality Score, or DQS. Lower is better.

| Metric | Definition | Direction |
|---|---|---|
| DQS (Decision Quality Score) | Average WPA gap between optimal call and actual call | Lower = better |
| ODR (Optimal Decision Rate) | % of plays where coach made the historically optimal call | Higher = better |
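As a rough illustration of how the two metrics fall out of graded plays, here is a minimal sketch. The field names (`expected_wpa_optimal`, `expected_wpa_actual`) and the four sample plays are hypothetical, invented only to show the arithmetic; they are not taken from the dataset.

```python
from statistics import mean

def grade_coach(plays):
    """Return (DQS, ODR) for a list of graded fourth-down decisions.

    Each play is a dict with:
      expected_wpa_optimal: expected WPA of the best available call
      expected_wpa_actual:  expected WPA of the call the coach made
      made_optimal:         True if the coach's call was the optimal one
    """
    gaps = [p["expected_wpa_optimal"] - p["expected_wpa_actual"] for p in plays]
    dqs = mean(gaps)                                       # lower is better
    odr = sum(p["made_optimal"] for p in plays) / len(plays)  # higher is better
    return dqs, odr

# Hypothetical plays, for illustration only.
plays = [
    {"expected_wpa_optimal": 0.02, "expected_wpa_actual": 0.02, "made_optimal": True},
    {"expected_wpa_optimal": 0.03, "expected_wpa_actual": -0.01, "made_optimal": False},
    {"expected_wpa_optimal": 0.01, "expected_wpa_actual": 0.01, "made_optimal": True},
    {"expected_wpa_optimal": 0.00, "expected_wpa_actual": 0.00, "made_optimal": True},
]
dqs, odr = grade_coach(plays)
print(f"DQS = {dqs:.3f}, ODR = {odr:.0%}")
```

One wrong call out of four costs 0.04 WPA here, so the average gap is 0.01 and the optimal decision rate is 75%.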

Before getting to the field map, one number sets the table. When NFL teams actually go for it on 4th down, how often do they convert?

Raw 4th down conversion rates by yards to go, 1999–2025. On 4th & 1, teams convert two-thirds of the time. Even on 4th & 4–6, the success rate is nearly a coin flip.

Two-thirds of the time on 4th & 1. More than half the time on 4th & 2–3. Even at medium distance — 4th & 4–6 — teams succeed on 43% of attempts, nearly coin-flip odds. These aren’t low-percentage gambles. They’re manageable probabilities that coaches have been systematically undervaluing. That’s the key to understanding why the WPA math works the way it does.

This analysis yields a field-position cheat sheet, with four versions based on yards-to-go categories:

The historically optimal 4th down call at every field position for each yards-to-go category, averaged across all game situations. The zones are more aggressive than most fans — and most coaches — expect.

The green covers far more of the field than conventional wisdom would suggest. On 4th & 1–3, the data says go for it from your own 20-yard line all the way to the opponent’s end zone. On 4th & 4–6, the go zone still stretches from midfield to the opponent’s 20. Even on 4th & 7+, going for it is optimal at midfield — a call most coaches would never make. Most coaches aren’t operating anywhere close to this. The next two sections show exactly how far they’re missing.

The NFL’s Quiet Revolution — And Its Limits

Belichick’s fourth-and-two decision in 2009 didn’t just generate controversy. As Kevin Clark wrote in The Ringer in 2019, it started a conversation — one that analytics departments across the league had been waiting years to have. Teams began adding win-probability consultants. Front offices that had dismissed the math suddenly had a famous case study to point to.

But here is the part of the story that rarely gets told: the actual coaching behavior didn’t budge for nearly a decade. From 2009 through 2016, the league-wide go-for-it rate bounced between 12% and 13% — almost exactly where it had been in 2008. The conversation changed immediately. The calls didn’t. The real inflection came in 2017 and 2018, when the rate jumped from 13% to 18% over two seasons, driven by a new generation of coaches who had already adopted analytics. The play that supposedly changed the NFL took eight years to actually change how the NFL coached.

NFL-wide go-for-it rate on fourth down, 1999–2025. The analytics revolution is real — but it arrived late and has plateaued.

In 1999, NFL teams went for it on fourth down just 11.2% of the time. By 2025, that number had nearly doubled to 22.0%. That sounds like enormous progress — and in some ways it is. But consider what 22% actually means: coaches are still punting or kicking on more than three out of every four fourth downs. Given how often the data says they should be going for it, that conservative baseline tells a story about how deeply ingrained the old instincts remain.

Not every coach has resisted the shift. John Harbaugh set what Football Outsiders described as an all-time record for fourth-down aggressiveness in 2019, and was publicly celebrated by the analytics community for openly deploying win-probability models on game-day decisions. Across 17 seasons and 2,253 fourth-down decisions in my dataset, he ranks 27th out of 167 qualifying coaches — not because he goes for it constantly (his career go-for-it rate is 13.6%), but because he consistently picks the right spots. Ron Rivera — “Riverboat Ron,” a nickname earned after a string of aggressive fourth-down calls in 2013 — was one of the first coaches to publicly credit the NYT’s Fourth Down Bot for changing how he thought about the game, and ranks 37th in my analysis. In the current era, Dan Campbell has pushed it furthest: his Detroit Lions go for it on 28.4% of fourth downs, the highest rate among established head coaches in the analytics era. My analysis ranks him 9th out of 167 coaches, with a 73% optimal decision rate — the aggression isn’t reckless, it’s accurate. But Campbell and Harbaugh are outliers. The median coach still punts.

Average coach decision quality improved about 28% between 1999 and 2025. While this is progress, even today, coaches still make the wrong call on more than a quarter of fourth downs.

What Bad Decisions Actually Cost

Progress tells part of the story. The cost of imperfection tells the rest.

Each incorrect decision carries a WPA cost — the expected win probability that evaporates when a coach punts instead of going for it, or kicks a field goal in a situation where the data says go. Add those costs up across every fourth down in a season, and the number is striking.

Total WPA left on the table each season from suboptimal 4th down decisions, 1999–2025. The gap is narrowing — but it hasn’t closed.

In 1999, NFL teams collectively left 37.1 WPA on the table in a single season from bad fourth-down calls alone. By 2025, that number had dropped to 25.7 WPA. Spread across 32 teams, that’s still roughly 0.8 WPA per team per season — the equivalent of nearly one free win, unclaimed, every year. That is the difference, in many cases, between a playoff berth and a missed one.
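The per-team figure is simple division, and the same two season totals also give the size of the improvement. A quick check using only the numbers quoted above:

```python
# Season totals quoted in the text.
wpa_lost_1999 = 37.1
wpa_lost_2025 = 25.7
teams_2025 = 32

per_team_2025 = wpa_lost_2025 / teams_2025
relative_drop = 1 - wpa_lost_2025 / wpa_lost_1999

print(f"WPA lost per team in 2025: {per_team_2025:.2f}")
print(f"Drop in league-wide WPA lost since 1999: {relative_drop:.0%}")
```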

Improvement deserves recognition, but the gap persists. Coaches are more accurate in clear-cut situations, but the challenge lies in less obvious circumstances.

Where Coaches Go Wrong Most Often

The most mismanaged fourth-down situations are not the ones that feel risky. They’re the ones that feel safe — where decades of conventional wisdom have convinced coaches that the conservative call is obviously correct. The data says otherwise.

The wrong-call heatmap below shows where errors cluster across the entire field:

Wrong-call rate by field position and yards to go. Darker red = coaches make the suboptimal call more often.

Strip away the percentages and the heatmap has only two modes: the situations coaches handle correctly, and everything else. One row of the map, deep in a team’s own territory with more than 80 yards to the opponent’s end zone, is almost entirely white, with a 0–2% wrong rate. That’s the row where the right call is obvious: punt. One corner, long fourth downs in the red zone, is also light. Kicking a field goal on 4th & 7+ near the goal line is fine.

Everywhere else, coaches are wrong roughly half the time. Midfield at any distance. 4th & 1 from your own 30. 4th & 2–3 in the red zone. 4th & 4–6 in opponent territory. These aren’t scattered hot spots — they are the majority of the field, sitting at 45–55% wrong-call rates that are visually indistinguishable from each other. Outside of the narrow band of “obviously punt” and “obviously kick,” NFL coaches are effectively coin-flipping the decision.

Two situations are worth singling out because they cut directly against entrenched conventional wisdom.

4th & 2–3 in the red zone (55% wrong-call rate). When a team faces fourth and short inside the opponent’s 20, coaches kick a field goal more than half the time even though going for it is the historically better decision. The logic is intuitive — you’re in scoring position, take the guaranteed points — but it misses a crucial asymmetry. Convert on 4th & 2 in the red zone and you’re likely scoring a touchdown, worth four more points than the field goal you left behind. Fail, and your opponent takes over inside their own 20 with no realistic path to immediate points. The WPA math consistently favors going for it. Coaches consistently don’t.

4th & 1 from your own side of the field (52% wrong-call rate). Fourth and inches from your own 30 or 40 is routinely treated as an automatic punt. It shouldn’t be. Conversion rates on 4th & 1 are roughly two-thirds league-wide, and a successful conversion extends a drive that is already in a neutral or slightly favorable field-position context. Punting trades a ~67% shot at keeping possession for the certainty of giving the ball back. Half the time, coaches take the certainty.

One pattern runs through all of it: once the situation leaves the “obvious” column, coaches systematically default to the conservative option. The field goal feels safe. The punt feels responsible. The numbers say both instincts, applied too broadly, are costing teams games.

Who’s Doing It Right

The flip side of the wrong-call story is the list of coaches who consistently get it right. When I ranked all 167 qualifying head coaches on decision quality, the names at the top weren’t necessarily the household ones.

Nick Sirianni ranks second. The Eagles’ head coach made the historically optimal call 76% of the time across 617 decisions — a higher accuracy rate than any coach typically associated with football wisdom. Matt LaFleur of the Packers ranks fourth. Sean McVay sits 21st overall, but his 2025 season ranks among the best individual coaching seasons in my 27-year dataset. Dan Campbell — already noted for his aggression — lands ninth, and his 2022 season with Detroit was similarly elite.

The single most surprising finding came from Andy Reid. In 1999, his first NFL head-coaching season, Reid went for it on fourth down just 5.6% of the time. By 2025, at age 67, his Chiefs went for it 24.2% of the time — more than four times his career starting point — with an optimal-decision rate of 79%. Reid rebuilt his coaching philosophy in public, season by season, and won three Super Bowls along the way. It’s the cleanest case study of late-career adaptation in the data.

And at the very top: the best single-season decision-quality mark in my entire dataset belongs to a head coach who had never held the job before. Brian Schottenheimer, in his rookie year as the Cowboys head coach in 2025, went for it on 29.5% of fourth downs with an optimal-decision rate of 82% across 105 qualifying decisions. Highest accuracy, nearly-highest aggression, zero prior head-coaching experience. It is one season — small sample, and he will have to prove it over time — but in 27 years of data, no rookie head coach has matched it.

I built a Coach Explorer that lets you sort and filter all 167 coaches by era, aggression, and decision quality. If you want to argue with the rankings, start there.


Explore It Yourself

The analysis above is built on historical averages, which is where it has to start. But fourth down is situational, and every game is different. I built a 4th Down Decision Calculator that takes any specific scenario — field position, yards to go, score, time remaining — and returns the historically optimal call, along with what real NFL coaches actually chose in that exact situation. If you want to start somewhere familiar, type in the Belichick scenario.

The second tool, a Decision Boundary Map powered by an XGBoost model trained on all 107,000 plays, visualizes optimal calls across every field position and yards-to-go combination simultaneously, updating in real time as you move the game-state sliders.


How I Built This

A note on methodology for the analytically curious.

The core of this analysis is a WPA baseline framework built on nflfastR play-by-play data covering every NFL regular season from 1999 to 2025 — approximately 107,000 fourth-down plays, after filtering for complete situational data.

Game state binning. Each play is assigned to a game-state bucket defined by four dimensions: field position (5 bins), yards to go (4 bins), score differential (7 bins), and time remaining (5 bins). This yields up to 700 unique game states, though many are sparsely populated. Buckets with fewer than 10 plays for a given decision type are excluded from the baseline to avoid noise-driven conclusions.
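A minimal sketch of the bucketing step follows. The cut points are assumptions: the text above gives only the number of bins per dimension, not where the boundaries fall, so the `EDGES` values are placeholders. The feature names match the nflfastR columns named later in this section.

```python
from bisect import bisect_right
from math import prod

# Hypothetical cut points. Only the bin counts (5, 4, 7, 5) come from the text.
EDGES = {
    "yardline_100": [20, 40, 60, 80],                   # 5 field-position bins
    "ydstogo": [2, 4, 7],                               # 4 bins: 1, 2-3, 4-6, 7+
    "score_differential": [-16, -8, 0, 1, 9, 17],       # 7 bins, tie is its own bin
    "game_seconds_remaining": [300, 900, 1800, 2700],   # 5 bins
}

def game_state(play):
    """Return the bucket index tuple for one play (a dict of raw features)."""
    return tuple(bisect_right(edges, play[name]) for name, edges in EDGES.items())

# Each dimension has len(edges) + 1 bins, so the product is the state count.
n_states = prod(len(edges) + 1 for edges in EDGES.values())

# The 2009 Belichick call: own 28 (72 yards out), 2 to go, up 6, 2:08 left.
belichick = {
    "yardline_100": 72,
    "ydstogo": 2,
    "score_differential": 6,
    "game_seconds_remaining": 128,
}
print(n_states, game_state(belichick))
```

Whatever the real boundaries are, the count works out the same way: 5 × 4 × 7 × 5 = 700 possible buckets.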

Recency weighting. Rather than treating all 27 seasons equally, each play is weighted by:
```
w = 0.85^(2025 − season)
```
This means 2025 data carries full weight (w = 1.0), 2024 carries 0.85, 2020 carries ~0.44, and 1999 carries ~0.015. The decay factor of 0.85 was chosen to reflect the meaningful philosophical shift in coaching that began around 2010, while still preserving enough historical signal to make rare game states statistically stable.
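The weighting formula is a one-liner. The sketch below evaluates it for a few seasons; the function and parameter names are illustrative rather than taken from the project code.

```python
def recency_weight(season, latest_season=2025, decay=0.85):
    """Exponential recency weight: 1.0 for the latest season, decaying per year back."""
    return decay ** (latest_season - season)

for season in (2025, 2024, 2020, 2010, 1999):
    print(season, round(recency_weight(season), 3))
```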

Optimal decision labeling. Within each game-state bucket, I compute the weighted average WPA for each decision type. The decision with the highest weighted mean WPA is labeled optimal. Every play is then tagged with whether the actual decision matched the optimal label (made_optimal = True/False) and the gap between optimal WPA and actual WPA (decision_gap).
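The labeling step can be sketched in plain Python as below. This is an illustration under stated assumptions, not the project's code: the dict keys (`state`, `decision`, `wpa`, `weight`) and the sample plays are hypothetical, while the 10-play minimum comes from the binning rule described earlier.

```python
from collections import defaultdict

MIN_PLAYS = 10  # buckets with fewer plays for a decision type are excluded

def label_optimal(plays, min_plays=MIN_PLAYS):
    """Map each game-state bucket to (best decision, its weighted mean WPA)."""
    # (state, decision) -> [sum of weight * wpa, sum of weights, play count]
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for p in plays:
        acc = sums[(p["state"], p["decision"])]
        acc[0] += p["weight"] * p["wpa"]
        acc[1] += p["weight"]
        acc[2] += 1

    best = {}
    for (state, decision), (weighted_sum, weight_total, n) in sums.items():
        if n < min_plays:
            continue  # too sparse to trust
        mean_wpa = weighted_sum / weight_total
        if state not in best or mean_wpa > best[state][1]:
            best[state] = (decision, mean_wpa)
    return best

# Hypothetical plays in a single bucket, for illustration only.
plays = (
    [{"state": "s1", "decision": "go", "wpa": 0.02, "weight": 1.0}] * 12
    + [{"state": "s1", "decision": "punt", "wpa": -0.01, "weight": 0.85}] * 12
    + [{"state": "s1", "decision": "field_goal", "wpa": 0.30, "weight": 1.0}] * 3
)
best = label_optimal(plays)
print(best["s1"][0])
```

In this toy bucket the field goal has the highest raw average, but with only three plays it falls under the minimum and is ignored, so "go" is labeled optimal.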

Coach grading. DQS is the mean decision_gap across all of a coach’s plays. ODR is the proportion of plays where made_optimal = True. Coaches with fewer than 50 qualifying decisions are excluded. Grades are assigned by head coach name rather than offensive coordinator, on the basis that head coaches set the fourth-down philosophy even when play-calling duties are delegated.

The ML model. The Decision Boundary Map uses three XGBoost regressors — one each for go-for-it, punt, and field goal — trained on a temporal split (1999–2023 training, 2024–2025 held out). Features are exact rather than binned: yardline_100, ydstogo, score_differential, game_seconds_remaining, and season_norm (season scaled 0–1 over the full range). The temporal split is critical: a random split would leak future play patterns into the training set, inflating apparent model performance.
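The split and the `season_norm` feature can be sketched without any modeling library. The helper names below are illustrative, and the three sample rows are made up; only the season boundaries come from the description above.

```python
def season_norm(season, first=1999, last=2025):
    """Scale a season to the 0-1 range over the full span of the data."""
    return (season - first) / (last - first)

def temporal_split(plays, last_train_season=2023):
    """Train on seasons up to and including the cutoff, hold out everything later."""
    train = [p for p in plays if p["season"] <= last_train_season]
    held_out = [p for p in plays if p["season"] > last_train_season]
    return train, held_out

# Hypothetical rows, for illustration only.
plays = [{"season": s, "season_norm": season_norm(s)} for s in (1999, 2012, 2023, 2024, 2025)]
train, held_out = temporal_split(plays)
print(len(train), len(held_out))
```

Because every held-out play comes from a later season than every training play, nothing about 2024 or 2025 can reach the model during fitting.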

All code, notebooks, and Streamlit apps are open source in the project’s GitHub repository.

The Gap Is Closing. It Hasn’t Closed.

Belichick’s 4th & 2 in 2009 is remembered as a gamble that failed. The more accurate reading is that it was a correct decision that happened to produce a bad outcome — and that distinction matters.

Outcome-based thinking is how most coaches still operate. If you go for it and convert, the decision looks brilliant. If you fail, it looks reckless. But win probability doesn’t care about outcomes in isolation. It cares about expected value — what decision, made consistently across thousands of similar situations, produces the best results over time.

The league has moved meaningfully in the right direction since 2009. The go-for-it rate has nearly doubled. The average DQS has improved by 28%. The total WPA left on the table has fallen by roughly a third. That’s genuine progress, driven by a generation of coaches who grew up watching analytics departments gain influence in front offices.

But 25.7 WPA still disappears every season from suboptimal calls. Coaches still make the wrong decision on 4th & short in the red zone more than half the time. The most conservative coaches in the league still punt in situations where the math is unambiguous.

Belichick made the right call on a November Sunday in 2009. The league has had sixteen years to catch up. Most of it still hasn’t.

Data source: nflfastR via nflverse, 1999–2025 regular seasons. Full methodology and source code: github.com/shanethakkar/nfl-4th-down-analysis.
April 22, 2026
Who Is Actually the Best F1 Driver? A Bayesian Approach to Separating Skill from the Car
There is a debate that has run through Formula 1 for decades, resurfacing whenever a dominant driver wins another championship: how much of that success belongs to the driver, and how much to the car?

The most common version of this argument targets Lewis Hamilton. Seven world championships, 103 race wins — but was he truly the greatest driver of his generation, or was he simply the beneficiary of the most dominant car in the sport’s history? Fans of Fernando Alonso, in particular, have argued for years that had the Spaniard been given the same machinery, the story might have looked very different.

You cannot answer this by looking at the championship standings. In Formula 1, the car is not a constant — it is arguably the single biggest variable on the grid. Two drivers on the same grid are not competing on equal terms. They never are. That is what makes F1 uniquely difficult to analyze and uniquely interesting to try.

This project attempts to answer the question properly, using a Bayesian hierarchical model to isolate each driver’s individual contribution to race outcomes after controlling for the car they drove.

A Quick Primer on How F1 Works

If you follow F1 closely, skip ahead. For everyone else, here is what you need to know.

Formula 1 has two layers of competition: constructors (the teams that build the cars) and drivers (the people who race them). Each constructor fields two drivers per race. The championship is contested by both — there is a Drivers’ Championship and a separate Constructors’ Championship.

The critical point is this: the car matters enormously. Unlike football, tennis, or basketball, where the equipment is standardized, each F1 team designs and builds its own car to its own specifications. In any given season, one or two teams will produce a car that is simply faster than everyone else’s. The drivers in those cars will win races not necessarily because they are better drivers, but because they are in better machinery.

To illustrate how much this matters: this model found that the 2022 Red Bull had a team-season effect of −1.93, meaning the car alone was worth nearly 2 finishing positions beyond what its starting grid slots would historically produce. Before Max Verstappen or Sergio Perez did anything at all, the car was already doing the work. More on what that number means shortly.

This is the problem. And it is why simply counting wins and championships is a poor proxy for driver skill.

The Approach: Measuring “Value Added on Sunday”

The core idea behind this model is to stop asking “where did the driver finish?” and start asking: did the driver finish better or worse than we would expect, given where they started?

Think of it as measuring not the result, but the surprise in the result. We call this the driver’s “Value Added on Sunday” — how much better or worse did they perform relative to what their starting position and car would historically produce?

Step 1 — The Expected Finish Benchmark

For each starting position (1st through 20th) in each season, we compute the historical average finishing position for all drivers who started from that slot that year. This becomes the benchmark — what a typical driver in a typical car historically does from that grid position in that specific season.

For example, in 2023, the benchmarks looked like this for a few key grid slots:

| Starting Position | 2023 Expected Finish |
|---|---|
| P1 (1st) | 1.62 |
| P8 (8th) | 8.76 |
| P15 (15th) | 10.69 |
| P20 (20th) | 14.33 |

Notice that P20 expects to finish around P14 on average — this reflects the reality that a significant number of cars retire or get passed throughout a race, so starting last does not mean finishing last.

The benchmark is computed within each season, not pooled across all years. A 7th place (P7) starting position in 2014 is a completely different competitive reality from P7 in 2023. Pooling them would introduce era-blending bias before the model even runs.
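A minimal sketch of the benchmark computation follows, assuming race results are available as rows with `season`, `grid`, and `finish` fields. Those field names and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def expected_finish(results):
    """Average finishing position for each (season, grid slot) pair."""
    by_slot = defaultdict(list)
    for r in results:
        by_slot[(r["season"], r["grid"])].append(r["finish"])
    return {key: mean(finishes) for key, finishes in by_slot.items()}

# Hypothetical rows, for illustration only.
results = [
    {"season": 2014, "grid": 7, "finish": 5},
    {"season": 2014, "grid": 7, "finish": 7},
    {"season": 2023, "grid": 7, "finish": 8},
    {"season": 2023, "grid": 7, "finish": 10},
]
benchmark = expected_finish(results)
print(benchmark[(2014, 7)], benchmark[(2023, 7)])
```

Keying on the season as well as the grid slot is what keeps P7 in 2014 separate from P7 in 2023.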

Step 2 — The Residual

Once we have the benchmark, we compute a number called the residual — the gap between what happened and what we expected — for every driver in every race:
```
finish_residual = finishing_position - expected_finish
```
A negative residual is good — the driver finished better than expected. A positive residual is bad — they finished worse.

A concrete example using real 2023 data:

Verstappen, Round 1 (Bahrain): Starts P1, finishes P1. The 2023 benchmark for P1 is 1.62, so his residual is 1 − 1.62 = −0.62. A small negative — he did what was expected, with a slight outperformance.

Perez, Round 3 (Australia): Starts P20 (last place), finishes P5. The 2023 benchmark for P20 is 14.33, so his residual is 5 − 14.33 = −9.33. A massive negative — he dramatically outperformed what anyone starting last would historically achieve.

Both drivers are in the same Red Bull. The car did not explain the difference between these two residuals. Something else did.
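The two residuals above can be reproduced directly from the 2023 benchmark values quoted in Step 1. The function name is illustrative.

```python
# 2023 expected finishes for the grid slots quoted in the text.
benchmark_2023 = {1: 1.62, 8: 8.76, 15: 10.69, 20: 14.33}

def finish_residual(grid, finish, benchmark):
    """Negative means the driver finished better than the grid slot predicts."""
    return finish - benchmark[grid]

verstappen_bahrain = finish_residual(grid=1, finish=1, benchmark=benchmark_2023)
perez_australia = finish_residual(grid=20, finish=5, benchmark=benchmark_2023)
print(round(verstappen_bahrain, 2), round(perez_australia, 2))
```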

Step 3 — Separating Driver from Car

Here is where the model comes in. A raw residual still contains two signals mixed together — what the driver contributed and what the car contributed beyond the benchmark. These need to be separated.

A fast car does not just show up in qualifying pace — it also shows up in race pace above qualifying. In 2023, Red Bull did not just qualify near the front; they also consistently converted good starting positions into even better finishing positions. The benchmark for each grid slot is computed from all cars that ever started there, including those that finished in the midfield. So a dominant car starting P2 and finishing P1 still generates a negative residual, because the P2 benchmark includes every midfield car that has ever started second and typically fallen back.

This is why the model includes a TeamSeason Effect — a per-constructor, per-season term that absorbs the residual car advantage the benchmark did not fully capture. Only after accounting for both the benchmark and the team-season effect does the remaining signal get attributed to the driver.

In the 2023 example above, Red Bull’s TeamSeason Effect is −1.21. That means approximately −1.21 of every driver’s residual that season is attributable to the car’s race-pace advantage above what the qualifying position already captured. The remaining residual — after subtracting the car’s contribution — is what the model calls the Driver Effect.

The Model Structure

The model decomposes each race residual into three components. In plain English:
1. How good the driver is — their consistent tendency to over or underperform their grid slot across their career
2. How good the car was that season — the constructor’s race-pace advantage in that specific year, above and beyond what qualifying position already captures
3. Whether the driver crashed out through their own fault — an explicit penalty for driver-fault retirements
The math, for those interested:
```
finish_residual ~ StudentT(ν, μ, σ)
μ = α + Driver_Effect + TeamSeason_Effect + β_dnf × dnf_driver_fault
```
The model is hierarchical — meaning driver effects and team-season effects are estimated jointly, each informing the other. A driver’s effect is not estimated in isolation; it is estimated in the context of every team they drove for and every season of data available. This structure lets the model share information intelligently across the dataset rather than treating each driver-season combination as completely independent.
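To make the mean equation concrete, here is a small sketch of the location term μ for a single race entry. It is not the fitted model. The only value taken from this article is the 2023 Red Bull TeamSeason Effect of −1.21; the intercept, driver effect, and DNF penalty are hypothetical.

```python
def expected_residual(alpha, driver_effect, team_season_effect, beta_dnf, dnf_driver_fault):
    """Location parameter mu of the StudentT likelihood for one race entry."""
    return alpha + driver_effect + team_season_effect + beta_dnf * int(dnf_driver_fault)

# -1.21 is quoted in the text; every other number here is hypothetical.
clean_race = expected_residual(
    alpha=0.0, driver_effect=-0.5, team_season_effect=-1.21,
    beta_dnf=6.0, dnf_driver_fault=False,
)
driver_crash = expected_residual(
    alpha=0.0, driver_effect=-0.5, team_season_effect=-1.21,
    beta_dnf=6.0, dnf_driver_fault=True,
)
print(round(clean_race, 2), round(driver_crash, 2))
```

The DNF term only switches on for driver-fault retirements, so a crash moves the expected residual sharply positive without touching the driver's underlying effect.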

Why StudentT and not a standard bell curve? F1 race residuals have what statisticians call “fat tails” — extreme outcomes, like finishing 10 places better than expected, happen more often than a standard bell curve would predict. Safety cars, first-lap incidents, and rain races all create these extremes. A standard Gaussian (bell curve) model would over-fit to these outliers, pulling every driver’s estimated effect in misleading directions. The StudentT distribution — a version of the bell curve with heavier tails that expects surprises more often — handles these robustly without discarding them. The model learned from the data itself that ν (the parameter controlling tail heaviness) is approximately 3.5 — confirming that F1 results are genuinely extreme relative to what a normal distribution would expect.
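One way to see the difference is to compare how harshly each distribution scores an extreme residual. The sketch below implements both log-densities from their textbook formulas. The scale of 3 positions is an assumption for illustration, and ν = 3.5 is the value reported above.

```python
from math import lgamma, log, pi

def normal_logpdf(x, mu=0.0, sigma=1.0):
    """Log-density of a Normal(mu, sigma) at x."""
    z = (x - mu) / sigma
    return -0.5 * log(2 * pi) - log(sigma) - 0.5 * z * z

def student_t_logpdf(x, nu, mu=0.0, sigma=1.0):
    """Log-density of a location-scale Student-t with nu degrees of freedom."""
    z = (x - mu) / sigma
    return (
        lgamma((nu + 1) / 2) - lgamma(nu / 2)
        - 0.5 * log(nu * pi) - log(sigma)
        - (nu + 1) / 2 * log(1 + z * z / nu)
    )

# A 10-place surprise, with a hypothetical scale of 3 positions.
surprise = 10.0
print(round(normal_logpdf(surprise, sigma=3.0), 2))
print(round(student_t_logpdf(surprise, nu=3.5, sigma=3.0), 2))
```

The Normal penalty grows with the square of the surprise, while the Student-t penalty grows only logarithmically, so a single chaotic race pulls far less on the estimated effects.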

Why Bayesian?

A standard regression model gives you one number per driver — a point estimate. A Bayesian model gives you a probability distribution — a full range of plausible skill levels, each with an associated probability. This distinction matters for two reasons.

First, it handles uncertainty honestly. Lewis Hamilton has 243 races in this dataset. The model has a lot of data, and its estimate of his skill is tight. A rookie with 21 races gets a much wider range — the model is saying “we don’t have enough data to be confident yet.” A traditional model treats those two estimates as equally reliable. This one does not.

Second, it enables statements that are simply not possible with conventional rankings. Instead of saying “Hamilton is ranked 1st,” the model says: “There is an 85.2% probability that Hamilton’s true driver effect is better than Verstappen’s on this metric.” That is a fundamentally more honest and more useful claim.
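A pairwise probability like this is typically read straight off the posterior draws: count the share of draws in which one driver's effect is lower (better) than the other's. A minimal sketch, using simulated draws as a stand-in for real model output (the means and spreads below are hypothetical):

```python
import random

def prob_better(draws_a, draws_b):
    """Share of paired posterior draws in which A's effect is lower (better) than B's."""
    pairs = list(zip(draws_a, draws_b))
    return sum(a < b for a, b in pairs) / len(pairs)

# Hypothetical posterior draws; a fitted model would supply the real ones.
rng = random.Random(0)
driver_a = [rng.gauss(-1.5, 0.3) for _ in range(4000)]
driver_b = [rng.gauss(-1.1, 0.3) for _ in range(4000)]
print(f"P(A better than B) = {prob_better(driver_a, driver_b):.1%}")
```

Pairing draws by index matters when the draws come from a joint posterior, because the two drivers' effects are estimated together and may be correlated.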

Key Modeling Decisions

Every model involves choices. Here are the ones that shaped this analysis and why they were made.

Season-level team effects, not career-level. Red Bull in 2023 — when they won 21 of 22 races — is a fundamentally different entity from Red Bull in 2019. Treating them as one static coefficient would absorb that variance in the wrong place. Season-level indexing lets the model accurately track each constructor’s year-by-year arc.

Zero-sum constraints on both effect sets. Both Driver Effects and TeamSeason Effects are constrained to sum to zero across their respective groups. Without this, the two sets of effects could drift against each other (a constant added to every driver and subtracted from every team fits the data equally well), which would render interpretation meaningless. Every driver comparison is expressed relative to the grid average driver. Every team comparison is relative to the average constructor-season.
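The drift problem is easy to demonstrate with toy numbers. In the sketch below all effects are hypothetical, and centering is applied after the fact purely for illustration; a Bayesian model would normally impose the constraint during fitting.

```python
from statistics import mean

def center(effects):
    """Shift a dict of effects so they sum to zero (the mean moves to the intercept)."""
    m = mean(list(effects.values()))
    return {name: value - m for name, value in effects.items()}

# Hypothetical raw effects.
drivers = {"A": -1.0, "B": 0.5, "C": 2.0}
teams = {"X-2023": -0.5, "Y-2023": 1.5}

# Adding c to every driver and subtracting c from every team leaves each
# driver + team sum unchanged, so the data cannot tell the two versions apart.
c = 3.0
shifted_drivers = {name: value + c for name, value in drivers.items()}
shifted_teams = {name: value - c for name, value in teams.items()}
same_fit = abs(
    (drivers["A"] + teams["X-2023"]) - (shifted_drivers["A"] + shifted_teams["X-2023"])
) < 1e-12

# Centering removes that freedom: both versions map to the same driver effects.
print(same_fit, center(drivers), center(shifted_drivers))
```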

DNF classification via manual audit. FastF1 provides raw status strings for every retirement. These are inconsistent across 11 seasons of data and sometimes ambiguous. Every status string was manually reviewed and assigned to one of four categories: finished, driver-fault, mechanical, or ambiguous. Driver-fault retirements (crashes, spins, collisions) carry an explicit penalty in the model. Mechanical retirements do not — a driver should not be penalized for their engine failing. The 140 rows labeled simply “Retired” with no further detail were excluded rather than misclassified. Losing 3% of the data is preferable to introducing systematic misattribution.

20-race minimum for the final rankings. The model includes all drivers in the fitting process, but the final rankings plot only shows drivers with at least 20 race starts. This decision was validated during development: Franco Colapinto ranked in the top 5 in early model runs after just 6 races (mean residual −1.94), but settled to average after 23 races (mean residual +1.31). Small samples produce wide, unstable estimates. The 20-race threshold keeps the headline output honest. Drivers near that threshold — rookies or those with short careers — will carry wider uncertainty bands than veterans. Treat their rankings as directional, not definitive.

The Findings: Driver Rankings


Each bar represents a driver’s posterior distribution: the range of plausible values for their true driver effect, given all the data the model has seen. The dot is the median estimate. The thick inner bar is the 50% credible interval, the range where the model believes the true effect most likely lies. The thin outer line is the 94% credible interval, meaning the model assigns a 94% probability to the true value falling within that range. Drivers on the left side of zero consistently outperformed their grid slot. Drivers on the right consistently underperformed.
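For readers who want to see how such a bar is built from posterior draws, here is a sketch using equal-tailed percentile intervals. That choice is an assumption: the plot may instead use highest-density intervals, which can differ slightly for skewed posteriors. The draws are simulated and hypothetical.

```python
import random
from statistics import median, quantiles

def credible_interval(draws, mass):
    """Equal-tailed interval containing `mass` of the draws (e.g. 0.94)."""
    cuts = quantiles(draws, n=1000, method="inclusive")  # 999 cut points
    tail = (1 - mass) / 2
    lower = cuts[round(tail * 1000) - 1]
    upper = cuts[round((1 - tail) * 1000) - 1]
    return lower, upper

# Hypothetical posterior draws for one driver's effect.
rng = random.Random(1)
draws = [rng.gauss(-1.5, 0.3) for _ in range(4000)]

inner = credible_interval(draws, 0.50)   # thick bar
outer = credible_interval(draws, 0.94)   # thin line
print(round(median(draws), 2), inner, outer)
```

Fewer races mean a wider posterior, which is why drivers near the 20-race cutoff get bars that span much of the chart.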

Notice how interval widths vary. Hamilton and Verstappen — with 243 and 249 races respectively — have tight, narrow intervals. The model is confident about them. Drivers near the 20-race cutoff have wide intervals spanning much of the chart. Their rankings are real estimates but carry far more uncertainty.

Hamilton at the Top

Hamilton sits at the top with a driver effect of roughly −1.5 positions — meaning on average he finishes about one and a half places better than his starting grid position would historically predict. The model estimates an 85.2% probability that his true driver effect is better than Verstappen’s on this metric.

This result holds up to scrutiny. Hamilton’s career is not just a story of dominant machinery. He won his first championship in 2008 in a McLaren that was not the fastest car that year. His 2022 and 2023 seasons — when Mercedes produced an uncompetitive car — showed a driver still extracting everything available, just with less to work with.

The comparison with Alonso is even starker: 98.6% probability that Hamilton’s driver effect is better. This is the strongest pairwise probability statement in the dataset. Hamilton has significantly more data (243 races vs. 201 for Alonso) and has shown more consistency across varied car quality. Whether that fully settles the debate between them is for the reader to decide — but it is not a number to dismiss lightly.

The Verstappen Paradox

The model ranks Sergio Perez 2nd and Max Verstappen 3rd. The model gives only a 21.4% probability that Verstappen’s driver effect is better than Perez’s. To anyone who watches F1, this immediately raises a flag — Verstappen has dominated Perez comprehensively across every real-world metric.

This result is not a modeling error. It is the most important finding in the project and reveals a fundamental truth about what this metric actually measures.

We are not measuring “who is the fastest.” We are measuring “who adds the most value relative to their starting position.”

Return to the real 2023 examples from earlier:

Verstappen, Bahrain: Starts P1, finishes P1. Residual: −0.62. The model sees a small outperformance.

Perez, Australia: Starts P20, finishes P5. Residual: −9.33. The model sees a massive outperformance.

Both results came in the same car: the 2023 Red Bull, whose TeamSeason Effect of −1.21 worked in both drivers’ favor. After accounting for the car, Perez’s Driver Effect contribution in Australia is enormous. Verstappen’s in Bahrain is modest.

This is not an isolated example — it reflects a well-known pattern in Perez’s career. He frequently qualifies below where his race pace suggests he should be, then recovers positions on Sunday. That pattern is exactly what this metric rewards. Verstappen’s dominance in 2022–2024 involved many weekends of pole-to-win perfection. And perfection from the front generates near-zero “Value Added on Sunday.”

To show the model is not blind to Verstappen’s talent: in Round 2 of 2023 (Saudi Arabia), he started P15 after a qualifying problem and finished P2. His residual that weekend was −8.69 — one of the strongest single-race Driver Effect contributions in the dataset. The model sees and rewards that performance. It simply cannot reward the many weekends where he started first and finished first.
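
The three residuals quoted in this section can be reproduced with one subtraction each, using the 2023 expected-finish values from the benchmark step (P1: 1.62, P15: 10.69, P20: 14.33). The sketch below is only that arithmetic. The dictionary and function names are hypothetical.

```python
# 2023 expected finish by grid slot, as quoted in the article's benchmark table.
EXPECTED_FINISH_2023 = {1: 1.62, 8: 8.76, 15: 10.69, 20: 14.33}

def residual(grid, finish, expected=EXPECTED_FINISH_2023):
    """Actual finish minus expected finish for the grid slot (negative = better)."""
    return round(finish - expected[grid], 2)

races = [
    ("Verstappen, Bahrain", 1, 1),
    ("Perez, Australia", 20, 5),
    ("Verstappen, Saudi Arabia", 15, 2),
]
for label, grid, finish in races:
    print(f"{label}: residual {residual(grid, finish):+.2f}")
```

A win from pole can never score better than −0.62 under this benchmark, while a recovery drive from the back can score below −9.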

This is a structural property of residual-based ranking, not a flaw. Any metric built around finishing relative to starting position will have this characteristic. It is fully acknowledged — and is the primary motivation for the Version 2 model described at the end of this article.

The Interesting Middle

Alexander Albon at 4th is the most interesting result for anyone who follows F1 closely. Albon is not a name that appears in most GOAT debates. But across 114 races — primarily in the Williams, one of the slowest cars on the grid — he has consistently finished better than his grid slot expected. The model is picking up something real: Albon is widely regarded within the paddock as a driver who maximizes an underperforming car, and the numbers confirm it.

Alonso vs. Ricciardo (42.2% probability Alonso is better) is one of the most honest results in the dataset. These two drivers have been compared by fans for years — similar eras, similar peaks, both regarded as elite talents who perhaps never got the machinery to prove it definitively. The model’s answer: they sit within each other’s 50% credible intervals. Statistically indistinguishable on this metric across their careers. Whether that satisfies either side of the debate is another matter.

Norris vs. Leclerc (45.9%) tells a similar story for the current generation. The two most hotly debated young talents in F1 right now, and the model cannot separate them with any confidence. Both are elite. Any ranking between them at this point is noise.

The Bottom Tier

Kubica, Sargeant, Mazepin, and Hartley cluster at the bottom — consistent underperformance relative to their grid slots across their careers. This broadly aligns with the F1 community’s consensus view on these drivers, which is a useful sanity check: a model that produced nonsensical results at the bottom would be hard to trust at the top.

The Findings: Constructor Dominance

These team-season effects are exactly what the model subtracted out to isolate driver skill. They represent how much each constructor’s car finished better or worse than the qualifying position alone would predict, season by season.

Each cell shows the posterior mean TeamSeason Effect for a constructor in a given year. Blue means the car finished better than expected relative to its qualifying position. Red means worse.

A few stories jump out immediately:

Mercedes entered the Hybrid Era in 2014 with a −1.45 effect and sustained dominance through most of the decade, peaking at −2.11 in 2022 — the darkest blue on the chart. Their fade to +0.30 in 2025 tells the story of a team that built its advantage around a specific set of technical regulations and lost that edge when the rules changed.

Red Bull shows two distinct peaks: the 2014–2017 era and the 2021–2023 era, with 2022 (−1.93) representing the apex of their dominance. Their 2024 and 2025 values suggest the competitive window is closing.

Williams from 2018 onward is a sustained red streak — the most visually obvious decline arc in the dataset.

McLaren’s arc from −0.63 in 2014 through the difficult 2017–2020 period and partial recovery is visible across the row — a team that lost its way and slowly found it again.

Haas 2023 (+2.90) is the single worst team-season in the dataset. Alpine 2025 (+2.56) is a close second.

One result worth flagging directly: Ferrari appears to outperform McLaren in 2025 on this metric, despite McLaren being widely regarded as the faster car that season. This is the same structural property that produces the Verstappen Paradox, now showing up at the constructor level. McLaren qualified on average P3 in 2025 and finished P3 — the model sees no surprise. Ferrari qualified P7 on average and finished P6 — the model rewards that consistent outperformance of grid position. By absolute pace, McLaren had the faster car. By this metric, Ferrari extracted more value than their qualifying position would suggest. The heatmap measures constructor value added relative to qualifying, not raw speed — and that distinction matters when interpreting any result that seems counterintuitive.

Validation

While these rankings will spark debate, their value depends entirely on whether the model actually works — so here is how it was tested.

Convergence. After fitting, the model runs a diagnostic called R-hat, which measures whether the simulation reached a stable, repeatable consensus. A value near 1.0 is the gold standard, meaning all simulation chains found the same answer independently. All parameters in this model returned R-hat values of approximately 1.0.
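
To make the idea concrete, here is a sketch of the classic Gelman-Rubin form of R-hat, which compares the variance between chains with the variance within them. This is a simplified version for illustration. Modern libraries use a rank-normalized split variant, so their numbers will not match this function exactly. The chains below are synthetic.

```python
import numpy as np

def r_hat(chains):
    """Classic Gelman-Rubin statistic for an array shaped (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()       # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)    # B: scaled variance of chain means
    pooled = (n - 1) / n * within + between / n      # pooled estimate of posterior variance
    return float(np.sqrt(pooled / within))

rng = np.random.default_rng(2)
mixed = rng.normal(0.0, 1.0, size=(4, 1000))             # four chains, same target
stuck = mixed + np.array([[0.0], [0.0], [0.0], [3.0]])   # one chain stuck elsewhere

print(f"well-mixed chains: {r_hat(mixed):.3f}")
print(f"one stuck chain:   {r_hat(stuck):.3f}")
```

When every chain explores the same region the ratio sits near 1.0. A chain that settles somewhere else inflates the between-chain term and pushes the value well above it.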

Posterior Predictive Check. This test asks: if the model truly understands how F1 race results are generated, can it simulate fake data that resembles real data?

The blue line shows the actual distribution of finish residuals across all 4,871 driver-race results in the dataset. The orange line shows residuals simulated from the fitted model by drawing from its learned parameters. If the model is well specified, the two lines should closely overlap. They do, which suggests the StudentT likelihood captures the shape of real F1 race outcomes, including the extreme values at both ends.

Holdout RMSE. The model was trained on 2014–2024 data and evaluated on the fully held-out 2025 season — a year with new regulations, new rookies, and new team compositions the model had never seen.

Model                                    RMSE
Bayesian Hierarchical Model              3.114
Naive baseline (predict residual = 0)    3.146

The model outperforms a naive baseline that simply predicts every driver finishes exactly where their grid slot expects. The margin is modest, about 1% (3.114 vs. 3.146), because F1 race outcomes contain substantial irreducible variance from safety cars, weather, and incidents that no model can predict. Even so, beating the baseline on a completely unseen season is an encouraging result.
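
For anyone repeating the comparison, the sketch below shows the calculation in miniature: RMSE for a set of model predictions against RMSE for the predict-zero baseline, plus the relative improvement implied by the two reported figures. The five residuals and predictions are made up for illustration. Only 3.114 and 3.146 come from the actual holdout test.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical held-out residuals and model predictions (illustration only).
actual = np.array([-2.0, 1.5, 0.0, 4.0, -3.5])
model_pred = np.array([-0.5, 0.3, -0.2, 0.6, -0.8])
naive_pred = np.zeros_like(actual)   # baseline: every residual is 0

print(f"model RMSE: {rmse(actual, model_pred):.3f}")
print(f"naive RMSE: {rmse(actual, naive_pred):.3f}")

# Relative improvement implied by the reported holdout figures.
improvement = (3.146 - 3.114) / 3.146
print(f"reported improvement: {improvement:.1%}")
```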

Limitations

Intellectual honesty about what a model cannot do is as important as what it can. These are not afterthoughts — they shaped decisions throughout the project.

The Verstappen Paradox. This metric measures value added relative to the starting position. Drivers who dominate from pole will naturally generate lower residuals than drivers who regularly recover from midfield. This is a structural property of the approach, fully acknowledged and discussed in the findings above.

Static driver effects. Each driver receives a single coefficient throughout their career. A 2014 Fernando Alonso and a 2024 Fernando Alonso are treated as draws from the same latent skill distribution. The model cannot capture the arc of a career — the development years, the peak, the decline.

Teammate quality. The TeamSeason Effect is estimated from the combined results of both drivers. A driver paired with a very weak teammate for several seasons may appear stronger than they are, because the car’s effect looks smaller when one driver is consistently underperforming it. This is not corrected for in the current model.

DNF classification complexity. Every retirement status string across 11 seasons was manually reviewed. The classification is rigorous, but edge cases exist — a collision caused by a mechanical failure mid-corner sits in an ambiguous grey area. The 140 rows labeled simply “Retired” were excluded rather than misclassified.

Season benchmark sparsity. The expected finish benchmark is computed within each season. Early rounds of a new season have fewer races to average over, making the benchmark noisier at the start of the year than at the end.

No circuit or weather effects. Monaco and Monza produce fundamentally different position-change profiles. Wet races introduce randomness that has nothing to do with driver skill. These are unmodeled variance sources that the StudentT absorbs rather than explicitly handles.

Sample-size fragility. Even with the 20-race minimum, drivers near that threshold carry meaningfully wider uncertainty bands than veterans. Treat the rankings of shorter-career drivers as directional rather than definitive.

Conclusion

So — after all this — who is the best F1 driver of the Hybrid Era?

On the specific metric this model uses (race performance relative to grid position expectation, controlling for car quality), Hamilton leads, with an 85.2% posterior probability that his driver effect is better than Verstappen’s. The model places Verstappen clearly in the elite tier, well separated from the midfield, but cannot reward his particular brand of dominance: perfection from the front generates almost no “Value Added on Sunday.”

The more honest answer is that “best driver” depends on what you are measuring. A model that rewards recovery and racecraft favors Hamilton. A model that also credits qualifying brilliance might tell a different story — which is exactly what the next version of this model aims to build.

What this project establishes clearly is that the uncertainty is real. Alonso vs. Ricciardo: statistically indistinguishable. Norris vs. Leclerc: a coin flip. These are not cop-outs — they are what the data actually says, and saying so is more useful than manufacturing false precision.

The one thing the model says with near certainty: the car matters enormously, and any ranking that ignores it is not really ranking drivers at all.

Where This Model Goes Next

The Verstappen Paradox is not just an interesting quirk — it points to a genuine gap in how this model defines driver skill. Fixing it properly requires rethinking the model structure, not just adding a variable.

The proposed Version 2 is a Dual-Path Latent Skill Model. Instead of one outcome variable, it uses two observation nodes that both draw from the same underlying driver skill parameter:
1. Qualifying delta — how much better did the driver qualify than the car’s theoretical position? A driver who puts a 10th-place car into the 6th spot on the grid is demonstrating real skill that the current model ignores entirely.
2. Race delta — how much did the driver move relative to their starting slot?
Both signals feed into one latent Driver Skill parameter. This means Verstappen qualifying 1st in a car that “belongs” at 3rd contributes a strong positive qualifying delta — giving him credit the current model cannot.
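
One way to picture two observation nodes sharing a single latent parameter is the simplest possible case: both deltas are noisy Gaussian readings of the same skill, with known noise levels and a flat prior. Under those assumptions the skill estimate is a precision-weighted average of the two signals. This is a toy sketch, not the Version 2 model. The function name, the noise levels, and the sign convention (positive = better for both deltas) are all assumptions made for illustration.

```python
import numpy as np

def combine_signals(qual_deltas, race_deltas, qual_sd, race_sd):
    """Precision-weighted estimate of one latent skill from two noisy signals."""
    q = np.asarray(qual_deltas, dtype=float)
    r = np.asarray(race_deltas, dtype=float)
    w_q = len(q) / qual_sd ** 2   # total precision of the qualifying signal
    w_r = len(r) / race_sd ** 2   # total precision of the race signal
    return float((w_q * q.mean() + w_r * r.mean()) / (w_q + w_r))

# A pole-to-win weekend pattern: qualifies 2 places above the car, then holds position.
front_runner = combine_signals([2.0, 2.0], [0.0, 0.0], qual_sd=1.0, race_sd=1.0)

# A recovery pattern: qualifies where the car belongs, then gains 2 places on Sunday.
recoverer = combine_signals([0.0, 0.0], [2.0, 2.0], qual_sd=1.0, race_sd=1.0)

print(front_runner, recoverer)
```

With equal noise on both signals, the two patterns receive the same credit, which is exactly what the single-path model cannot do.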

Version 2 would also introduce time-varying driver effects. Rather than one static career coefficient, a random walk prior would allow each driver’s estimated skill to evolve season by season — capturing what every F1 fan already knows: drivers develop, peak, and decline. A 2014 Alonso and a 2024 Alonso are not the same driver, and the next model will not treat them as if they are.
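
A random walk prior says that each season’s skill equals the previous season’s skill plus a small Gaussian step. The sketch below simulates one such trajectory to show what the prior allows. The starting value, step size, and function name are hypothetical, and a real model would infer the path from data rather than simulate it.

```python
import numpy as np

def random_walk_skill(n_seasons, start, step_sd, rng):
    """One simulated skill path: skill[t] = skill[t-1] + Normal(0, step_sd)."""
    steps = rng.normal(0.0, step_sd, size=n_seasons - 1)
    return start + np.concatenate(([0.0], np.cumsum(steps)))

rng = np.random.default_rng(3)
# Twelve seasons (2014-2025), hypothetical starting effect and step size.
skill = random_walk_skill(n_seasons=12, start=-1.0, step_sd=0.2, rng=rng)

for year, value in zip(range(2014, 2026), skill):
    print(year, f"{value:+.2f}")
```

A small `step_sd` keeps adjacent seasons close together, so the model can follow a gradual rise or decline without treating every season as a fresh driver.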

The goal is a model that can finally answer the question this one raises: not just who added the most value on race day, but who was the most complete driver — in qualifying, in the race, and across the arc of their career.

Built with FastF1, PyMC, ArviZ, pandas, and seaborn. Data covers the Formula 1 Hybrid Era, 2014–2025. Full methodology, code, and data pipeline available on GitHub.
April 18, 2026
Why Height Doesn’t Predict Velocity in Major League Baseball

Walk into any major league clubhouse and you’ll feel like you’ve entered a land of giants. Today’s MLB pitchers tower over the average American man, standing a full 4-5 inches taller at an average of 6’3″. But this wasn’t always the case. In the 1980s, the typical pitcher was closer to 6’0″, still above average, but not quite the commanding physical presence we see today.

This upward trend shows no signs of slowing. Each new crop of prospects seems to stretch a little higher, with 6’5″ and 6’6″ frames becoming increasingly common on major league rosters. Front offices have clearly bought into the idea that bigger is better, and on the surface that logic is sound. Just as a longer wrench provides more leverage, taller pitchers with longer arms should generate more torque and throw harder. Our eyes seem to confirm this when we watch someone like 6’10″ Randy Johnson unleash 100-mph fastballs with ease.

So I decided to confirm what seemed like baseball common sense by examining how pitcher height correlates with velocity.

The Puzzling Data

I analyzed every MLB pitcher who threw at least 50 innings during the 2024 season and plotted them according to their height and average fastball velocity. When I saw the results I thought I had done something wrong, but I hadn’t. The correlation coefficient really was -0.001, indicating essentially no linear relationship between height and velocity in the data.
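
For reference, the statistic in question is the Pearson correlation coefficient. The sketch below computes it by hand and checks it against NumPy’s built-in on a handful of invented height and velocity pairs. None of these numbers come from the 2024 dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    xc = np.asarray(x, dtype=float) - np.mean(x)
    yc = np.asarray(y, dtype=float) - np.mean(y)
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical pitchers: height in inches, average fastball velocity in mph.
heights = [70, 72, 74, 75, 76, 78, 80]
velocities = [94.1, 95.3, 93.2, 96.0, 92.8, 95.1, 93.9]

r = pearson_r(heights, velocities)
print(f"r = {r:.3f}")
print(f"numpy check = {np.corrcoef(heights, velocities)[0, 1]:.3f}")
```

A value near zero rules out a straight-line relationship in the sample. It does not, by itself, say anything about pitchers who never made it into the sample.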

A Look at a Study

Something clearly wasn’t adding up. To understand what was happening, I turned to recent academic research that has examined this exact phenomenon.

A comprehensive 2024 study published in the Orthopaedic Journal of Sports Medicine [1] provides a detailed analysis of how physical attributes translate to pitching velocity across competition levels. Using 46 reflective markers and 8 cameras capturing data at 480 Hz, the researchers tracked every micro-movement of 337 professional pitchers and 59 high school players.

The researchers’ regression models could explain an astounding 92.5% of velocity variation in high school pitchers using physical and biomechanical factors. At the professional level? That predictive power plummeted to just 53.6%. The physical advantages that completely dominate at lower levels become increasingly diluted as talent concentrates.

The study compared findings across youth through professional levels. In the youngest players, basic measures like age, height, and body mass index were strong velocity predictors. College studies found that weight remained predictive, suggesting that mass still matters because stronger bodies generate more power. Yet by the professional ranks, these fundamental physical relationships are no longer visible, because of the intense selection process it took to get there.

Selection Bias

The answer lies in understanding just how brutally selective Major League Baseball really is. Only about 0.5% of high school baseball players will ever play professionally at any level, and only a fraction of those make it to the majors. This creates a statistical phenomenon that economists and researchers call “selection bias”, where the filtering process itself changes what you observe in the final dataset.

Shorter pitchers face an uphill battle from day one. To overcome their physical disadvantage, they must develop exceptional skills elsewhere, such as devastating command, overpowering velocity through mechanical perfection, or unhittable secondary pitches. The shorter pitchers who eventually reach MLB represent the cream of the crop, the ones who found ways to excel despite their limitations.

MLB isn’t a random sample of all pitchers; it’s the survivors of an elimination tournament so intense that it fundamentally changes the population you’re studying. By the time you’re looking at MLB data, you’re seeing the end product of vastly different developmental paths, where the original physical advantages have been absorbed into the noise of elite-level talent optimization.
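
The filtering effect is easy to reproduce in a toy simulation. The sketch below builds a synthetic population where height genuinely helps velocity, keeps only the top 0.5% by velocity, and compares the correlation before and after. Every number here is an assumption chosen for illustration, and selecting on velocity alone is a simplification of how pitchers actually advance.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# Synthetic population: standardized height, and velocity that height truly helps.
height = rng.normal(0.0, 1.0, size=n)
velocity = 0.5 * height + rng.normal(0.0, 1.0, size=n)

# Keep only the top 0.5% by velocity, a stand-in for the professional filter.
cutoff = np.quantile(velocity, 0.995)
pros = velocity >= cutoff

r_all = float(np.corrcoef(height, velocity)[0, 1])
r_pros = float(np.corrcoef(height[pros], velocity[pros])[0, 1])

print(f"full population: r = {r_all:.2f}")
print(f"selected group:  r = {r_pros:.2f} (n = {int(pros.sum())})")
```

The relationship is built into every simulated pitcher, yet the correlation inside the selected group is far weaker than in the full population. Selection on a single trait does not drive it all the way to zero here. Compensating skills among shorter pitchers, as described above, would push it lower still.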

What This Means

This trend will likely continue. Pitchers will probably keep getting taller on average, while the correlation between height and velocity among big leaguers stays near zero. We’re witnessing how ultra-elite competition obscures natural physical advantages in the statistical record.

The broader lesson is crucial for sports analytics. Sometimes the absence of a correlation reveals more about the selection process than the underlying relationships. Physics works exactly as expected, but MLB’s selection bias masks these fundamental relationships in the data we can observe.

References

[1] Manzi JE, Dowling B, Wang Z, Sudah SY, Moran J, Chen FR, Estrada JA, Nicholson A, Ciccotti MC, Ruzbarsky JJ, Dines JS. Kinematic Modeling of Pitch Velocity in High School and Professional Baseball Pitchers: Comparisons With the Literature. Orthop J Sports Med. 2024 Aug 13;12(8):23259671241262730. doi: 10.1177/23259671241262730.

May 13, 2025

