The Question Behind Every Championship

There is a debate that has run through Formula 1 for decades: how much of a champion's success belongs to the driver, and how much to the car?

The most common version targets Lewis Hamilton. Seven world championships and 105 race wins, but was he truly the greatest driver of his generation, or simply the beneficiary of the most dominant car in the sport's history? Fans of Fernando Alonso, in particular, have argued for years that with the same machinery, the story might have looked very different.

You cannot answer this by looking at championship standings. In Formula 1, the car is the single biggest variable, so two drivers on the same grid are not competing on equal terms. They never are.

This project answers the question properly, using a Bayesian hierarchical model on 12 seasons of race data (2014–2025) to isolate each driver's contribution after controlling for the car they drove.

The Answer

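[Figure: F1 Driver Rankings, Hybrid Era (2014–2025). 46 drivers with 20+ races. Horizontal axis: driver effect, −2.0 to 2.0, negative = better. Each driver is shown with a median, a 50% HDI, and a 94% HDI.]
Drivers in ranked order: HAM, PER, VER, ALB, LEC, VET, NOR, ROS, MAS, RIC, HUL, BEA, SAI, BUT, ALO, VAN, COL, RAI, RUS, PAL, BOT, KVY, WEH, GRO, STR, GAS, PIA, LAW, NAS, OCO, ZHO, HAD, GUT, MAL, MAG, MSC, ANT, ERI, GIO, LAT, SIR, TSU, HAR, MAZ, SAR, KUB.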

Each bar is one driver's range of plausible "value added" — how much better or worse they finished than their starting grid slot would historically predict, after accounting for the car they drove. The dot is the median estimate. The thick inner bar is the 50% credible interval; the thin line is 94%. Drivers on the left consistently outperform their grid slot; drivers on the right consistently underperform.

Hamilton sits at the top. The model gives him an 85.2% probability of having a higher driver effect than Verstappen. Compared to Alonso, the probability climbs to 98.6% — the strongest pairwise statement in the dataset.

But the result that makes this model interesting sits directly below Hamilton: Perez ranks second, one place above Verstappen. Yes, that Perez. Yes, in the same Red Bull. The rest of this article is why.

The Verstappen Paradox

The model gives only a 21.4% probability that Verstappen's driver effect is better than Perez's. To anyone who watches F1, this immediately raises a flag — Verstappen has dominated Perez comprehensively across every real-world metric.

This result is not a modeling error. It is the most important finding in the project and reveals what this metric actually measures.

We are not measuring "who is the fastest." We are measuring "who adds the most value relative to their starting position."

Two real 2023 examples make the difference concrete:

Verstappen, Bahrain: Starts P1, finishes P1. The 2023 benchmark for starting on pole is 1.62 — what a typical driver in a typical car historically does from that grid slot that season. His residual: 1 − 1.62 = −0.62. A small outperformance.

Perez, Australia: Starts P20, finishes P5. The benchmark for starting last in 2023 is 14.33 (a lot of cars retire). His residual: 5 − 14.33 = −9.33. A massive outperformance.

Both drove the same Red Bull. Both got the same team-season effect of −1.21 working in their favor. After accounting for the car, Perez's "Value Added on Sunday" in Australia is enormous. Verstappen's in Bahrain is modest.

This isn't an isolated example. Perez frequently qualifies below where his race pace suggests, then recovers positions on Sunday. That's exactly what this metric rewards. Verstappen's dominance in 2022–2024 involved many weekends of pole-to-win perfection — and perfection from the front generates near-zero residual.

To prove the model isn't blind to Verstappen's talent: Round 2 of 2023 (Saudi Arabia), he started P15 after a qualifying problem and finished P2. His residual: −8.69 — one of the strongest single-race contributions in the dataset. The model sees and rewards that performance. It simply cannot reward weekends where he started first and finished first.

This is a structural property of residual-based ranking, not a flaw. It is fully acknowledged — and is the primary motivation for the Version 2 model described at the end of this article.

Other Notable Results

Alexander Albon at 4th is the most interesting non-headliner. Albon doesn't appear in GOAT debates, but across 114 races, mostly in the Williams, one of the slowest cars on the grid, he consistently finished better than his grid slot predicted. The model picks up something real: he's regarded within the paddock as a driver who gets the most out of uncompetitive cars.

Alonso vs. Ricciardo (42.2% probability Alonso is better) is one of the most honest results in the dataset. These two have been compared by fans for years — similar eras, similar peaks. The model places them inside each other's 50% credible intervals. Statistically indistinguishable.

Norris vs. Leclerc (45.9%) tells the same story for the current generation. Both elite. Any ranking between them at this point is noise.

The bottom tier — Kubica, Sargeant, Mazepin, Hartley — clusters as consistent underperformers. This broadly matches the F1 community's consensus view, which is a useful sanity check: a model that produced nonsensical results at the bottom would be hard to trust at the top.

How the Model Works

The core idea: stop asking "where did the driver finish?" and start asking "did the driver finish better or worse than expected, given where they started?" That gap — finish minus benchmark — is the residual. A negative residual means the driver overperformed.

The benchmark is computed within each season, not pooled across years. P7 in 2014 is a fundamentally different competitive reality from P7 in 2023. The 2023 benchmarks for a few key grid slots:

| Starting Position | 2023 Expected Finish |
|---|---|
| P1 (1st) | 1.62 |
| P8 (8th) | 8.76 |
| P15 (15th) | 10.69 |
| P20 (20th) | 14.33 |

Notice P20 expects to finish around P14 — a meaningful share of cars retire or get passed during a race, so starting last doesn't mean finishing last.
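A minimal sketch of the season-level benchmark, assuming it is the mean finishing position of every car that started from a given grid slot in a given season. The article does not spell out the exact aggregation, so the mean, the `(season, grid, finish)` row layout, the function names, and the toy numbers are all illustrative assumptions rather than the project's code.

```python
from collections import defaultdict

def season_benchmarks(rows):
    """Mean finishing position per (season, grid slot).

    rows: iterable of (season, grid, finish) tuples (hypothetical layout).
    """
    totals = defaultdict(lambda: [0.0, 0])  # (season, grid) -> [sum of finishes, count]
    for season, grid, finish in rows:
        acc = totals[(season, grid)]
        acc[0] += finish
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

def finish_residual(season, grid, finish, benchmarks):
    """Finish minus benchmark; negative means better than expected."""
    return finish - benchmarks[(season, grid)]

# Toy data, not real results: three P1 starts in one season, two in another.
rows = [(2023, 1, 1), (2023, 1, 1), (2023, 1, 4), (2014, 1, 2), (2014, 1, 6)]
bench = season_benchmarks(rows)

print(bench[(2023, 1)], bench[(2014, 1)])   # same grid slot, different season benchmarks
print(finish_residual(2023, 1, 1, bench))   # negative: beat the 2023 benchmark
```

Keying the lookup on the season as well as the grid slot is what keeps a P7 start in 2014 from being judged against 2023 outcomes.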

But a raw residual still mixes two signals: the driver's contribution and the car's contribution. A dominant car starting P2 and finishing P1 generates a negative residual, but that's not the driver alone — the benchmark for P2 includes every midfield car that ever started there, so a fast car beats the benchmark just by being fast. The model adds a TeamSeason Effect — a per-constructor, per-season term — to absorb that car-side advantage. (In F1 the "team" or constructor designs and builds its own car, so this term is shorthand for the car's competitive level in a given year.) Only after subtracting the benchmark and the team-season effect does the remaining signal get attributed to the driver.
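As a rough worked example, the two Red Bull races quoted earlier in the article can be pushed through this subtraction by hand. The sketch assumes the intercept and the DNF term are zero and treats the −1.21 team-season effect as a fixed number, which the fitted model does not do; `car_adjusted` is an illustrative name, not something from the project.

```python
def car_adjusted(finish, benchmark, team_season_effect):
    """Residual left over after removing the benchmark and the car term.

    Simplification: ignores the intercept and the DNF term, and uses a
    point value for the team-season effect instead of its posterior.
    """
    residual = finish - benchmark          # finish minus grid expectation
    return residual - team_season_effect   # the part the car term does not explain

RED_BULL_2023 = -1.21  # team-season effect quoted in the article

perez_australia = car_adjusted(5, 14.33, RED_BULL_2023)
verstappen_bahrain = car_adjusted(1, 1.62, RED_BULL_2023)

print(round(perez_australia, 2))     # -8.12
print(round(verstappen_bahrain, 2))  # 0.59
```

Under these simplifications a pole-to-win drive in a dominant car leaves a slightly positive remainder, because the car term alone predicts a bigger gain than is possible from P1.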

The model

$$\text{finish\_residual} \sim \text{StudentT}(\nu, \mu, \sigma)$$

$$\mu = \alpha + \text{Driver Effect} + \text{TeamSeason Effect} + \beta_{\text{dnf}} \cdot \text{dnf\_driver\_fault}$$

The model is hierarchical — driver effects and team-season effects are estimated jointly, each informing the other. A driver's effect is estimated in the context of every team they drove for and every season available, not in isolation.

Why StudentT instead of a normal distribution? F1 race residuals have fat tails: extreme outcomes (finishing 10 places better than expected) happen more often than a bell curve would predict. Safety cars, first-lap chaos, and rain races all create these extremes. Under a Gaussian likelihood those few races would pull the driver and team estimates toward themselves; StudentT down-weights them instead. The model learned ν ≈ 3.5, confirming F1 results are genuinely heavier-tailed than normal.
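To make the fat-tail point concrete, here is a small sketch comparing how a Gaussian and a Student-t with ν = 3.5 score the same residuals. The scale of 3.0 is an arbitrary illustrative value, not a fitted parameter, and both log-densities are written out by hand rather than taken from the project.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log-density of a Normal(mu, sigma) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def student_t_logpdf(x, nu, mu, sigma):
    """Log-density of a location-scale Student-t with nu degrees of freedom."""
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

NU, MU, SIGMA = 3.5, 0.0, 3.0  # SIGMA is illustrative only

for residual in (0.0, -5.0, -15.0):
    gap = student_t_logpdf(residual, NU, MU, SIGMA) - normal_logpdf(residual, MU, SIGMA)
    # A large positive gap means the Student-t finds this residual far less surprising.
    print(residual, round(gap, 2))
```

The gap is small near zero and grows quickly in the tail. That is why a single 15-place swing moves Gaussian estimates much more than Student-t estimates.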

Why Bayesian

A standard regression gives one number per driver — a point estimate. A Bayesian model gives a probability distribution — a full range of plausible skill levels, each with an associated probability. Two things this enables:

  1. Honest uncertainty. Hamilton has 243 races; the model is confident. A rookie with 21 races gets a much wider posterior. A traditional ranking treats those estimates as equally reliable. This one doesn't.
  2. Pairwise probabilities. Instead of "Hamilton is 1st," the model says "85.2% probability Hamilton's true effect is better than Verstappen's." Fundamentally more useful.
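A pairwise probability of this kind is usually read straight off the posterior draws. The sketch below assumes the two drivers' draws are paired (same posterior sample index) and that lower means better, as in the article; the synthetic draws and the name `prob_better` are illustrative, not the project's output.

```python
import random

def prob_better(draws_a, draws_b):
    """Share of paired posterior draws where A's effect is lower (better) than B's."""
    if len(draws_a) != len(draws_b):
        raise ValueError("draws must be paired, one per posterior sample")
    wins = sum(a < b for a, b in zip(draws_a, draws_b))
    return wins / len(draws_a)

# Synthetic draws for illustration only; real draws would come from the fitted model.
rng = random.Random(0)
driver_a = [rng.gauss(-1.0, 0.4) for _ in range(4000)]
driver_b = [rng.gauss(-0.6, 0.4) for _ in range(4000)]

print(prob_better(driver_a, driver_b))
```

Comparing draw by draw, rather than comparing two posterior means, is what lets overlapping intervals turn into a statement like "85.2%" instead of a bare rank.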

Key Modeling Decisions

Season-level team effects, not career-level. Red Bull in 2023 (21 of 22 wins) is fundamentally a different car than Red Bull in 2019. Season-level indexing lets the model track each constructor's year-by-year arc accurately.

Zero-sum constraints on both effect sets. Driver Effects and TeamSeason Effects each sum to zero across their groups. Without this, the two effect sets can drift in opposite directions and become uninterpretable. Every driver comparison is now relative to the grid-average driver; every team comparison is relative to the average constructor-season.
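The drift problem can be seen with plain arithmetic. In this sketch, centering is applied after the fact to hypothetical unconstrained estimates; the article does not say how the constraint is imposed inside the sampler (PyMC's `ZeroSumNormal` is one option), so treat this as an illustration of the idea only.

```python
def center(effects):
    """Shift effects so they sum to zero; returns (centered, removed_mean)."""
    mean = sum(effects.values()) / len(effects)
    return {name: value - mean for name, value in effects.items()}, mean

# Hypothetical unconstrained estimates: the drivers carry an extra +5 and
# the cars an extra -5, which cancels in every predicted residual.
drivers = {"A": 4.2, "B": 5.0, "C": 5.8}
cars = {"X-2023": -6.0, "Y-2023": -4.0}

drivers_c, driver_shift = center(drivers)
cars_c, car_shift = center(cars)
alpha = driver_shift + car_shift  # the removed means fold into the intercept

before = drivers["A"] + cars["X-2023"]
after = alpha + drivers_c["A"] + cars_c["X-2023"]
print(drivers_c, cars_c, alpha)
print(before, after)  # same prediction, but the effects are now interpretable
```

The prediction is identical before and after, so the data alone cannot tell the two parameterizations apart. The constraint is what picks one.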

DNF classification via manual audit. FastF1's raw retirement status strings are inconsistent across the 12 seasons. Every string was manually reviewed and assigned to one of four categories: finished, driver-fault, mechanical, or ambiguous. Driver-fault retirements (crashes, collisions) carry an explicit penalty in the model. Mechanical failures don't, since a driver shouldn't be penalized for their engine failing. The 140 rows labeled simply "Retired" were excluded rather than guessed at.
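One way to encode an audit like this is an explicit lookup that refuses to guess. The status strings and their assignments below are assumptions chosen for illustration (including treating "Retired" as the ambiguous category); they are not the project's audited mapping, and the function names are hypothetical.

```python
# Illustrative lookup only: these strings and assignments are assumptions,
# not the project's audited mapping.
STATUS_CATEGORY = {
    "Finished": "finished",
    "+1 Lap": "finished",
    "Accident": "driver_fault",
    "Collision": "driver_fault",
    "Engine": "mechanical",
    "Gearbox": "mechanical",
    "Retired": "ambiguous",
}

def classify(status):
    """Return the audit category, failing loudly on strings nobody has reviewed."""
    try:
        return STATUS_CATEGORY[status.strip()]
    except KeyError:
        raise ValueError(f"unreviewed status string: {status!r}") from None

def dnf_driver_fault(status):
    """Indicator for the linear predictor: 1 for driver-fault retirements, else 0."""
    return 1 if classify(status) == "driver_fault" else 0

def keep_row(status):
    """Ambiguous rows are dropped rather than guessed at."""
    return classify(status) != "ambiguous"

print(dnf_driver_fault("Collision"), dnf_driver_fault("Engine"), keep_row("Retired"))
```

Raising on unknown strings means a new season's unfamiliar status cannot slip silently into a default category.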

20-race minimum for the final rankings. The model fits all drivers but only ranks those with ≥20 race starts. This was validated during development: Franco Colapinto ranked top-5 after 6 races (mean residual −1.94), then settled to average after 23 races (mean residual +1.31). Small samples produce wide, unstable estimates. The threshold keeps the headline output honest; drivers near the cutoff carry wider uncertainty bands than veterans.

Constructor Dominance

These team-season effects are exactly what the model subtracted out to isolate driver skill. They represent how much each constructor's car finished better or worse than qualifying position alone would predict, season by season.

[Figure: Constructor Performance, Hybrid Era (2014–2025). Heatmap of posterior mean TeamSeason Effect, one row per constructor and one column per season from 2014 to 2025. 123 team-seasons, 23 constructors. Color scale: better, zero, worse.]
Constructors shown: Alfa Romeo, Alfa Romeo Racing, AlphaTauri, Alpine, Aston Martin, Caterham, Ferrari, Force India, Haas F1 Team, Kick Sauber, Lotus F1, Manor Marussia, Marussia, McLaren, Mercedes, RB, Racing Bulls, Racing Point, Red Bull, Renault, Sauber, Toro Rosso, Williams.

Each cell shows the posterior mean TeamSeason Effect. Blue means the car finished better than expected relative to qualifying. Red means worse.

A few stories jump out:

Mercedes entered the Hybrid Era in 2014 with a −1.45 effect and sustained dominance through most of the decade, peaking at −2.11 in 2022. Their fade to +0.30 in 2025 tells the story of a team that built its advantage around a specific set of technical regulations and lost the edge when those rules changed.

Red Bull shows two distinct peaks — 2014–2017 and 2021–2023, with 2022 (−1.93) at the apex. Their 2024 and 2025 values suggest the window is closing.

Williams from 2018 onward is a sustained red streak — the most visually obvious decline arc in the dataset.

McLaren's arc from −0.63 in 2014 through the difficult 2015–2017 period and partial recovery is visible across the row: a team that lost its way and slowly found it.

Haas 2023 (+2.90) is the single worst team-season in the dataset. Alpine 2025 (+2.56) is a close second.

One result worth flagging: Ferrari appears to outperform McLaren in 2025 on this metric, despite McLaren being widely regarded as the faster car that season. This is the Verstappen Paradox again, now at the constructor level. McLaren qualified P3 on average and finished P3 — no surprise. Ferrari qualified P7 and finished P6 — consistent outperformance of grid position. By absolute pace, McLaren had the faster car. By this metric, Ferrari extracted more value than their qualifying suggested.

Validation

Convergence. R-hat measures whether the sampler reached a stable consensus across independent chains. A value near 1.0 is the gold standard. All parameters in this model returned R-hat ≈ 1.0.
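For readers who want to see what the diagnostic compares, here is the classic Gelman-Rubin form of R-hat for a single parameter. This is a simplified sketch: ArviZ reports a more robust rank-normalized split variant by default, and the synthetic chains below are illustrative, not draws from this model.

```python
import random
from statistics import mean, variance

def r_hat(chains):
    """Classic Gelman-Rubin R-hat for one parameter.

    chains: list of equal-length lists of draws, one list per chain.
    """
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    within = mean([variance(chain) for chain in chains])  # W: average within-chain variance
    between = n * variance(chain_means)                   # B: variance of chain means, scaled by n
    pooled = (n - 1) / n * within + between / n           # pooled estimate of posterior variance
    return (pooled / within) ** 0.5

rng = random.Random(1)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
stuck = mixed[:3] + [[x + 3.0 for x in mixed[3]]]  # one chain offset from the others

print(round(r_hat(mixed), 3), round(r_hat(stuck), 3))
```

Chains sampling the same distribution give a value close to 1. A chain that settled somewhere else inflates the between-chain term and pushes the value well above 1.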

Posterior Predictive Check. If the model truly understands how F1 race results are generated, it should be able to simulate fake data that resembles the real data.

[Figure: Posterior Predictive Check, Residual Distribution. 4,871 races, ν ≈ 3.50. Horizontal axis: finish residual, −20 to 20, negative = better than grid expectation. Vertical axis: density. Tails beyond ±20 are truncated (9 of 4,871 posterior predictive samples).]
Observed median: −0.57. Posterior median: −0.59. Observed SD: 4.18. Posterior SD: 4.20.

The cyan curve is the actual distribution of finish residuals across all 4,871 driver-race entries. The amber curve is residuals simulated from the fitted model's learned parameters. They closely overlap, which indicates the StudentT likelihood captures the shape of real F1 outcomes, including extremes at both ends.

Holdout RMSE. Trained on 2014–2024, evaluated on the fully held-out 2025 season — a year with new regulations, new rookies, and new team compositions the model had never seen.

| Model | RMSE |
|---|---|
| Bayesian Hierarchical Model | 3.114 |
| Naive baseline (predict residual = 0) | 3.146 |

The model outperforms a naive baseline that simply predicts every driver finishes where their grid slot expects. The margin is modest, about 1% (3.114 vs. 3.146), because F1 race outcomes contain substantial irreducible variance from safety cars, weather, and incidents no model can predict. Even so, beating the baseline on a completely unseen season is meaningful.
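The comparison itself is simple to reproduce once held-out predictions exist. The sketch below uses toy numbers, not the 2025 data, and assumes the model's prediction for each entry is a single point value such as a posterior mean.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Toy held-out residuals and model predictions, for illustration only.
observed = [-3.0, 1.0, 0.5, -6.0, 2.5]
model_pred = [-1.0, 0.4, 0.2, -1.5, 0.8]
naive_pred = [0.0] * len(observed)  # baseline: every driver matches the benchmark

print(rmse(observed, model_pred), rmse(observed, naive_pred))
```

Predicting zero for everyone is a tough baseline here, since the benchmark already encodes most of what the grid slot says about the finish.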

Limitations

Intellectual honesty about what a model can't do is as important as what it can.

The Verstappen Paradox. Discussed above. A structural property of any residual-based metric.

Static driver effects. Each driver gets a single coefficient across their entire career. A 2014 Alonso and a 2024 Alonso are treated as draws from the same skill distribution. The model can't capture the arc of a career.

Teammate quality confound. The TeamSeason Effect is estimated from both drivers' combined results. A driver paired with a weak teammate for multiple seasons may appear stronger than they are.

DNF classification edge cases. Every retirement string was manually reviewed but ambiguous cases exist — a collision caused by a mechanical failure sits in a grey area. 140 rows labeled simply "Retired" were excluded rather than guessed.

Season benchmark sparsity. Early rounds of a new season have fewer races to average over, making the benchmark noisier at season start.

No circuit or weather effects. Monaco and Monza produce fundamentally different position-change profiles. Wet races introduce randomness unrelated to skill. StudentT absorbs these rather than modeling them explicitly.

Sample-size fragility. Drivers near the 20-race minimum carry meaningfully wider uncertainty bands. Their rankings are directional, not definitive.

Conclusion

On the metric this model uses (race performance relative to grid expectation, controlling for car quality), Hamilton leads, with an 85.2% posterior probability that his driver effect is better than Verstappen's. Verstappen sits clearly in the elite tier, well separated from the midfield, but the model can't reward his particular brand of dominance: perfection from the front generates no "Value Added on Sunday."

The honest answer is that "best driver" depends on what you measure. A model that rewards recovery and racecraft favors Hamilton. A model that also credits qualifying brilliance would tell a different story — which is exactly what V2 aims to build.

The one thing this model says with near certainty: the car matters enormously, and any ranking that ignores it isn't really ranking drivers at all.

Where This Model Goes Next

The Verstappen Paradox isn't just a quirk — it points to a real gap in how the model defines driver skill. Fixing it properly requires rethinking the structure.

Version 2 is a Dual-Path Latent Skill Model: two observation nodes that both draw from one underlying driver skill parameter:

  1. Qualifying delta — how much better did the driver qualify than the car's theoretical position? A driver who puts a 10th-place car in 6th demonstrates real skill the current model ignores.
  2. Race delta — how much did they move relative to their starting slot? (The current metric.)

Both feed one latent Driver Skill parameter. Verstappen qualifying 1st in a car that "belongs" at 3rd now contributes a strong positive qualifying signal — giving him credit the current model can't.

V2 would also introduce time-varying driver effects via a random walk prior, allowing each driver's skill to evolve season by season. A 2014 Alonso and a 2024 Alonso are not the same driver, and V2 will stop treating them as if they are.

The goal: a model that answers not just who added the most value on race day, but who was the most complete driver — qualifying, racing, and across the arc of a career.


Built with FastF1, PyMC, ArviZ, pandas, and seaborn. Data covers the Formula 1 Hybrid Era (2014–2025). Full methodology and code live in the project repo.