How the Troposphere ranking works
A plain-language guide to the model, why XC and competition are treated differently, and how quality is verified.
This page explains what the live model does, its limitations, and how changes are validated. Internal constants are not disclosed exhaustively because they may change during re-validation. The methodology guide exists for transparency, not as permission to reproduce, reverse-engineer, or reimplement the system.
Troposphere does not just total flights or reward upload volume. It learns from repeated head-to-head evidence over time and publishes a unified ranking only after validation checks show that the probabilities remain believable.
Each pilot carries an estimated skill level and an uncertainty level. Strong results against quality opposition raise the estimate. Weak results lower it. Repeated evidence over time carries more weight than any single standout performance, and uncertainty matters nearly as much as the central estimate.
Competition results are cleaner because pilots fly identical tasks under controlled conditions. XC results vary more due to weather sensitivity, site choice, and route choice. The model uses both evidence types but maintains higher uncertainty around XC-heavy profiles and learns from competition evidence more decisively per event.
The board favors well-supported estimates over optimistic interpretations. A pilot with fewer events or a narrow opponent pool may look strong internally but is shown conservatively until the evidence base grows.
The system ingests validated flight evidence and competition results. Flights are linked to pilot identity, deduplicated, and resolved to one best-scored flight per pilot per day. When multiple sources report the same pilot-day, competition evidence takes priority.
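The pilot-day resolution step can be sketched in a few lines. This is an illustrative sketch only, not the production pipeline; the `Flight` record and `resolve_pilot_days` name are hypothetical, and the real system's scoring and identity resolution are more involved.

```python
from dataclasses import dataclass

@dataclass
class Flight:
    pilot_id: str
    day: str      # ISO date, e.g. "2024-05-01"
    score: float
    source: str   # "competition" or "xc"

def resolve_pilot_days(flights):
    """Keep one best-scored flight per (pilot, day).

    When multiple sources report the same pilot-day, competition
    evidence takes priority over XC; otherwise the higher score wins.
    """
    best = {}
    for f in flights:
        key = (f.pilot_id, f.day)
        cur = best.get(key)
        if cur is None:
            best[key] = f
        elif (f.source == "competition") != (cur.source == "competition"):
            # Exactly one of the two is competition evidence: it wins.
            if f.source == "competition":
                best[key] = f
        elif f.score > cur.score:
            best[key] = f
    return list(best.values())
```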
Each pilot maintains a skill estimate and an uncertainty estimate. Beating expectations raises the skill estimate; underperforming lowers it. The system updates both after every event using a ranking-likelihood model that considers all participants simultaneously.
Competition and XC use different learning rates, noise parameters, and uncertainty floors. Competition evidence is treated as more informative per event. XC evidence carries more noise, so uncertainty settles more slowly. Both contribute to the same pilot profile.
The system maintains separate internal reads for broad evidence, XC-heavy evidence, and competition-heavy evidence. These are confidence-weighted and blended into one published score, combining airmass-reading and cross-country-execution skill components.
Every model change undergoes held-out replay testing. The system checks whether predicted probabilities match actual outcomes across multiple validation regimes, including forward-time holdout, leave-one-competition-out, and cross-domain transfer. Changes are promoted only when probability quality improves without degrading ranking discrimination.
Competition places many pilots on the same task on the same day under comparable conditions. If pilot A repeatedly finishes ahead of pilot B on competition tasks, that says something reliable about relative strength.
XC is different. Pilots choose their own day, site, route, and timing. Two pilots who never fly the same day in the same airmass are harder to compare. A single strong XC day may reflect great skill, a great day, or both, and the model cannot fully separate those effects from one event alone.
When the model says pilot A has a 70% chance of beating pilot B, that should happen roughly 70% of the time across many such predictions. The main quality metric is calibration slope: a slope near 1.0 means confidence matches reality. Below 1.0 means predictions are too extreme. Above 1.0 means they are too flat.
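Calibration slope can be estimated by refitting the outcomes against the logit of the stated probabilities and reading off the slope. The sketch below is the generic textbook procedure, not the system's internal tooling; the function name and the plain gradient-descent fit are illustrative choices.

```python
import math

def calibration_slope(probs, outcomes, lr=0.1, steps=2000):
    """Fit outcome ~ sigmoid(a + b * logit(p)) and return the slope b.

    b near 1.0: stated confidence matches reality.
    b below 1.0: predictions are too extreme (overconfident).
    b above 1.0: predictions are too flat (underconfident).
    """
    logits = [math.log(p / (1 - p)) for p in probs]
    a, b = 0.0, 1.0
    n = len(probs)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(logits, outcomes):
            p = 1 / (1 + math.exp(-(a + b * x)))
            ga += (p - y) / n
            gb += (p - y) * x / n
        a -= lr * ga
        b -= lr * gb
    return b
```

For example, a model that says 90% but wins only 60% of those pairings will fit a slope well below 1.0.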
Expected Calibration Error groups similar predictions into bins and measures the average gap between predicted and actual outcome rates. Lower is better because it means the displayed percentages are more believable.
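The binning-and-gap idea is simple enough to show directly. This is the standard definition of Expected Calibration Error, sketched with equal-width bins; the live system's bin count and weighting are not disclosed.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Group predictions into equal-width confidence bins and average
    the |predicted - observed| gap, weighted by bin size.
    Lower means the displayed percentages are more believable."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)
        avg_y = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_p - avg_y)
    return ece
```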
The concordance index checks whether the model still orders pilot pairs correctly. When it says A should beat B, does A actually win more often? This is the ranking-quality guardrail.
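A minimal concordance-index computation over pairwise predictions might look like this. It is the generic definition (fraction of comparable pairs ordered correctly, with ties counted as half), not the system's exact guardrail implementation.

```python
def concordance_index(probs, outcomes):
    """Among all pairs of predictions with different outcomes, the
    fraction where the actual winner received the higher probability.
    0.5 is random ordering; 1.0 is perfect discrimination."""
    concordant = ties = total = 0
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            if outcomes[i] == outcomes[j]:
                continue  # not comparable
            total += 1
            if probs[i] == probs[j]:
                ties += 1
            else:
                winner = i if probs[i] > probs[j] else j
                if outcomes[winner] == 1:
                    concordant += 1
    return (concordant + 0.5 * ties) / total
```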
After the core model produces raw probabilities, a monotonic correction layer can adjust probability output without changing who ranks above whom. This is used to improve displayed probability quality while leaving the leaderboard itself untouched.
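A standard way to build such a monotonic correction is isotonic regression via the pool-adjacent-violators algorithm. The sketch below shows the general technique; whether the live layer uses exactly this fit is not stated, and the function name is illustrative.

```python
def pav_calibrate(probs, outcomes):
    """Pool-adjacent-violators: fit a monotone non-decreasing map from
    raw probability to observed frequency. Because the map never
    reverses order, it can reshape displayed probabilities without
    changing who ranks above whom."""
    pairs = sorted(zip(probs, outcomes))
    blocks = []  # each block: [sum of outcomes, count]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Merge while a block's mean exceeds the next block's mean.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return [p for p, _ in pairs], fitted
```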
The system validates across multiple held-out regimes: rolling forward holdout, leave-one-competition-out, and cross-domain transfer. Changes are promoted only when these checks show improvement or stability.
The live model is a practical Bayesian ranking approximation. Skill updates use Plackett-Luce style ranking gradients. Uncertainty updates use Fisher-information-style precision updates with domain-specific dampening.
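The Plackett-Luce ranking gradient has a closed form worth showing. This is the textbook gradient of the log-likelihood of an observed finish order, not the live model's full update (which adds learning rates, domain dampening, and other undisclosed constants).

```python
import math

def plackett_luce_gradients(skills):
    """Skills listed in finish order (winner first). Returns the gradient
    of the Plackett-Luce log-likelihood w.r.t. each skill: positive means
    the pilot beat the model's expectation, negative means they
    underperformed. The gradients sum to zero."""
    n = len(skills)
    exps = [math.exp(s) for s in skills]
    # suffix[i] = sum of exp(skill) over positions i..n-1
    suffix = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + exps[i]
    grads = []
    for k in range(n):
        g = 1.0
        for i in range(k + 1):  # pilot k appears in stages 0..k
            g -= exps[k] / suffix[i]
        grads.append(g)
    return grads
```

With three equal-skill pilots, the winner gets the largest positive update and the last finisher the largest negative one, which matches the "beating expectations raises the estimate" description above.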
Pairwise probabilities combine skill gap, uncertainty, domain noise, and an extra competition-specific prediction-volatility term.
Each event adds information in precision space. Competition events can also apply a discount factor for correlated multi-task structure.
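The pairwise-probability and precision ideas above can be sketched with their textbook cores. The extra competition-specific volatility term and the multi-task discount are not disclosed, so they are omitted here; the parameter values and function names are illustrative.

```python
import math

def win_probability(skill_a, skill_b, sigma_a, sigma_b, domain_noise):
    """P(A beats B) under a Gaussian performance model: the skill gap
    is scaled by the combined spread of both pilots' uncertainty plus
    the domain's performance noise."""
    spread = math.sqrt(sigma_a**2 + sigma_b**2 + 2 * domain_noise**2)
    z = (skill_a - skill_b) / spread
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def precision_update(sigma, event_information, floor=0.3):
    """Each event adds information in precision (1/variance) space.
    An uncertainty floor (hypothetical value here) keeps estimates
    from freezing solid."""
    precision = 1 / sigma**2 + event_information
    return max(math.sqrt(1 / precision), floor)
```

Note how a noisier domain (larger `domain_noise`) pulls every probability toward 50%, which is exactly why XC-heavy comparisons stay less decisive.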
A monotonic output-calibration layer can adjust the raw probabilities while preserving ranking order.
The current Nations board no longer uses a positional top-five match. For each country and each depth \(d\), it builds a depth-level Gaussian from the top \(d\) pilots, compares countries only on depths both actually have, and then averages those dominance probabilities with harmonic tapering.
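A depth-level comparison of two countries reduces to comparing two Gaussians. The summary statistic below (mean and variance of the top \(d\) skills) is an assumption for illustration; the live model's exact depth statistic is not disclosed.

```python
import math

def depth_gaussian(pilot_skills, d):
    """Summarize a country at depth d as the mean and variance of its
    top-d pilot skills (an illustrative choice of statistic)."""
    top = sorted(pilot_skills, reverse=True)[:d]
    m = sum(top) / len(top)
    v = sum((s - m) ** 2 for s in top) / len(top)
    return m, v

def dominance_probability(mean_a, var_a, mean_b, var_b):
    """P(country A's depth-d strength exceeds country B's) when both
    are Gaussian. A tiny epsilon guards against zero variance at d=1."""
    z = (mean_a - mean_b) / math.sqrt(var_a + var_b + 1e-9)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))
```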
The raw round-robin RDI mean remains the ranking backbone. The public Nations score is then passed through a monotonic display rescaling so that the published 0-100 number is easier to read without changing country order. The uncertainty band is derived from the same depth-aware pairwise comparisons.
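Harmonic tapering simply means deeper depth levels get smaller weights. The 1, 1/2, 1/3, ... scheme below is the generic harmonic taper; the live weights are not disclosed, so treat this as a sketch of the idea that a nation's very best pilots count more than bench depth.

```python
def harmonic_tapered_mean(values_by_depth):
    """Average depth-level dominance probabilities with harmonic
    weights (1, 1/2, 1/3, ...), so shallow depths dominate."""
    weights = [1 / (d + 1) for d in range(len(values_by_depth))]
    return sum(w * v for w, v in zip(weights, values_by_depth)) / sum(weights)
```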