Public guide

How the Troposphere ranking works

A plain-language guide to the model, why XC and competition are treated differently, and how quality is verified.

At a glance: one public board; skill plus uncertainty; XC and competition handled differently; validated on held-out replay.
Transparency note

This page explains what the live model does, its limitations, and how changes are validated. Internal constants are not disclosed exhaustively because they may change during re-validation. The methodology guide exists for transparency, not as permission to reproduce, reverse-engineer, or reimplement the system.

What it is not

Not a points table

Troposphere does not just total flights or reward upload volume. It learns from repeated head-to-head evidence over time and publishes a unified ranking only after validation checks show that the probabilities remain believable.

Section 1

The short version

What the model tracks

It estimates skill, not just points

Each pilot carries an estimated skill level and an uncertainty level. Strong results against quality opposition raise the estimate. Weak results lower it. Repeated evidence over time carries more weight than any single standout performance, and uncertainty matters nearly as much as the central estimate.

Signal quality

XC and competition produce different signal quality

Competition results are cleaner because pilots fly identical tasks under controlled conditions. XC results vary more due to weather sensitivity, site choice, and route choice. The model uses both evidence types but maintains higher uncertainty around XC-heavy profiles and learns from competition evidence more decisively per event.

What the board shows

The published score is conservative

The board favors confident estimates over optimistic interpretations. A pilot with fewer events or a narrow opponent pool may appear strong internally but will be shown conservatively until the evidence base grows.

What users should expect

Competition-heavy pilots usually settle faster. XC-heavy pilots can climb strongly too, but they generally need more repeated evidence before the board becomes equally confident. Small score gaps should not be read as absolute truths; the model always operates under uncertainty.

Section 2

How the ranking works

1
Evidence intake

Flights and results are cleaned, linked, and grouped by pilot-day

The system ingests validated flight evidence and competition results. Flights are linked to pilot identity, deduplicated, and resolved to one best-scored flight per pilot per day. When multiple sources report the same pilot-day, competition evidence takes priority.
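The pilot-day resolution step can be sketched as follows. This is a minimal illustration, not the live pipeline: the `Flight` fields and the `"comp"`/`"xc"` source labels are assumed names, and the real ingest also handles identity linking and deduplication upstream.

```python
from dataclasses import dataclass

@dataclass
class Flight:
    pilot_id: str   # resolved pilot identity (assumed field names)
    day: str        # ISO date, e.g. "2024-05-01"
    score: float    # score from the upstream flight scorer
    source: str     # "comp" or "xc" (hypothetical labels)

def resolve_pilot_days(flights):
    """Keep one best-scored flight per (pilot, day); competition
    evidence takes priority over XC for the same pilot-day."""
    best = {}
    for f in flights:
        key = (f.pilot_id, f.day)
        cur = best.get(key)
        if cur is None:
            best[key] = f
        elif (f.source == "comp") != (cur.source == "comp"):
            # Exactly one of the two is competition evidence: comp wins.
            if f.source == "comp":
                best[key] = f
        elif f.score > cur.score:
            # Same evidence type: keep the higher-scored flight.
            best[key] = f
    return list(best.values())
```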

2
Skill estimation

Each pilot is tracked as skill estimate plus uncertainty

Each pilot maintains a skill estimate and an uncertainty estimate. Beating expectations raises the skill estimate; underperforming lowers it. The system updates both after every event using a ranking-likelihood model that considers all participants simultaneously.
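A ranking-likelihood update that considers all participants at once can be illustrated with a Plackett-Luce gradient: each finisher gets credit for "winning" their stage against everyone still behind them, minus their expected chance of doing so. This is a sketch of the general technique, not the production update rule or its constants.

```python
import math

def plackett_luce_gradients(skills):
    """Gradient of the Plackett-Luce log-likelihood for one observed
    ranking. `skills` lists current estimates in finish order
    (index 0 = winner). Positive gradient = beat expectations."""
    n = len(skills)
    grads = [0.0] * n
    for stage in range(n - 1):
        # Softmax over everyone still unplaced gives each remaining
        # pilot's predicted chance of finishing next.
        exps = [math.exp(s) for s in skills[stage:]]
        total = sum(exps)
        grads[stage] += 1.0  # observed winner of this stage
        for j, e in enumerate(exps):
            grads[stage + j] -= e / total  # expected win probability
    return grads
```

With equal prior skills the winner's gradient is positive and the last finisher's is negative, which is exactly "beating expectations raises the estimate; underperforming lowers it."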

3
Domain-aware logic

Competition and XC use different learning settings

Competition and XC use different learning rates, noise parameters, and uncertainty floors. Competition evidence is treated as more informative per event. XC evidence carries more noise, so uncertainty settles more slowly. Both contribute to the same pilot profile.
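The per-domain settings can be pictured as a small configuration table. The numbers below are purely illustrative; the live constants are deliberately not published.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DomainConfig:
    beta: float         # performance noise (lower = learn faster per event)
    info_scale: float   # how much precision one event adds
    sigma_floor: float  # uncertainty never settles below this

# Hypothetical values for illustration only.
DOMAINS = {
    "comp": DomainConfig(beta=2.0, info_scale=1.0, sigma_floor=0.5),
    "xc":   DomainConfig(beta=4.0, info_scale=0.5, sigma_floor=1.0),
}
```

The relationships, not the values, are the point: competition has lower noise and a higher information scale, so it is more decisive per event; XC keeps a higher uncertainty floor.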

4
Board build

Several internal reads are blended into one public board

The system maintains separate internal reads for broad evidence, XC-heavy evidence, and competition-heavy evidence. These are confidence-weighted and blended into one published score, combining airmass-reading and cross-country-execution skill components.
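A confidence-weighted blend of several internal reads can be sketched as inverse-variance weighting: reads the model is surer about count for more. The exact blend used live is not disclosed; this shows the general idea.

```python
def blend_reads(reads):
    """Blend internal reads into one score. Each read is a
    (mu, sigma) pair; weight = precision = 1 / sigma^2, so
    low-uncertainty reads dominate the published number."""
    weights = [1.0 / (sigma * sigma) for _, sigma in reads]
    total = sum(weights)
    return sum(w * mu for w, (mu, _) in zip(weights, reads)) / total
```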

5
Validation

Every model change is checked on held-out replay

Every model change undergoes held-out replay testing. The system checks whether predicted probabilities match actual outcomes across multiple validation regimes, including forward-time holdout, leave-one-competition-out, and recent cross-domain transfer. Changes are promoted only when probability quality improves without degrading ranking discrimination.

National teams use a separate aggregation layer

The pilot board still ranks individuals. The Nations board now uses the Roster Dominance Integral: it compares countries across shared real roster depth from the top pilot downward, skips depths a nation does not actually have, and then rescales the raw dominance mean into a cleaner public index.

Section 3

Why XC and competition are different

Competition signal

Competition is cleaner evidence per event

Competition places many pilots on the same task on the same day under comparable conditions. If pilot A repeatedly finishes ahead of pilot B on competition tasks, that says something reliable about relative strength.

XC signal

XC is broader and noisier by design

XC is different. Pilots choose their own day, site, route, and timing. Two pilots who never fly the same day in the same airmass are harder to compare. A single strong XC day may reflect great skill, a great day, or both, and the model cannot fully separate those effects from one event alone.

The system respects that difference instead of flattening it away

Competition events use a lower noise parameter, allowing the model to learn faster per event. XC events use a higher uncertainty floor, preventing the model from becoming overconfident too quickly. This does not mean XC is ignored; it means the system is more cautious about what any single XC event proves.

Section 4

How quality is verified

Calibration

Do displayed percentages match reality?

When the model says pilot A has a 70% chance of beating pilot B, that should happen roughly 70% of the time across many such predictions. The main quality metric is calibration slope: a slope near 1.0 means confidence matches reality. Below 1.0 means predictions are too extreme. Above 1.0 means they are too flat.
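Fitting a calibration slope can be done by regressing outcomes on the logit of the predicted probabilities. The sketch below uses plain gradient ascent on the log-likelihood of \(y \sim \sigma(a \cdot \operatorname{logit}(p))\); the step count and learning rate are arbitrary illustration choices.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def calibration_slope(probs, outcomes, steps=2000, lr=0.1):
    """Fit `a` in y ~ sigmoid(a * logit(p)) by gradient ascent.
    a near 1.0: confidence matches reality; below 1.0: predictions
    were too extreme; above 1.0: too flat."""
    a = 1.0
    zs = [logit(p) for p in probs]
    for _ in range(steps):
        grad = sum((y - sigmoid(a * z)) * z for z, y in zip(zs, outcomes))
        a += lr * grad / len(zs)
    return a
```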

ECE

How far off are probabilities on average?

Expected Calibration Error groups similar predictions into bins and measures the average gap between predicted and actual outcome rates. Lower is better because it means the displayed percentages are more believable.
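The standard binned form of Expected Calibration Error is short enough to show directly; this is the textbook metric, with the bin count as the only free choice.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bin predictions by confidence, then average the
    |predicted rate - actual rate| gap, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)
        avg_y = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_p - avg_y)
    return ece
```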

C-index

Did we break ranking discrimination?

The concordance index checks whether the model still orders pilot pairs correctly. When it says A should beat B, does A actually win more often? This is the ranking-quality guardrail.
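For binary win/loss outcomes the concordance index reduces to the familiar AUC form: the chance that a pair the model favoured more strongly actually produced the win. A minimal sketch:

```python
def concordance_index(preds, outcomes):
    """Fraction of (won, lost) prediction pairs where the winning
    case got the higher predicted probability; ties count half."""
    pos = [p for p, y in zip(preds, outcomes) if y == 1]
    neg = [p for p, y in zip(preds, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need both won and lost outcomes")
    score = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                score += 1.0
            elif pp == pn:
                score += 0.5
    return score / (len(pos) * len(neg))
```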

Post-hoc calibration

Probability output can be corrected without changing ranking order

After the core model produces raw probabilities, a monotonic correction layer can adjust probability output without changing who ranks above whom. This is used to improve displayed probability quality while leaving the leaderboard itself untouched.

Validation regimes

No single test is enough

The system validates across multiple held-out regimes: rolling forward holdout, leave-one-competition-out, and cross-domain transfer. Changes are promoted only when these checks show improvement or stability.

Section 5

Technical reference

Model family

Bayesian-style ranking with domain-aware dampening

The live model is a practical Bayesian ranking approximation. Skill updates use Plackett-Luce style ranking gradients. Uncertainty updates use Fisher-information-style precision updates with domain-specific dampening.

\[ \mu = \text{current skill estimate}, \qquad \sigma = \text{current uncertainty}, \qquad \beta = \text{domain-dependent performance noise} \]
In plain language: \(\mu\) is the model's current best estimate of level, \(\sigma\) is how unsure it still is, and \(\beta\) controls how noisy the domain itself is.
Probability model

Pairwise probability

Pairwise probabilities combine skill gap, uncertainty, domain noise, and an extra competition-specific prediction-volatility term.

\[ P(A \text{ beats } B) = \Phi \!\left( \frac{\mu_A - \mu_B} {\sqrt{\beta_{\text{domain}}^2 + \sigma_A^2 + \sigma_B^2 + 2v_{\text{comp}}^2}} \right) \]
Here \(\Phi\) is the standard normal CDF. Larger uncertainty or noisier domains flatten the probability toward 50%. For XC, \(v_{\text{comp}} = 0\).
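The pairwise formula translates directly into code; \(\Phi\) is computed from `math.erf`. Parameter values here are placeholders, not the live constants.

```python
import math

def win_probability(mu_a, sigma_a, mu_b, sigma_b, beta, v_comp=0.0):
    """P(A beats B) per the formula above; v_comp = 0 for XC."""
    denom = math.sqrt(beta ** 2 + sigma_a ** 2 + sigma_b ** 2
                      + 2.0 * v_comp ** 2)
    z = (mu_a - mu_b) / denom
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Note how extra uncertainty flattens the prediction: the same 5-point skill gap yields a probability closer to 50% when both sigmas are large.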
Uncertainty update

Precision grows as events add Fisher information

Each event adds information in precision space. Competition events can also apply a discount factor for correlated multi-task structure.

\[ \frac{1}{\sigma_{\text{new}}^2} = \frac{1}{\sigma_{\text{old}}^2} + \lambda \cdot \mathcal{I}_{\text{Fisher}} \cdot w_{\text{event}} \cdot d \]
\[ \qquad d = \frac{1}{1 + \rho (K - 1)} \]
\(\lambda\) is the domain information scale, \(w_{\text{event}}\) is the event weight, \(K\) is the task count, and \(\rho\) is the intra-event correlation parameter.
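The precision update is a one-liner in code; the parameter values below are illustrative only.

```python
import math

def updated_sigma(sigma_old, lam, fisher_info, w_event, K, rho):
    """Precision-space update with the correlated-task discount d
    from the formulas above. More correlated tasks (higher rho)
    shrink uncertainty less."""
    d = 1.0 / (1.0 + rho * (K - 1))
    precision = 1.0 / sigma_old ** 2 + lam * fisher_info * w_event * d
    return math.sqrt(1.0 / precision)
```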
Output calibration

Platt scaling adjusts probability quality without altering order

A monotonic output-calibration layer can adjust the raw probabilities while preserving ranking order.

\[ p_{\text{calibrated}} = \sigma \!\left( a \cdot \operatorname{logit}(p_{\text{raw}}) + b \right) \]
This correction is monotonic, so it changes probability quality without changing ranking order.
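The Platt layer itself is two parameters. Because the map is strictly increasing in the raw probability (for \(a > 0\)), any two pilots keep their relative order after correction.

```python
import math

def platt_calibrate(p_raw, a, b):
    """sigma(a * logit(p_raw) + b). Monotonic in p_raw for a > 0,
    so the induced ranking order is unchanged."""
    z = math.log(p_raw / (1.0 - p_raw))
    return 1.0 / (1.0 + math.exp(-(a * z + b)))
```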
National teams

Roster Dominance Integral compares countries across shared real depth

The current Nations board does not use a positional top-five match anymore. For each country and each depth \(d\), it builds a depth-level Gaussian from the top \(d\) pilots, compares countries only on depths both actually have, and then averages those dominance probabilities with harmonic tapering.

\[ F_N(d) \sim \mathcal{N} \!\left( \frac{1}{d}\sum_{i=1}^{d}\mu_i^N, \; \frac{1}{d^2}\sum_{i=1}^{d}\sigma_i^{2} \right) \]
\[ D_{A>B}(d) = \Phi \!\left( \frac{\bar{\mu}_A(d)-\bar{\mu}_B(d)} {\sqrt{\operatorname{Var}_A(d)+\operatorname{Var}_B(d)}} \right) \]
\[ \operatorname{RDI}(A,B) = \frac{\sum_{d=1}^{d^{*}} w_d \, D_{A>B}(d)} {\sum_{d=1}^{d^{*}} w_d}, \qquad w_d = \frac{1}{d}, \qquad d^{*} = \min(D,n_A,n_B) \]
In plain language: the Nations board compares how strong two countries look at depth 1, depth 2, depth 3, and so on, but only as far as both countries have real roster depth. Missing slots are skipped rather than padded with synthetic pilots.
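The depth-by-depth comparison above can be sketched directly from the three formulas. Roster inputs here are assumed to be `(mu, sigma)` pairs sorted strongest-first; the depth cap `D` is a placeholder.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rdi(roster_a, roster_b, depth_cap):
    """Roster Dominance Integral per the formulas above. Each roster
    is a list of (mu, sigma) sorted strongest-first; depths beyond
    the shorter roster are skipped, with harmonic weights w_d = 1/d."""
    d_star = min(depth_cap, len(roster_a), len(roster_b))
    num = den = 0.0
    for d in range(1, d_star + 1):
        mu_a = sum(m for m, _ in roster_a[:d]) / d
        var_a = sum(s * s for _, s in roster_a[:d]) / (d * d)
        mu_b = sum(m for m, _ in roster_b[:d]) / d
        var_b = sum(s * s for _, s in roster_b[:d]) / (d * d)
        w = 1.0 / d
        num += w * phi((mu_a - mu_b) / math.sqrt(var_a + var_b))
        den += w
    return num / den
```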
Public display layer

Nation strength keeps the raw order but uses a cleaner public scale

The raw round-robin RDI mean remains the ranking backbone. The public Nations score is then passed through a monotonic display rescaling so that the published 0-100 number is easier to read without changing country order. The uncertainty band is derived from the same depth-aware pairwise comparisons.

This means the public number is a presentation layer on top of the raw dominance mean. It does not alter which nation ranks above another.
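The key property of the display layer is that any strictly increasing map preserves country order. The rescale below is a purely hypothetical example (the real mapping and its constants are not published), included only to show why order cannot change.

```python
import math

def display_score(raw_mean):
    """Illustrative monotonic rescale of a raw dominance mean in
    (0, 1) to a 0-100 display value. The gain factor 6.0 is a
    made-up constant; any strictly increasing map would do."""
    stretched = 1.0 / (1.0 + math.exp(-6.0 * (raw_mean - 0.5)))
    return 100.0 * stretched
```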