How does it work?
Claims Reloaded provides a rigorous statistical framework for evaluating the robustness of performance claims
in machine learning benchmarks. This section introduces the foundational concepts and mathematical methodology
used to quantify the probability that a reported outperformance may be due to sampling variability rather than
genuine algorithmic superiority.
Understanding the definitions and methodologies is essential for interpreting the tool's output and making
informed assessments of published results. For a comprehensive description of the
methodological process and underlying assumptions, please refer to [1].
Definitions [1]
- Test set: The independent set of cases on which both methods A (reported winner) and B (second-ranked method) are evaluated. The size of the test set (n) strongly influences statistical certainty.
- Metric (DSC): The Dice Similarity Coefficient, commonly reported in segmentation benchmarks. Values range from 0 to 1.
- Method A (ranked first): The algorithm reported as the winner.
- Method B (ranked second): The algorithm reported as the next-best performing method.
- Model congruence: Measure of how similarly two methods perform across the same test cases. For segmentation tasks, it is measured as the correlation between the per-case DSC scores obtained by the two methods. For classification tasks, it is defined as the proportion of test samples correctly classified by both methods. High congruence means both methods tend to succeed or fail on the same cases.
- Posterior probability of B ≥ A: The Bayesian posterior probability that the true mean performance of method B is greater than or equal to that of method A, conditional on the observed mean scores and test set size. By definition, this probability cannot exceed 0.5, because method A is reported as superior.
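The two congruence measures defined above can be sketched in a few lines of Python; all per-case scores below are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical per-case DSC scores for two segmentation methods on n = 5 cases
dsc_a = np.array([0.91, 0.85, 0.60, 0.95, 0.78])
dsc_b = np.array([0.89, 0.83, 0.55, 0.96, 0.75])

# Segmentation congruence: Pearson correlation of the per-case DSC scores
r_ab = np.corrcoef(dsc_a, dsc_b)[0, 1]

# Hypothetical per-case classification outcomes (True = correct prediction)
correct_a = np.array([1, 1, 0, 1, 1], dtype=bool)
correct_b = np.array([1, 0, 0, 1, 1], dtype=bool)

# Classification congruence: proportion of cases both methods classify correctly
p_11 = np.mean(correct_a & correct_b)  # here 3 of 5 cases -> 0.6
```

In published benchmarks these per-case scores are usually not available, which is why the tool has to assume a congruence value rather than compute it.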
Methodology for Segmentation
The probability of a false outperformance claim for segmentation tasks is defined as:
\[
P(\theta_A \le \theta_B \mid \text{reported results})
\;=\;
P(\mu_A \le \mu_B \mid \hat{\mu}_A, \hat{\mu}_B)
\;=\;
t_{n-1}\!\left(
\sqrt{n}\,\frac{\hat{\mu}_B - \hat{\mu}_A}{
\sqrt{\,s_A^2 + s_B^2 - 2\,s_A s_B\, r_{AB}\,}
}
\right)
\]
- \(μ_A\), \(μ_B\): True mean DSC scores of methods A and B.
- \(μ̂_A\), \(μ̂_B\): Reported mean DSC scores of methods A and B.
- \(n\): Test set sample size.
- \(t_{n−1}\): The cumulative distribution function of the Student's t distribution with n − 1 degrees of freedom.
- \(s_A\), \(s_B\): The standard deviations of methods A and B, imputed from \(μ̂_A\) and \(μ̂_B\).
- \(r_{AB}\): Model congruence. The correlation between the performance of method A and method B.
As we do not have access to this value, we make an assumption based on previous experiments.
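The segmentation formula above is a direct evaluation of a Student-t CDF. A minimal sketch, assuming hypothetical reported means and an assumed (not observed) congruence value:

```python
import numpy as np
from scipy import stats

def false_claim_prob_seg(mu_a_hat, mu_b_hat, s_a, s_b, r_ab, n):
    """Probability that method B's true mean DSC is >= method A's.

    Evaluates the Student-t CDF with n - 1 degrees of freedom at the
    paired-difference statistic; s_a, s_b and r_ab are imputed/assumed
    quantities, as described in the text.
    """
    se = np.sqrt(s_a**2 + s_b**2 - 2 * s_a * s_b * r_ab)
    t_stat = np.sqrt(n) * (mu_b_hat - mu_a_hat) / se
    return stats.t.cdf(t_stat, df=n - 1)

# Hypothetical example: A leads B by 0.01 DSC on n = 50 test cases,
# with assumed standard deviations 0.10 and assumed congruence 0.8.
p = false_claim_prob_seg(0.85, 0.84, s_a=0.10, s_b=0.10, r_ab=0.8, n=50)
```

Because A is the reported winner (\(μ̂_A > μ̂_B\)), the statistic is negative and the resulting probability falls below 0.5; increasing n pushes it further toward 0.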
Methodology for Classification
The probability of false outperformance claims for classification tasks is defined as:
\[
P(\theta_A \le \theta_B \mid \text{reported results})
\;=\;
P(p_A \le p_B \mid \hat{p}_A, \hat{p}_B)
\;=\;
P(p_1 \le p_2 \mid \hat{p}_A, \hat{p}_B)
\;=\;
\int_0^1 \int_0^{p_2}
p\bigl(p_1, p_2 \mid \hat{p}_A, \hat{p}_B\bigr)\, dp_1\, dp_2
\]
- \(p_A\) (resp. \(p_B\)): The probability that a given test case is correctly classified by method A (resp. method B).
- \(p_1\) (resp. \(p_2\)): The probability that a given test case is correctly classified by method A (resp. method B) but not by method B (resp. method A).
- \(p(p_1, p_2 \mid \hat{p}_A, \hat{p}_B)\):
\[
p(p_1, p_2 \mid \hat{p}_A, \hat{p}_B)
= D(x_1 + 1,\; x_2 + 1,\; n - x_1 - x_2 + 2),
\]
where \(D\) is the Dirichlet distribution.
This distribution naturally arises as a conjugate prior of the multinomial distribution
which models the likelihood of the different proportions, such as \(p_1\) and \(p_2\).
- \(n\): Test set sample size.
- \(\hat{p}_A\) (resp. \(\hat{p}_B\)): The reported accuracy of method A (resp. method B).
- \(\hat{p}_{1,1}\): Model congruence. The proportion of cases where both methods made correct predictions.
As we do not have access to this value, we make an assumption based on previous experiments.
- \(x_1 = n \bigl(\hat{p}_A - \hat{p}_{1,1}\bigr)\)
- \(x_2 = n \bigl(\hat{p}_B - \hat{p}_{1,1}\bigr)\)
The integral above is not analytically tractable, so our methodology computes it using Monte Carlo sampling.
Specifically, one draws \(k\) samples of \((p_1, p_2)\) from the Dirichlet distribution and counts the
number of samples \(M\) for which \(p_1 \le p_2\). The probability of false claims is then approximated by:
\[
P(p_A \le p_B \mid \hat{p}_A, \hat{p}_B)
\;\approx\;
\frac{M}{k}.
\]
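The Monte Carlo procedure above can be sketched as follows; the reported accuracies and the congruence value \(\hat{p}_{1,1}\) are hypothetical inputs:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def false_claim_prob_clf(p_a_hat, p_b_hat, p_11_hat, n, k=100_000):
    """Monte Carlo estimate of P(p_A <= p_B) via the Dirichlet posterior."""
    x1 = n * (p_a_hat - p_11_hat)  # cases correct by A but not B
    x2 = n * (p_b_hat - p_11_hat)  # cases correct by B but not A
    # Dirichlet parameters as in the formula above
    samples = rng.dirichlet([x1 + 1, x2 + 1, n - x1 - x2 + 2], size=k)
    p1, p2 = samples[:, 0], samples[:, 1]
    # Fraction of draws where p_1 <= p_2, i.e. M / k
    return np.mean(p1 <= p2)

# Hypothetical reported accuracies with assumed congruence 0.80 on n = 100
p = false_claim_prob_clf(0.90, 0.88, p_11_hat=0.80, n=100)
```

Since sampling from a Dirichlet is cheap, \(k\) can be made large enough that the Monte Carlo error is negligible compared to the uncertainty from the assumed congruence.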
Interpretation
Because the winner (method A) is defined as the one with the higher reported mean performance,
this probability is always bounded by 0.5. Larger probabilities indicate more fragile claims,
whereas smaller probabilities indicate a more robust superiority of method A.
- A probability of false claims close to 0 means the reported winner is very likely to be truly better than the second-ranked method.
- Values under 0.05 are often considered “robust” claims, but this threshold is context-dependent.
- A probability close to 0.5 means the reported advantage is weak and may well be due to sampling noise.