How does it work?
Claims Reloaded provides a rigorous statistical framework for evaluating the robustness of performance claims
in machine learning benchmarks. This section introduces the foundational concepts and mathematical methodology
used to quantify the probability that a reported outperformance may be due to sampling variability rather than
genuine algorithmic superiority.
Understanding the definitions and methodologies is essential for interpreting the tool's output and making
informed assessments of published results. For a comprehensive description of the
methodological process and underlying assumptions, please refer to [1].
Definitions [1]
- Test set: The independent set of cases on which both methods A (reported winner) and B (second-ranked method) are evaluated. The size of the test set (n) strongly influences statistical certainty.
- Metric (DSC): The Dice Similarity Coefficient, commonly reported in segmentation benchmarks. Values range from 0 to 1.
- Method A (ranked first): The algorithm reported as the winner.
- Method B (ranked second): The algorithm reported as the next-best performing method.
- Model congruence: Measure of how similarly two methods perform across the same test cases. For segmentation tasks, it is measured as the correlation between the per-case DSC scores obtained by the two methods. For classification tasks, it is defined as the proportion of test samples correctly classified by both methods. High congruence means both methods tend to succeed or fail on the same cases.
- Posterior probability of B ≥ A: The Bayesian posterior probability that the true mean performance of method B is greater than or equal to that of method A, conditional on the observed mean scores and test set size. By definition, this probability cannot exceed 0.5, because method A is reported as superior.
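The two congruence measures defined above can be sketched in a few lines of Python; all per-case scores below are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical per-case DSC scores for two segmentation methods on n = 5 cases
dsc_a = np.array([0.91, 0.85, 0.60, 0.95, 0.78])
dsc_b = np.array([0.89, 0.83, 0.55, 0.96, 0.75])

# Segmentation congruence: Pearson correlation of the per-case DSC scores
r_ab = np.corrcoef(dsc_a, dsc_b)[0, 1]

# Hypothetical per-case classification outcomes (True = correct prediction)
correct_a = np.array([1, 1, 0, 1, 1], dtype=bool)
correct_b = np.array([1, 0, 0, 1, 1], dtype=bool)

# Classification congruence: proportion of cases both methods classify correctly
p_11 = np.mean(correct_a & correct_b)  # here 3 of 5 cases -> 0.6
```

In published benchmarks these per-case scores are usually not available, which is why the tool has to assume a congruence value rather than compute it.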
Methodology for Segmentation
The probability of a false outperformance claim for segmentation tasks is defined as:
\[
P(\theta_A \le \theta_B \mid \text{reported results})
\;=\;
P(\mu_A \le \mu_B \mid \hat{\mu}_A, \hat{\mu}_B)
\;=\;
t_{n-1}\!\left(
\sqrt{n}\,\frac{\hat{\mu}_B - \hat{\mu}_A}{
\sqrt{\,s_A^2 + s_B^2 - 2\,s_A s_B\, r_{AB}\,}
}
\right)
\]
- \(μ_A\), \(μ_B\): True mean DSC scores of methods A and B.
- \(μ̂_A\), \(μ̂_B\): Reported mean DSC scores of methods A and B.
- \(n\): Test set sample size.
- \(t_{n−1}\): The cumulative distribution function of the Student's t distribution with n − 1 degrees of freedom.
- \(s_A\), \(s_B\): The standard deviations of methods A and B, imputed from \(μ̂_A\) and \(μ̂_B\).
- \(r_{AB}\): Model congruence. The correlation between the performance of method A and method B.
As we do not have access to this value, we make an assumption based on previous experiments.
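The segmentation formula above is a direct evaluation of a Student-t CDF. A minimal sketch, assuming hypothetical reported means and an assumed (not observed) congruence value:

```python
import numpy as np
from scipy import stats

def false_claim_prob_seg(mu_a_hat, mu_b_hat, s_a, s_b, r_ab, n):
    """Probability that method B's true mean DSC is >= method A's.

    Evaluates the Student-t CDF with n - 1 degrees of freedom at the
    paired-difference statistic; s_a, s_b and r_ab are imputed/assumed
    quantities, as described in the text.
    """
    se = np.sqrt(s_a**2 + s_b**2 - 2 * s_a * s_b * r_ab)
    t_stat = np.sqrt(n) * (mu_b_hat - mu_a_hat) / se
    return stats.t.cdf(t_stat, df=n - 1)

# Hypothetical example: A leads B by 0.01 DSC on n = 50 test cases,
# with assumed standard deviations 0.10 and assumed congruence 0.8.
p = false_claim_prob_seg(0.85, 0.84, s_a=0.10, s_b=0.10, r_ab=0.8, n=50)
```

Because A is the reported winner (\(μ̂_A > μ̂_B\)), the statistic is negative and the resulting probability falls below 0.5; increasing n pushes it further toward 0.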
Methodology for Classification
The probability of false outperformance claims for classification tasks is defined as:
\[
P(\theta_A \le \theta_B \mid \text{reported results})
\;=\;
P(p_A \le p_B \mid \hat{p}_A, \hat{p}_B)
\;=\;
P(p_1 \le p_2 \mid \hat{p}_A, \hat{p}_B)
\;=\;
\int_0^1 \int_0^{p_2}
p\bigl(p_1, p_2 \mid \hat{p}_A, \hat{p}_B\bigr)\, dp_1\, dp_2
\]
- \(p_A\) (resp. \(p_B\)): The probability that a given test case is correctly classified by method A (resp. method B).
- \(p_1\) (resp. \(p_2\)): The probability that a given test case is correctly classified by method A (resp. method B) but not by method B (resp. method A).
- \(p(p_1, p_2 \mid \hat{p}_A, \hat{p}_B)\):
\[
p(p_1, p_2 \mid \hat{p}_A, \hat{p}_B)
= D(x_1 + 1,\; x_2 + 1,\; n - x_1 - x_2 + 2),
\]
where \(D\) is the Dirichlet distribution.
This distribution naturally arises as a conjugate prior of the multinomial distribution
which models the likelihood of the different proportions, such as \(p_1\) and \(p_2\).
- \(n\): Test set sample size.
- \(\hat{p}_A\) (resp. \(\hat{p}_B\)): The reported accuracy of method A (resp. method B).
- \(\hat{p}_{1,1}\): Model congruence. The proportion of cases where both methods made correct predictions.
As we do not have access to this value, we make an assumption based on previous experiments.
- \(x_1 = n \bigl(\hat{p}_A - \hat{p}_{1,1}\bigr)\)
- \(x_2 = n \bigl(\hat{p}_B - \hat{p}_{1,1}\bigr)\)
The integral above is not analytically tractable, so our methodology computes it using Monte Carlo sampling.
Specifically, one draws \(k\) samples of \((p_1, p_2)\) from the Dirichlet distribution and counts the
number of samples \(M\) for which \(p_1 \le p_2\). The probability of false claims is then approximated by:
\[
P(p_A \le p_B \mid \hat{p}_A, \hat{p}_B)
\;\approx\;
\frac{M}{k}.
\]
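The Monte Carlo procedure above can be sketched as follows; the reported accuracies and the congruence value \(\hat{p}_{1,1}\) are hypothetical inputs:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def false_claim_prob_clf(p_a_hat, p_b_hat, p_11_hat, n, k=100_000):
    """Monte Carlo estimate of P(p_A <= p_B) via the Dirichlet posterior."""
    x1 = n * (p_a_hat - p_11_hat)  # cases correct by A but not B
    x2 = n * (p_b_hat - p_11_hat)  # cases correct by B but not A
    # Dirichlet parameters as in the formula above
    samples = rng.dirichlet([x1 + 1, x2 + 1, n - x1 - x2 + 2], size=k)
    p1, p2 = samples[:, 0], samples[:, 1]
    # Fraction of draws where p_1 <= p_2, i.e. M / k
    return np.mean(p1 <= p2)

# Hypothetical reported accuracies with assumed congruence 0.80 on n = 100
p = false_claim_prob_clf(0.90, 0.88, p_11_hat=0.80, n=100)
```

Since sampling from a Dirichlet is cheap, \(k\) can be made large enough that the Monte Carlo error is negligible compared to the uncertainty from the assumed congruence.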
Interpretation
Because the winner (method A) is defined as the one with the higher reported mean performance,
this probability is always bounded by 0.5. Larger probabilities indicate more fragile claims,
whereas smaller probabilities indicate a more robust superiority of method A.
- A probability of false claims close to 0 means the reported winner is very likely to be truly better than the second-ranked method.
- Values under 0.05 are often considered “robust” claims, but this threshold is context-dependent.
- A probability close to 0.5 means the reported advantage is weak and may well be due to sampling noise.