AI systems undergo thorough evaluations before deployment, validating their predictions against a ground truth that is typically assumed to be fixed and certain. However, in many domains, such as medical applications, the ground truth is often curated in the form of differential diagnoses provided by multiple experts. While a single differential diagnosis reflects the uncertainty in one expert's assessment, multiple experts introduce another layer of uncertainty through potential disagreement.
In this talk, I will argue that ignoring this uncertainty leads to overly optimistic estimates of model performance, thereby underestimating the risk associated with particular diagnostic decisions and leading to unanticipated failure modes. We propose a statistical aggregation approach in which we infer a distribution over the probabilities of the underlying candidate conditions themselves, based on the observed annotations. This formulation naturally accounts for potential disagreement between experts as well as the uncertainty within each individual differential diagnosis, capturing the full ground-truth uncertainty. We conclude that, while assuming a crisp ground truth can be acceptable for many AI applications, a more nuanced evaluation protocol should be adopted in medical diagnosis. If time permits, I will also cover some work based on conformal methods, which can provide statistical guarantees.
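As a rough illustration (a minimal sketch of the general idea, not the specific method presented in the talk), one could treat each expert's differential diagnosis as a probability vector over candidate conditions and aggregate these "soft counts" into a Dirichlet posterior over the condition probabilities, so that evaluation is run against a distribution over ground truths rather than a single fixed label. The function and example values below are purely illustrative.

    import numpy as np

    def aggregate_differentials(differentials, prior_strength=1.0):
        """Aggregate expert differential diagnoses into a Dirichlet posterior.

        differentials: (n_experts, n_conditions) array; each row is one
        expert's differential diagnosis and sums to 1.
        Returns the Dirichlet parameters over condition probabilities.
        """
        differentials = np.asarray(differentials, dtype=float)
        n_conditions = differentials.shape[1]
        # Uniform Dirichlet prior plus the "soft counts" contributed by each expert.
        return prior_strength * np.ones(n_conditions) + differentials.sum(axis=0)

    # Hypothetical example: three experts, three candidate conditions.
    # Experts 1 and 2 broadly agree; expert 3 disagrees, which widens the posterior.
    experts = [
        [0.7, 0.2, 0.1],
        [0.6, 0.3, 0.1],
        [0.1, 0.2, 0.7],
    ]
    alpha = aggregate_differentials(experts)
    samples = np.random.default_rng(0).dirichlet(alpha, size=1000)
    print("posterior mean over conditions:", alpha / alpha.sum())
    print("per-condition spread (std):", samples.std(axis=0))

In such a formulation, the spread of the posterior reflects both within-expert uncertainty (how diffuse each differential is) and between-expert disagreement, and model performance can then be scored against samples or expectations under this distribution instead of a single assumed label.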