2419

Do you Agree? An Exploration of Inter-rater Variability and Deep Learning Segmentation Uncertainty

Katharina Viktoria Hoebel¹,², Christopher P Bridge¹,³, Jay Biren Patel¹,², Ken Chang¹,², Marco C Pinho¹, Xiaoyue Ma⁴, Bruce R Rosen¹, Tracy T Batchelor⁵, Elizabeth R Gerstner¹,⁵, and Jayashree Kalpathy-Cramer¹
¹Athinoula A. Martinos Center for Biomedical Imaging, Boston, MA, United States, ²Harvard-MIT Division of Health Sciences and Technology, Cambridge, MA, United States, ³MGH and BWH Center for Clinical Data Science, Boston, MA, United States, ⁴Department of Magnetic Resonance, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China, ⁵Stephen E. and Catherine Pappas Center for Neuro-Oncology, Massachusetts General Hospital, Boston, MA, United States

We show that uncertainty metrics extracted from an MC dropout segmentation model trained on labels from only one rater correlate with inter-rater variability. This makes it possible to identify, in advance, cases that are likely to exhibit high disagreement between human raters.
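As a rough illustration of how a case-level uncertainty metric can be derived from MC dropout, the sketch below averages the foreground probabilities from several stochastic forward passes and summarizes the voxelwise binary entropy of that average. The abstract does not state which metrics were used, so the choice of mean entropy, the function name `mean_entropy`, and the toy arrays are assumptions made for illustration only.

```python
import numpy as np


def mean_entropy(prob_samples, eps=1e-8):
    """Case-level uncertainty from MC dropout samples (illustrative assumption).

    prob_samples: array of shape (T, ...) holding the foreground probability
    predicted in each of T stochastic forward passes with dropout enabled.
    Returns the mean voxelwise binary entropy of the averaged prediction.
    """
    # Average over the T stochastic passes, then clip to avoid log(0).
    p = np.clip(np.mean(prob_samples, axis=0), eps, 1.0 - eps)
    entropy = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return float(entropy.mean())


# Toy data (invented): ten passes that all agree on the foreground ...
confident = np.full((10, 4, 4), 0.99)
# ... versus ten passes that split evenly between foreground and background.
ambiguous = np.concatenate(
    [np.full((5, 4, 4), 0.99), np.full((5, 4, 4), 0.01)], axis=0
)

print(mean_entropy(confident), mean_entropy(ambiguous))
```

In this toy setup the evenly split passes average to a probability of 0.5, which gives the maximum binary entropy, while the agreeing passes give a value close to zero.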

Figure 1: Correlation between the inter-rater Dice score and uncertainty measures (pooled data from the validation and test datasets). The marked test cases are shown in Figure 2.
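The quantities compared in Figure 1 can be sketched as follows: a Dice score between the two raters' binary masks for each case, and a rank correlation between those scores and a per-case uncertainty value. This is a minimal sketch, not the authors' analysis code. The helper names `dice_score` and `spearman_rho`, the use of a Spearman-style rank correlation, and all numbers are hypothetical, and the simple ranking used here does not handle ties.

```python
import numpy as np


def dice_score(mask_a, mask_b):
    """Dice overlap between two binary masks (1.0 if both are empty)."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / denom)


def spearman_rho(x, y):
    """Rank correlation via Pearson correlation of ranks (assumes no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


# Toy masks (invented): rater 2 labels one voxel fewer than rater 1.
rater_1 = np.array([1, 1, 0, 0])
rater_2 = np.array([1, 0, 0, 0])
print(dice_score(rater_1, rater_2))

# Toy per-case values (invented), one entry per case.
inter_rater_dice = np.array([0.9, 0.8, 0.5])
uncertainty = np.array([0.1, 0.2, 0.6])
print(spearman_rho(inter_rater_dice, uncertainty))
```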

Figure 2: Selected axial slices from three cases (marked A, B, C in Figure 1). For each case, the left panel shows an axial slice of the T2W-FLAIR image with the segmentation labels, and the right panel shows the corresponding uncertainty map (brighter areas correspond to higher uncertainty), illustrating areas of high and low uncertainty of the segmentation model. Segmentation labels: turquoise: overlap between the labels of raters 1 and 2; magenta: rater 1 only; orange: rater 2 only.