2419
Do You Agree? An Exploration of Inter-rater Variability and Deep Learning Segmentation Uncertainty
Katharina Viktoria Hoebel1,2, Christopher P Bridge1,3, Jay Biren Patel1,2, Ken Chang1,2, Marco C Pinho1, Xiaoyue Ma4, Bruce R Rosen1, Tracy T Batchelor5, Elizabeth R Gerstner1,5, and Jayashree Kalpathy-Cramer1
1Athinoula A. Martinos Center for Biomedical Imaging, Boston, MA, United States, 2Harvard-MIT Division of Health Sciences and Technology, Cambridge, MA, United States, 3MGH and BWH Center for Clinical Data Science, Boston, MA, United States, 4Department of Magnetic Resonance, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China, 5Stephen E. and Catherine Pappas Center for Neuro-Oncology, Massachusetts General Hospital, Boston, MA, United States
We show that uncertainty metrics extracted from a Monte Carlo (MC) dropout segmentation model trained on labels from a single rater correlate with inter-rater variability. This makes it possible to identify, in advance, cases that are likely to exhibit high disagreement between human raters.
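The abstract does not specify which uncertainty measures were used, so the following is only a minimal sketch of how a voxelwise uncertainty map can be obtained from an MC dropout model: dropout is kept active at inference time, several stochastic forward passes are averaged, and binary predictive entropy is used as one possible uncertainty measure. The function and parameter names (`model`, `n_samples`) are assumptions for illustration, not taken from the abstract.

```python
import torch

def mc_dropout_uncertainty(model, image, n_samples=20, eps=1e-8):
    """Return a mean foreground probability map and a voxelwise
    uncertainty map (binary predictive entropy) from MC dropout sampling."""
    model.train()  # keep dropout layers stochastic at inference time
    with torch.no_grad():
        # n_samples stochastic forward passes over the same input
        probs = torch.stack(
            [torch.sigmoid(model(image)) for _ in range(n_samples)], dim=0
        )
    p = probs.mean(dim=0)  # mean foreground probability per voxel
    # Binary predictive entropy: high where the samples disagree
    entropy = -(p * torch.log(p + eps) + (1 - p) * torch.log(1 - p + eps))
    return p, entropy
```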
Figure 1: Correlation between the inter-rater Dice score and uncertainty measures (pooled data from the validation and test datasets). The marked test cases are shown in Figure 2.
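As a hedged illustration of the quantity plotted in Figure 1, the sketch below computes the inter-rater Dice score per case and its rank correlation with a per-case uncertainty value. The use of Spearman correlation and the aggregation of voxelwise uncertainty into one scalar per case are assumptions; the abstract does not state how these were computed.

```python
import numpy as np
from scipy.stats import spearmanr

def dice(mask_a, mask_b):
    """Dice overlap between two binary masks (e.g. rater 1 vs. rater 2)."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom > 0 else 1.0

def correlate_uncertainty_with_agreement(case_uncertainties,
                                         rater1_masks, rater2_masks):
    """Spearman correlation between per-case uncertainty (one scalar per
    case, e.g. mean voxelwise entropy) and the inter-rater Dice score."""
    inter_rater_dice = [dice(m1, m2)
                        for m1, m2 in zip(rater1_masks, rater2_masks)]
    rho, p_value = spearmanr(case_uncertainties, inter_rater_dice)
    return rho, p_value
```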
Figure 2: Selected axial slices from three cases (marked A, B, and C in Figure 1). For each case, the left panel shows an axial T2W-FLAIR slice with the segmentation labels; the right panel shows the corresponding uncertainty map (brighter areas correspond to higher uncertainty), illustrating regions of high and low model uncertainty. Segmentation labels: turquoise, overlap between the labels of raters 1 and 2; magenta, rater 1 only; orange, rater 2 only.