Establishing a gold standard for test sets: variation in interpretive agreement of expert mammographers.
RATIONALE AND OBJECTIVES:
Test sets for assessing and improving radiologic image interpretation have been used for decades and typically evaluate performance relative to gold standard interpretations by experts. To assess test sets for screening mammography, a gold standard for whether a woman should be recalled for additional workup is needed, given that interval cancers may be occult on mammography and some findings ultimately determined to be benign require additional imaging to determine if biopsy is warranted. Using experts to set a gold standard assumes little variation occurs in their interpretations, but this has not been explicitly studied in mammography.
MATERIALS AND METHODS:
Using digitized films from 314 screening mammography exams (n = 143 cancer cases) performed in the Breast Cancer Surveillance Consortium, we evaluated interpretive agreement among three expert radiologists. Each expert independently assessed whether each examination should be recalled and, for recalled examinations, recorded the lesion location, finding type (mass, calcification, asymmetric density, or architectural distortion), and interpretive difficulty.
RESULTS:
Agreement among the three expert pairs for recall/no recall was higher for cancer cases (mean 74.3% ± 6.5%) than for noncancer cases (mean 62.6% ± 7.1%). Complete agreement on recall, lesion location, finding type, and difficulty ranged from 36.4% to 42.0% for cancer cases and from 43.9% to 65.6% for noncancer cases. Two of three experts agreed on recall and lesion location for 95.1% of cancer cases and 91.8% of noncancer cases, but all three experts agreed on only 55.2% of cancer cases and 42.1% of noncancer cases.
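The agreement measures reported here (pairwise agreement averaged over the three expert pairs, agreement by at least two of three experts, and agreement by all three) can be illustrated with a minimal Python sketch. The reader names, label strings, and toy data below are hypothetical and are not taken from the study; the sketch also assumes that the "±" value is a sample standard deviation across the three pairs, which the abstract does not state.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, stdev

# Hypothetical toy data: one label per exam, combining recall and lesion location.
ratings = {
    "A": ["recall:left", "no_recall", "recall:right", "no_recall", "recall:left"],
    "B": ["recall:left", "no_recall", "recall:left", "recall:right", "recall:left"],
    "C": ["recall:left", "recall:left", "no_recall", "no_recall", "recall:left"],
}

def pairwise_agreement(ratings):
    """Percent of exams on which each pair of readers gave the same label."""
    result = {}
    for a, b in combinations(sorted(ratings), 2):
        matches = sum(x == y for x, y in zip(ratings[a], ratings[b]))
        result[(a, b)] = 100.0 * matches / len(ratings[a])
    return result

def agreement_levels(ratings):
    """Percent of exams where at least two readers agree, and where all agree."""
    columns = list(zip(*ratings.values()))  # one tuple of labels per exam
    # Size of the largest group of identical labels for each exam.
    top = [Counter(votes).most_common(1)[0][1] for votes in columns]
    at_least_two = 100.0 * sum(c >= 2 for c in top) / len(columns)
    unanimous = 100.0 * sum(c == len(ratings) for c in top) / len(columns)
    return at_least_two, unanimous

pairs = pairwise_agreement(ratings)
pair_mean = mean(pairs.values())
pair_sd = stdev(pairs.values())  # assumption: sample SD across the three pairs
at_least_two, unanimous = agreement_levels(ratings)
print(pairs, round(pair_mean, 1), round(pair_sd, 1), at_least_two, unanimous)
```

With binary recall/no recall labels alone, at least two of three readers always agree, so the two-of-three measure is informative only when the label also carries lesion location, as it does in this toy data.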
CONCLUSION:
Variability in expert interpretation is notable. A minimum of three independent experts, combined with a consensus process, should be used to establish any gold standard interpretation for test sets, especially for noncancer cases.
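One way to combine three independent reads with a consensus step is sketched below in Python. The abstract does not describe the consensus procedure, so the rule used here is an assumption: unanimous labels are accepted directly, and every other exam is flagged for consensus review. The function name, status strings, and toy data are hypothetical.

```python
from collections import Counter

def provisional_gold_standard(ratings):
    """Return (label, status) per exam from independent expert reads.

    Assumed rule, not the study's: unanimous labels are accepted;
    majority and split cases are flagged for consensus review.
    """
    results = []
    for votes in zip(*ratings.values()):
        label, count = Counter(votes).most_common(1)[0]
        if count == len(votes):
            results.append((label, "unanimous"))
        elif count > len(votes) / 2:
            results.append((label, "majority: consensus review"))
        else:
            results.append((None, "no majority: consensus review"))
    return results

# Hypothetical reads from three experts on three exams.
example = {
    "A": ["recall:left", "no_recall", "recall:right"],
    "B": ["recall:left", "no_recall", "recall:left"],
    "C": ["recall:left", "recall:left", "no_recall"],
}
gold = provisional_gold_standard(example)
for exam_number, (label, status) in enumerate(gold, start=1):
    print(exam_number, label, status)
```

Flagging majority cases rather than accepting them automatically reflects the finding that two-of-three agreement is common while unanimous agreement is not.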