TY - GEN
T1 - Large-scale validation of artificial intelligence for breast cancer detection in Danish mammography screening
AU - Elhakim, Mohammad Talal
PY - 2024/2/21
Y1 - 2024/2/21
N2 - Early detection with mammography screening along with best practice treatment are recognised as crucial elements in reducing female breast cancer-specific mortality and morbidity. However, widespread capacity issues and shortage of breast radiologists could pose a threat to the continued feasibility and efficiency of the organised screening programme. Deep learning-based artificial intelligence (AI) systems have in recent years shown great advancements in the field of medical imaging, particularly relating to breast cancer detection on mammography. AI solutions have been proposed as a potential support tool for or replacement of reading radiologists within screening to alleviate workforce pressures. However, the generalisability of AI accuracy and feasibility to real-life screening has been questioned owing to methodological limitations in the existing literature, mainly due to non-representative populations and reference standards. Furthermore, there is little evidence on the impact of the selected AI threshold and site of deployment in the workflow on screening outcome. This PhD thesis investigates the accuracy and feasibility of two commercial AI solutions (AI1 and AI2) for breast cancer detection in the context of population-wide double-read mammography screening. Based on a large-scale retrospective validation study, a study cohort of 272,008 consecutive screening mammograms was retrieved from the Region of Southern Denmark. Both AI1 and AI2 were validated in a Standalone AI setup compared to first reading radiologists and in an AI-integrated screening setup with AI simulated as first reader (Integrated AI) in comparison to double reading with arbitration (i.e. combined reading). The selected thresholds for AI were matched with the mean sensitivity and mean specificity of the first reader, AIsens and AIspec, respectively.
To explore different applications, three simulated AI-integrated screening scenarios with AI2 were evaluated in comparison to the combined reading, including the scenario with AI as first reader (Scenario 1: Integrated AI2 first), Scenario 2 with AI as the second reader (Integrated AI2 second), and Scenario 3 with AI as a triage tool independently screen-reading low-risk and high-risk cases, while moderate-risk screenings were assessed by the original combined reading (Integrated AI2 triage). The analyses for AI1 and AI2 were based on study sample 1 and study sample 2, respectively, consisting of 257,671 and 249,402 screening mammograms, all with up to 24 months of follow-up. The accuracy of Standalone AI did not reach the level of first reading radiologists except for Standalone AI2 spec. When AI was simulated as first reader, the accuracy of the AI-integrated screening scenario with both AI1 and AI2 reached the level of the combined reading when AIspec was applied as threshold. Although this was followed by a slight increase in the arbitration rate, the overall reduction in human readings reached approximately -49% with a stable recall rate. In Scenario 2, Integrated AI2 second achieved a statistically significantly lower sensitivity at a trade-off of a lower recall rate, with an approximately similar screen-reading workload reduction as Integrated AI2 first (-48.7% and -48.8%, respectively). In Scenario 3, Integrated AI2 triage achieved a statistically significantly higher sensitivity and lower arbitration rate with a reduction in screen-reading volume of -49.7% and a stable recall rate. The overall conclusion of the PhD thesis is that an AI solution with an appropriate threshold and site of deployment in the screening workflow can potentially be a feasible replacement of one reader or a partial replacement of both readers in double-read mammography screening. The most important limitations in the study were inherently related to the retrospective study design.
Although the extent was difficult to assess, the reference standard was potentially affected by verification bias due to differing historic follow-up depending on the screen-reading decision as well as incorporation bias due to correlated radiologist readings with the reference standard, causing uncertainty regarding the true outcome status of each screening. In light of the findings and limitations in this PhD thesis, and from a clinical applicability point of view, there is a need for strong quality evaluation setups, including prospective controlled trials, prior to large-scale implementation to ensure that long-term cancer outcomes are not negatively affected by AI-integrated screening. The selection of an appropriate AI threshold and site of deployment in the screening workflow constitute important aspects for researchers and medical decision-makers to consider when choosing a local implementation strategy for AI in breast cancer screening.
AB - Early detection with mammography screening along with best practice treatment are recognised as crucial elements in reducing female breast cancer-specific mortality and morbidity. However, widespread capacity issues and shortage of breast radiologists could pose a threat to the continued feasibility and efficiency of the organised screening programme. Deep learning-based artificial intelligence (AI) systems have in recent years shown great advancements in the field of medical imaging, particularly relating to breast cancer detection on mammography. AI solutions have been proposed as a potential support tool for or replacement of reading radiologists within screening to alleviate workforce pressures. However, the generalisability of AI accuracy and feasibility to real-life screening has been questioned owing to methodological limitations in the existing literature, mainly due to non-representative populations and reference standards. Furthermore, there is little evidence on the impact of the selected AI threshold and site of deployment in the workflow on screening outcome. This PhD thesis investigates the accuracy and feasibility of two commercial AI solutions (AI1 and AI2) for breast cancer detection in the context of population-wide double-read mammography screening. Based on a large-scale retrospective validation study, a study cohort of 272,008 consecutive screening mammograms was retrieved from the Region of Southern Denmark. Both AI1 and AI2 were validated in a Standalone AI setup compared to first reading radiologists and in an AI-integrated screening setup with AI simulated as first reader (Integrated AI) in comparison to double reading with arbitration (i.e. combined reading). The selected thresholds for AI were matched with the mean sensitivity and mean specificity of the first reader, AIsens and AIspec, respectively.
To explore different applications, three simulated AI-integrated screening scenarios with AI2 were evaluated in comparison to the combined reading, including the scenario with AI as first reader (Scenario 1: Integrated AI2 first), Scenario 2 with AI as the second reader (Integrated AI2 second), and Scenario 3 with AI as a triage tool independently screen-reading low-risk and high-risk cases, while moderate-risk screenings were assessed by the original combined reading (Integrated AI2 triage). The analyses for AI1 and AI2 were based on study sample 1 and study sample 2, respectively, consisting of 257,671 and 249,402 screening mammograms, all with up to 24 months of follow-up. The accuracy of Standalone AI did not reach the level of first reading radiologists except for Standalone AI2 spec. When AI was simulated as first reader, the accuracy of the AI-integrated screening scenario with both AI1 and AI2 reached the level of the combined reading when AIspec was applied as threshold. Although this was followed by a slight increase in the arbitration rate, the overall reduction in human readings reached approximately -49% with a stable recall rate. In Scenario 2, Integrated AI2 second achieved a statistically significantly lower sensitivity at a trade-off of a lower recall rate, with an approximately similar screen-reading workload reduction as Integrated AI2 first (-48.7% and -48.8%, respectively). In Scenario 3, Integrated AI2 triage achieved a statistically significantly higher sensitivity and lower arbitration rate with a reduction in screen-reading volume of -49.7% and a stable recall rate. The overall conclusion of the PhD thesis is that an AI solution with an appropriate threshold and site of deployment in the screening workflow can potentially be a feasible replacement of one reader or a partial replacement of both readers in double-read mammography screening. The most important limitations in the study were inherently related to the retrospective study design.
Although the extent was difficult to assess, the reference standard was potentially affected by verification bias due to differing historic follow-up depending on the screen-reading decision as well as incorporation bias due to correlated radiologist readings with the reference standard, causing uncertainty regarding the true outcome status of each screening. In light of the findings and limitations in this PhD thesis, and from a clinical applicability point of view, there is a need for strong quality evaluation setups, including prospective controlled trials, prior to large-scale implementation to ensure that long-term cancer outcomes are not negatively affected by AI-integrated screening. The selection of an appropriate AI threshold and site of deployment in the screening workflow constitute important aspects for researchers and medical decision-makers to consider when choosing a local implementation strategy for AI in breast cancer screening.
KW - Artificial intelligence
KW - AI
KW - Breast cancer
KW - Screening
KW - Mammography
U2 - 10.21996/zbhv-jy50
DO - 10.21996/zbhv-jy50
M3 - Ph.D. thesis
PB - Syddansk Universitet. Det Sundhedsvidenskabelige Fakultet
ER -