Counteracting Performance Degradation of Artificial Intelligence in Healthcare

Eline Sandvig Andersen*

*Corresponding author for this work

Research output: Thesis › Ph.D. thesis

Abstract

Background and aim:
The performance of clinical artificial intelligence (AI) models can deteriorate over time as a consequence of shifts in the surrounding environment. At the time of model deployment, it is not possible to accurately predict if, how and when model performance will change. If performance degradation occurs, it can have harmful consequences, and it is therefore imperative that this risk is managed. Numerous strategies exist for mitigating the risks associated with model performance degradation, but evidence regarding their concrete implementation in healthcare remains limited. The overall aim of the three projects constituting this thesis was therefore to investigate specific methods for preventing, mitigating or detecting performance degradation of clinical AI models.

Study I:
Objectives: Through simulation, to evaluate changes in the output of the model for end-stage liver disease (MELD) under various levels of input variation, thereby informing the choice of analytical performance specifications suitable for keeping model performance consistent.

Method: A cohort of 6093 consecutive MELD scores was collected retrospectively. Varying levels of analytical variation were simulated and applied to the dataset, and the resulting changes to the output were evaluated as the percentage of samples changing by ≥ 1 MELD point and the percentage of samples changing risk category. The same procedure was applied to a constructed dataset representing the worst-case scenario in terms of maximum susceptibility to the input changes.
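To make the simulation approach concrete, the following is a minimal sketch assuming the standard UNOS MELD formula and simple proportional Gaussian noise on each input; the synthetic cohort, noise levels and variable names are illustrative assumptions, not the study's exact data or procedure:

```python
import numpy as np

def meld(creatinine, bilirubin, inr):
    """Standard (pre-MELD-Na) UNOS MELD score.

    Inputs below 1.0 are floored at 1.0 and creatinine is capped at 4.0,
    per the usual scoring rules; the result is rounded to whole points.
    """
    crea = np.clip(creatinine, 1.0, 4.0)
    bili = np.maximum(bilirubin, 1.0)
    inr = np.maximum(inr, 1.0)
    return np.round(9.57 * np.log(crea)
                    + 3.78 * np.log(bili)
                    + 11.2 * np.log(inr)
                    + 6.43)

def fraction_changed(crea, bili, inr, cv, rng):
    """Apply proportional Gaussian noise (coefficient of variation `cv`)
    to each input and return the fraction of samples whose MELD score
    changes by >= 1 point."""
    baseline = meld(crea, bili, inr)
    noisy = meld(crea * (1 + rng.normal(0, cv, crea.shape)),
                 bili * (1 + rng.normal(0, cv, bili.shape)),
                 inr * (1 + rng.normal(0, cv, inr.shape)))
    return np.mean(np.abs(noisy - baseline) >= 1)

rng = np.random.default_rng(0)
# Synthetic cohort for illustration; the study used 6093 consecutive
# real-world MELD scores.
n = 6093
crea = rng.lognormal(0.2, 0.5, n)
bili = rng.lognormal(0.5, 0.8, n)
inr = rng.lognormal(0.2, 0.3, n)
for cv in (0.01, 0.03, 0.05):  # 1%, 3%, 5% analytical variation
    frac = fraction_changed(crea, bili, inr, cv, rng)
    print(f"CV {cv:.0%}: {frac:.2%} of samples changed by >= 1 MELD point")
```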

Results: Variation propagated through the model in a complex manner: changes to the output depended both on the level of input variation and on the levels of the input variables themselves (i.e., the population), with the proportion of samples changing by ≥ 1 MELD point ranging from 0.02% to 3.26%.

Study II:
Objectives: To review and summarize methods for monitoring performance of clinical AI, while mapping the nature and extent of the evidence on the topic.

Method: Following the PRISMA and JBI guidelines, a scoping review was conducted, using searches in MEDLINE, Embase, Scopus and ProQuest as well as in the grey literature.

Results: 39 sources of evidence were included, of which the majority consisted of narrative reviews and simulation studies. Only one official guideline was identified. The most abundantly reported metrics were traditional performance metrics (e.g., predictive values). The review also identified other metrics and methods including some methods specifically developed for monitoring clinical AI.

Study III:
Objectives: To investigate the performance of a blood-analysis-based, 90-day cancer risk prediction model in the years following validation, and to determine whether simple input and output monitoring would have alerted users to changes.

Method: Data pertaining to 7110 blood sample requests made from 2020 to 2023 were collected retrospectively. The performance of the model over time was evaluated in terms of predictive values, specificity, sensitivity and area under the receiver operating characteristic curve (AUROC). Deployment of Shewhart control charts for the proportion of valid requests (input monitoring) and the proportion of positive predictions (output monitoring) was simulated, and any alarms were noted.
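A Shewhart control chart for proportions (a p-chart) flags batches whose observed proportion falls outside the centre line ± 3 standard errors. Below is a minimal sketch of such a chart; the batch sizes, baseline window and counts are illustrative assumptions, not the study's data:

```python
import numpy as np

def p_chart_alarms(counts, sizes, baseline_batches):
    """Shewhart p-chart: flag batches whose proportion falls outside
    p_bar +/- 3*sqrt(p_bar*(1-p_bar)/n), with the centre line p_bar
    estimated from an initial baseline window."""
    counts = np.asarray(counts, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    p = counts / sizes
    p_bar = counts[:baseline_batches].sum() / sizes[:baseline_batches].sum()
    sigma = np.sqrt(p_bar * (1 - p_bar) / sizes)
    outside = (p > p_bar + 3 * sigma) | (p < p_bar - 3 * sigma)
    return np.where(outside)[0]  # indices of alarming batches

# Illustrative monitoring of 'proportion of valid requests' per batch:
valid = [188, 191, 185, 190, 187, 150]  # valid requests per batch
total = [200, 200, 195, 200, 198, 200]  # total requests per batch
print(p_chart_alarms(valid, total, baseline_batches=4))  # -> [5]
```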

Results: The model remained stable until the fourth quarter of 2023, when changes to the blood analysis package used for input caused model specificity to drop significantly. Output monitoring issued no alarms, whereas input monitoring issued alarms immediately following the change to the analysis profile.

Conclusions:
As the number of clinical AI applications expands, strategies for counteracting performance degradation will gain increasing importance. In this thesis, we demonstrate how measurement uncertainty can propagate in complex ways; providers of AI input data may thus promote consistent model performance by taking this complexity into account when defining limits for acceptable data quality. We further provide an overview of clinical AI performance monitoring methods, identifying a lack of guidance for practical implementation, and we show how input monitoring can detect critical changes to model input, alerting overseers to a potential impending performance degradation.
Original language: English
Awarding Institution
  • University of Southern Denmark
Supervisors/Advisors
  • Brandslund, Ivan, Principal supervisor
  • Lohman Brasen, Claus, Co-supervisor
  • Röttger, Richard, Co-supervisor
Date of defence: 30 May 2024
Publication status: Published - 16 May 2024

Note re. dissertation

A print copy of the thesis can be accessed at the Library. 
