TY - GEN
T1 - The clinical potential of artificial intelligence in early detection of lung cancer
AU - Høstgaard Bang Henriksen, Margrethe
PY - 2025/1/27
Y1 - 2025/1/27
N2 - Lung cancer (LC) is currently the leading cause of cancer-related deaths, highlighting the
critical necessity for early detection, which is essential for providing curative treatment.
While screening for LC is gradually being introduced through pilot studies across various
countries, discussions persist regarding the optimal selection criteria. Numerous studies
have highlighted the superiority of individual prediction models over the widespread
categorical standard criteria based solely on age and smoking intensity.
The overall aim of this thesis was to explore and refine LC detection models based on
artificial intelligence (AI) utilizing data obtained from clinical health records and registries.
The issue was addressed from several angles, resulting in the incorporation of five articles
in this thesis.
The first four studies revolved around data derived from a high-risk cohort of patients
evaluated in the LC fast-track clinics in the Region of Southern Denmark. Extensive
clinical and laboratory data were collected from this cohort of nearly 40,000 individuals,
25% of whom were LC patients. Associations between data variables and LC
status were examined in Article I, laying the groundwork for subsequent prediction
models. The initial findings led to the use of smoking and laboratory data in the
development of prediction models employing both a machine learning (ML) approach
(Article II) and a Bayesian network (BN) approach (Article III). The ML model
exhibited similar performance to the BN approach with a mean area under the receiver
operating characteristic (ROC) curve (AUC) of 0.77 compared with 0.76, and both reached
a sensitivity of 21% at a fixed specificity of 95%. The ML model identified smoking status,
lactate dehydrogenase, age and plasma calcium levels as the most important factors for
detection of LC. The BN model demonstrated performance robustness when introduced to
missing data (up to 30%), a notable advantage when working with clinical data.
Additional data types such as symptoms at diagnosis, comorbidities, and medication were
integrated into an expanded BN model, investigating whether a more comprehensive
dataset could enhance model performance (Article IV). The best-performing model
achieved an AUC of 0.79 and was developed using comorbidity, laboratory results, and
smoking data on a relatively large dataset with 21% missing variables. Additionally, a
model developed on a small but complete dataset proved to be stable when applied to larger
datasets with up to 39% missing data, indicating its applicability in individuals with
incomplete data. While laboratory results and smoking status were the strongest predictors
of LC, comorbidity (including data on medication and data from general practice) and
symptoms at diagnosis appeared to be the least informative.
While the first four studies focused solely on high-risk patients, we aimed to extrapolate
these findings to a potentially lower-risk population eligible for LC screening. Therefore,
we assessed the risk of LC and the overlap with LC fast-track clinics among chronic
obstructive pulmonary disease (COPD) outpatients (Article V). Within this cohort, we
observed a 5% risk of LC, surpassing the risk in the general population more than tenfold.
Importantly, LC patients with COPD were diagnosed at an earlier stage than LC patients
without COPD. Additionally, 18% of COPD outpatients were referred to LC diagnostics at
some point. While this high referral rate may be due to increased medical attention, it
suggests potential benefits of a regular and systematic screening approach for these
patients.
The insights and methodology outlined in this thesis serve as foundational elements for our
ongoing research, which aims to integrate risk models into a practical clinical screening
context. A highly effective model capable of early-stage LC prediction will enhance
screening efficacy and promote early detection, ultimately leading to improved survival
rates.
AB - Lung cancer (LC) is currently the leading cause of cancer-related deaths, highlighting the
critical necessity for early detection, which is essential for providing curative treatment.
While screening for LC is gradually being introduced through pilot studies across various
countries, discussions persist regarding the optimal selection criteria. Numerous studies
have highlighted the superiority of individual prediction models over the widespread
categorical standard criteria based solely on age and smoking intensity.
The overall aim of this thesis was to explore and refine LC detection models based on
artificial intelligence (AI) utilizing data obtained from clinical health records and registries.
The issue was addressed from several angles, resulting in the incorporation of five articles
in this thesis.
The first four studies revolved around data derived from a high-risk cohort of patients
evaluated in the LC fast-track clinics in the Region of Southern Denmark. Extensive
clinical and laboratory data were collected from this cohort of nearly 40,000 individuals,
25% of whom were LC patients. Associations between data variables and LC
status were examined in Article I, laying the groundwork for subsequent prediction
models. The initial findings led to the use of smoking and laboratory data in the
development of prediction models employing both a machine learning (ML) approach
(Article II) and a Bayesian network (BN) approach (Article III). The ML model
exhibited similar performance to the BN approach with a mean area under the receiver
operating characteristic (ROC) curve (AUC) of 0.77 compared with 0.76, and both reached
a sensitivity of 21% at a fixed specificity of 95%. The ML model identified smoking status,
lactate dehydrogenase, age and plasma calcium levels as the most important factors for
detection of LC. The BN model demonstrated performance robustness when introduced to
missing data (up to 30%), a notable advantage when working with clinical data.
Additional data types such as symptoms at diagnosis, comorbidities, and medication were
integrated into an expanded BN model, investigating whether a more comprehensive
dataset could enhance model performance (Article IV). The best-performing model
achieved an AUC of 0.79 and was developed using comorbidity, laboratory results, and
smoking data on a relatively large dataset with 21% missing variables. Additionally, a
model developed on a small but complete dataset proved to be stable when applied to larger
datasets with up to 39% missing data, indicating its applicability in individuals with
incomplete data. While laboratory results and smoking status were the strongest predictors
of LC, comorbidity (including data on medication and data from general practice) and
symptoms at diagnosis appeared to be the least informative.
While the first four studies focused solely on high-risk patients, we aimed to extrapolate
these findings to a potentially lower-risk population eligible for LC screening. Therefore,
we assessed the risk of LC and the overlap with LC fast-track clinics among chronic
obstructive pulmonary disease (COPD) outpatients (Article V). Within this cohort, we
observed a 5% risk of LC, surpassing the risk in the general population more than tenfold.
Importantly, LC patients with COPD were diagnosed at an earlier stage than LC patients
without COPD. Additionally, 18% of COPD outpatients were referred to LC diagnostics at
some point. While this high referral rate may be due to increased medical attention, it
suggests potential benefits of a regular and systematic screening approach for these
patients.
The insights and methodology outlined in this thesis serve as foundational elements for our
ongoing research, which aims to integrate risk models into a practical clinical screening
context. A highly effective model capable of early-stage LC prediction will enhance
screening efficacy and promote early detection, ultimately leading to improved survival
rates.
KW - Lung cancer
KW - early detection
KW - prediction models
KW - machine learning
KW - Bayesian networks
KW - artificial intelligence
KW - screening
KW - screening models
U2 - 10.21996/feab38bd-ecdf-4b1d-9cc6-5e8e91a483bd
DO - 10.21996/feab38bd-ecdf-4b1d-9cc6-5e8e91a483bd
M3 - Ph.D. thesis
PB - Syddansk Universitet. Det Sundhedsvidenskabelige Fakultet
ER -