TY - JOUR
T1 - Overcoming data scarcity in radiomics/radiogenomics using synthetic radiomic features
AU - Ahmadian, Milad
AU - Bodalal, Zuhir
AU - van der Hulst, Hedda J.
AU - Vens, Conchita
AU - Karssemakers, Luc H.E.
AU - Bogveradze, Nino
AU - Castagnoli, Francesca
AU - Landolfi, Federica
AU - Hong, Eun Kyoung
AU - Gennaro, Nicolo
AU - Pizzi, Andrea Delli
AU - Beets-Tan, Regina G.H.
AU - van den Brekel, Michiel W.M.
AU - Castelijns, Jonas A.
N1 - Publisher Copyright: © 2024 Elsevier Ltd
PY - 2024/5
Y1 - 2024/5
N2 - Purpose: To evaluate the potential of synthetic radiomic data generation in addressing data scarcity in radiomics/radiogenomics models. Methods: This study was conducted on a retrospectively collected cohort of 386 colorectal cancer patients (n = 2570 lesions) for whom matched contrast-enhanced CT images and gene TP53 mutational status were available. The full cohort data was divided into a training cohort (n = 2055 lesions) and an independent and fixed test set (n = 515 lesions). Differently sized training sets were subsampled from the training cohort to measure the impact of sample size on model performance and assess the added value of synthetic radiomic augmentation at different sizes. Five different tabular synthetic data generation models were used to generate synthetic radiomic data based on “real-world” radiomics data extracted from this cohort. The quality and reproducibility of the generated synthetic radiomic data were assessed. Synthetic radiomics were then combined with “real-world” radiomic training data to evaluate their impact on the predictive model's performance. Results: A prediction model was generated using only “real-world” radiomic data, revealing the impact of data scarcity in this particular data set through a lack of predictive performance at low training sample numbers (n = 200, 400, 1000 lesions with average AUC = 0.52, 0.53, and 0.56 respectively, compared to 0.64 when using 2055 training lesions). Synthetic tabular data generation models created reproducible synthetic radiomic data with properties highly similar to “real-world” data (for n = 1000 lesions, average Chi-square = 0.932, average basic statistical correlation = 0.844). The integration of synthetic radiomic data consistently enhanced the performance of predictive models trained with small sample size sets (AUC enhanced by 9.6%, 11.3%, and 16.7% for models trained on n_samples = 200, 400, and 1000 lesions, respectively). In contrast, synthetic data generated from randomised/noisy radiomic data failed to enhance predictive performance, underlining the requirement of true signal data to do so. Conclusion: Synthetic radiomic data, when combined with real radiomics, could enhance the performance of predictive models. Tabular synthetic data generation might help to overcome limitations in medical AI stemming from data scarcity.
KW - Data augmentation
KW - Data scarcity
KW - Medical imaging
KW - Radiogenomics
KW - Radiomics
KW - Synthetic data generation
U2 - 10.1016/j.compbiomed.2024.108389
DO - 10.1016/j.compbiomed.2024.108389
M3 - Journal article
C2 - 38593640
AN - SCOPUS:85189659543
SN - 0010-4825
VL - 174
JO - Computers in Biology and Medicine
JF - Computers in Biology and Medicine
M1 - 108389
ER -