ABSTRACT
PURPOSE
Patients with advanced non-small cell lung cancer (NSCLC) have varying responses to immunotherapy, but there are no reliable, accepted biomarkers to accurately predict its therapeutic efficacy. The present study aimed to construct individualized models through automatic machine learning (autoML) to predict the efficacy of immunotherapy in patients with inoperable advanced NSCLC.
METHODS
A total of 63 eligible participants were included and randomized into training and validation groups. Radiomics features were extracted from the volumes of interest of the tumor circled in the preprocessed computed tomography (CT) images. Golden feature, clinical, radiomics, and fusion models were generated using a combination of various algorithms through autoML. The models were evaluated using a multi-class receiver operating characteristic curve.
RESULTS
In total, 1,219 radiomics features were extracted from the volumes of interest. The ensemble algorithm demonstrated superior performance in model construction. In the training cohort, the fusion model exhibited the highest accuracy at 0.84, with an area under the curve (AUC) of 0.89–0.98. In the validation cohort, the radiomics model had the highest accuracy at 0.89, with an AUC of 0.98–1.00; prediction performance in the partial response subgroup exceeded that in the other subgroups in both the clinical and radiomics models. Patients with low rad scores achieved improved progression-free survival (PFS) (median PFS: 16.2 vs. 13.4 months, P = 0.009).
CONCLUSION
autoML accurately and robustly predicted the short-term outcomes of patients with inoperable NSCLC treated with immune checkpoint inhibitor immunotherapy by constructing CT-based radiomics models, confirming it as a powerful tool to assist in the individualized management of patients with advanced NSCLC.
CLINICAL SIGNIFICANCE
This article highlights that autoML promotes the accuracy and efficiency of feature selection and model construction. The radiomics model generated by autoML predicted the efficacy of immunotherapy in patients with advanced NSCLC effectively. This may provide a rapid and non-invasive method for making personalized clinical decisions.
Main points
• Radiomics modeling based on computed tomography images predicted the efficacy of immunotherapy in patients with advanced non-small cell lung cancer effectively.
• Automatic machine learning can integrate multiple algorithms to obtain improved predictive capabilities.
• The diagnostic performance of the radiomics model outperformed that of the clinical model.
• Patients with lower rad scores achieved superior progression-free survival.
Non-small cell lung cancer (NSCLC) is a prevalent and malignant tumor with high incidence and mortality rates globally.1 Over 30% of new NSCLC cases are diagnosed at locally advanced stages [tumor–node–metastasis (TNM) stage III]. The absence of notable early symptoms often leads to diagnoses at advanced stages or after local metastasis has occurred, which frequently delays surgical treatment.
The current standard treatment for patients with advanced NSCLC involves concurrent chemoradiotherapy followed by immunotherapy.2 Definitive efficacy and improved prognoses have been achieved in all stages of NSCLC with the use of immune checkpoint inhibitors (ICIs), either alone or in combination with chemotherapy.3, 4 In the CHECKMATE-816 clinical trial, nivolumab combined with chemotherapy extended event-free survival (EFS) by 10.8 months and reduced the risk of an EFS event by 37% compared with the control group [hazard ratio (HR) 0.63, confidence interval (CI): 0.43–0.91, P = 0.0052].5 Furthermore, the recent NEOTORCH trial reported a similar extension in EFS and a significantly higher pathological complete response (CR) rate (24.8% vs. 1.0%, P < 0.0001) in the group receiving combined immune-chemotherapy.6 However, in the PACIFIC trial (NCT02125461), only one-third of patients who received adjuvant therapy with durvalumab remained disease-free after 5 years,7, 8 indicating that immunotherapy may not be suitable for all patients due to factors such as the specific tumor immune microenvironment, residual toxicity, and societal expense. Effective immunotherapy is often positively correlated with high programmed death-ligand 1 (PD-L1) expression and a high tumor mutation burden (TMB), but both require biopsy tissue for detection. Because repeated biopsies are often not feasible after chemoresistance develops, treatment selection for patients at an advanced stage is further complicated. Therefore, there is an urgent need to develop non-invasive methods to accurately predict the efficacy of immunotherapy, which could benefit a broader group of patients.
In recent years, thin-slice computed tomography (CT) scans have become integral in diagnosing and staging NSCLC.9, 10 With advancements in medical imaging, there has been a transition from traditional qualitative diagnosis to the extraction of multimodal image data for quantitative analysis. Radiomics, a promising tool in image analysis, allows for the extraction of high-throughput features from imaging data. These features, combined with specific modeling techniques, can enhance the accuracy of disease diagnosis, differentiation, and prognosis evaluation.11 Previously, we developed and implemented delta radiomics diagnostic features to refine and personalize the diagnosis of invasive adenocarcinoma in lung partial solid nodules.12
Automatic machine learning (autoML) algorithms have facilitated the analysis of complex, large-sample data into predictive models and automated classifications. By integrating substantial amounts of data from radiology, pathology, genomics, and proteomics, autoML has enhanced clinical decision-making.13 In the present study, we aimed to identify effective radiomics features in CT images using autoML and integrate them with clinical features to develop a fusion model for individualized efficacy prediction and progression assessment in patients with advanced NSCLC receiving immunotherapy.
Methods
Study design and population
In this retrospective observational single-center study, we reviewed patients with NSCLC who underwent ICI treatment at Huadong Hospital between January 2020 and December 2022. The inclusion criteria were as follows: (1) age >18 years; (2) receiving ICI treatment (anti-PD-1/PD-L1) at Huadong Hospital for the first time; (3) a clinically confirmed diagnosis of unresectable advanced-stage NSCLC [stage III–IV, Union for International Cancer Control/American Joint Committee on Cancer (8th edition)]; and (4) available thin-slice CT images (1–1.25 mm), with lesions delineated and evaluated. The exclusion criteria were as follows: (1) a pathologically confirmed diagnosis of small cell lung cancer; (2) a history of malignancies other than NSCLC; (3) poor CT image quality with artifacts; and (4) failure to extract radiomics features due to other reasons.
Finally, a total of 63 eligible cases were enrolled (Figure 1). The clinical features before receiving ICIs were collected from medical records, including age, gender, smoking history, the time of diagnosis, pathological type, tumor location, the maximum diameter of the primary tumor site, clinical tumor stage, metastatic location, driver gene mutation, the start time and type of ICI treatment, treatment regimen, and disease progression and survival information. The efficacy evaluation was based on the immune-related response evaluation criteria in solid tumors,14 which classify outcomes as CR, partial response (PR), stable disease (SD), and progressive disease (PD). The disease control rate (DCR) refers to the proportion of patients who achieved CR, PR, or SD. All the enrolled cases were then randomly divided into a training cohort and a validation cohort after adjusting for potential confounders. The study was approved by the Ethics Committee of Huadong Hospital, and the requirement for informed consent was waived (approval no.: 2022K033, date: XXX).
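As a small worked example of the DCR definition, the sketch below computes response proportions from the whole-cohort counts reported in the supplementary response table (CR 0, PR 25, SD 8, PD 30). The function name `response_rates` is illustrative and is not part of the study's code.

```python
def response_rates(counts):
    """Return per-category proportions and the disease control rate (DCR)."""
    total = sum(counts.values())
    rates = {name: n / total for name, n in counts.items()}
    # DCR is the proportion of patients with CR, PR, or SD.
    dcr = (counts["CR"] + counts["PR"] + counts["SD"]) / total
    return rates, dcr


# Counts taken from the response table for all 63 patients.
counts = {"CR": 0, "PR": 25, "SD": 8, "PD": 30}
rates, dcr = response_rates(counts)
print(f"DCR = {dcr:.1%}, PD = {rates['PD']:.1%}")
```

With these counts, the DCR and the PD rate sum to 100% because every patient falls into exactly one response category.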
Computed tomography image acquisition
All patients in this study underwent non-contrast-enhanced CT on one of two scanners: a Somatom Definition Flash scanner (Siemens Medical Solutions, Erlangen, Germany) or a GE Discovery CT750 HD scanner (GE Healthcare, MO, USA) at 120 kV. The detailed scanning parameters are shown in Supplementary Table 1. The overall scanning range was from the lung apex to the bilateral adrenal glands. During the examination, the patients were instructed to lie in a supine position and inhale deeply with both arms raised.
Target segmentation and radiomics features extraction
According to the target lesions on the axial slices of the initial CT scans, the volumes of interest (VOIs) were manually marked by two experienced radiologists, each with 5 years’ expertise in diagnosing chest CT images, to achieve three-dimensional (3D) segmentation using the open-source 3D Slicer software (version 4.13.0; National Institutes of Health).
The extraction of radiomic features from these tumor VOIs was automatically performed using pyRadiomics (version: 3.0.1).15 To assess the inter-rater reliability between the radiologists, the intraclass correlation coefficient (ICC) was employed, with ICC >0.75 indicating a high level of agreement. The types of radiomic features extracted included grayscale, shape, texture, and wavelet transform features.
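The agreement check can be illustrated with a short sketch. The text does not state which ICC form was used, so the example below assumes ICC(2,1) (two-way random effects, absolute agreement, single rater); the function name and the feature values are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per lesion, one column per radiologist.
    """
    n = len(ratings)        # number of lesions
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-lesion mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical values of one feature measured on five lesions by two readers.
feature_by_reader = [
    [0.81, 0.79],
    [1.20, 1.25],
    [0.95, 0.97],
    [1.43, 1.40],
    [1.10, 1.08],
]
icc = icc_2_1(feature_by_reader)
print(f"ICC = {icc:.3f}, reproducible = {icc > 0.75}")
```

In a radiomics workflow this check would typically be run once per feature, retaining only features whose ICC exceeds the 0.75 threshold.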
Feature selection and model construction
Due to the broad variability in the initial dataset, the data underwent normalization to control the radiomics features within a standardized intensity range. Feature selection was performed within the training cohort. The MLJAR platform, an open-source software based on Python, was employed for predictive feature selection and modeling.16 This platform is designed to automatically address missing data by implementing strategies such as mean or median imputation to maintain data integrity. It also manages categorical variables by automatically performing encoding transformations, such as one-hot encoding or label encoding, enabling machine learning algorithms to effectively interpret these features. Subsequently, a feature engineering step was undertaken to create “golden features” that possess enhanced predictive power, derived from the original dataset features through operations such as addition, subtraction, multiplication, and division. Throughout the training phase, MLJAR assessed the significance of each feature using techniques such as permutation importance or SHapley Additive exPlanations, providing a quantitative measure of each feature’s impact on the model’s predictive accuracy and offering insight into the underlying decision-making processes of the model.
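The normalization and golden-feature steps can be sketched as follows. This is a minimal illustration of the idea (z-score scaling, then pairwise arithmetic combinations of features), not MLJAR's internal implementation; the function names and feature values are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def zscore(column):
    """Standardize one feature column to zero mean and unit variance."""
    mu, sd = mean(column), pstdev(column)
    # A constant column carries no information; map it to zeros.
    return [0.0 if sd == 0 else (x - mu) / sd for x in column]


def golden_candidates(features):
    """Build pairwise sum, difference, product, and ratio features.

    features: dict mapping feature name -> list of values (one per patient).
    """
    out = {}
    for a, b in combinations(sorted(features), 2):
        xa, xb = features[a], features[b]
        out[f"{a}_plus_{b}"] = [p + q for p, q in zip(xa, xb)]
        out[f"{a}_minus_{b}"] = [p - q for p, q in zip(xa, xb)]
        out[f"{a}_times_{b}"] = [p * q for p, q in zip(xa, xb)]
        # Skip the ratio when any denominator is zero.
        if all(q != 0 for q in xb):
            out[f"{a}_div_{b}"] = [p / q for p, q in zip(xa, xb)]
    return out


# Hypothetical values of two features for three patients.
features = {"f1": [1.0, 2.0, 3.0], "f2": [2.0, 4.0, 8.0]}
scaled_f1 = zscore(features["f1"])
candidates = golden_candidates(features)
print(sorted(candidates))
```

Each candidate would then be scored on the training cohort only, and just the most predictive ones kept, so that the validation cohort plays no part in feature construction.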
Afterward, in the “competition” mode of MLJAR, the software searched for the most effective algorithm among a range of candidates, including linear regression, light gradient-boosting machine (LightGBM), eXtreme gradient boosting, neural networks (NN), and random forest (RF). Additionally, it considered ensembling multiple algorithms to finalize the modeling process. The rad score was obtained by multiplying each feature's value by its coefficient and summing the results.
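The rad score calculation described above is a weighted sum, which can be sketched as below. The feature names, coefficients, and the optional intercept are hypothetical placeholders, not the study's fitted values.

```python
def rad_score(values, coefficients, intercept=0.0):
    """Weighted sum of feature values; the intercept is an assumed optional term."""
    missing = set(coefficients) - set(values)
    if missing:
        raise KeyError(f"missing feature values: {sorted(missing)}")
    return intercept + sum(coefficients[name] * values[name] for name in coefficients)


# Hypothetical coefficients and one patient's (normalized) feature values.
coefficients = {"feat_a": 0.40, "feat_b": -0.25}
patient = {"feat_a": 1.5, "feat_b": 0.8}
score = rad_score(patient, coefficients)
print(f"rad score = {score:.2f}")
```

Features absent from the coefficient table are simply ignored, while a feature that has a coefficient but no measured value raises an error rather than silently contributing zero.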
The predictive model, which included clinical, radiomics, and fusion models, was developed using the aforementioned autoML algorithms. The efficacy of each model was assessed through receiver operator characteristic (ROC) curves for both the training and validation cohorts. Subsequently, the area under the curve (AUC) was calculated to determine the predictive accuracy of each constructed model.
Statistical analysis
The feature extraction and statistical analysis procedures were conducted using R software (version 3.6.2; http://www.Rproject.org) and SPSS 22 (IBM, IL, USA). Categorical variables were analyzed using Fisher’s exact test. To evaluate the multi-class ROC curves, both the macro-AUC and micro-AUC were calculated. The macro-AUC was the unweighted mean of the per-category AUC values, whereas the micro-AUC pooled the one-vs-rest predictions from all categories and computed a single AUC, so that each prediction carried equal weight. Furthermore, model performance was assessed using accuracy, precision, recall, and F1-score.
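The two averaging schemes can be made concrete with a short sketch in the common one-vs-rest formulation. It is a pure-Python illustration, not the R or SPSS routine used in the study, and the labels and predicted probabilities are invented for the example.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def multiclass_auc(y_true, proba, classes):
    """One-vs-rest AUC per class, plus macro and micro averages.

    y_true: list of class labels; proba: list of dicts class -> probability.
    """
    per_class = {}
    pooled_labels, pooled_scores = [], []
    for c in classes:
        labels = [1 if y == c else 0 for y in y_true]
        scores = [p[c] for p in proba]
        per_class[c] = binary_auc(labels, scores)
        pooled_labels += labels
        pooled_scores += scores
    macro = sum(per_class.values()) / len(classes)     # unweighted mean of class AUCs
    micro = binary_auc(pooled_labels, pooled_scores)   # single AUC over pooled decisions
    return per_class, macro, micro


# Hypothetical responses and predicted probabilities for five patients.
y_true = ["PR", "PR", "SD", "PD", "PD"]
proba = [
    {"PR": 0.7, "SD": 0.2, "PD": 0.1},
    {"PR": 0.5, "SD": 0.1, "PD": 0.4},
    {"PR": 0.2, "SD": 0.5, "PD": 0.3},
    {"PR": 0.1, "SD": 0.2, "PD": 0.7},
    {"PR": 0.4, "SD": 0.3, "PD": 0.3},
]
per_class, macro, micro = multiclass_auc(y_true, proba, ["PR", "SD", "PD"])
print(per_class, round(macro, 3), round(micro, 3))
```

The macro average treats every class equally, so a rare class such as SD influences it as much as a common one, whereas the micro average is dominated by the classes with the most samples.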
Model performance was evaluated by ROC analysis, and differences between ROC curves were tested using the DeLong test. Cox regression analysis was used to investigate factors associated with disease progression and survival. Survival rates were analyzed using the Kaplan–Meier method, and survival data comparisons were conducted with the log-rank test. A two-sided P value less than 0.05 was considered statistically significant for all tests.
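For readers unfamiliar with the Kaplan–Meier method, the following is a minimal sketch of the product-limit estimator and the median survival time derived from it. The follow-up times and event indicators are invented; the study itself used standard statistical software for this step.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] steps of the Kaplan-Meier estimate.

    times: follow-up in months; events: 1 = progression observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv, curve = 1.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [e for tt, e in data if tt == t]
        d = sum(tied)                 # number of events at time t
        if d:
            surv *= 1 - d / at_risk   # product-limit update
            curve.append((t, surv))
        at_risk -= len(tied)          # events and censored cases leave the risk set
        i += len(tied)
    return curve


def median_survival(curve):
    """First time at which survival drops to 0.5 or below, else None (not reached)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical PFS data for seven patients.
times = [3, 5, 5, 8, 12, 16, 20]
events = [1, 1, 0, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
median_pfs = median_survival(curve)
print(curve, median_pfs)
```

A return value of `None` corresponds to the situation usually reported as "median not reached", which occurs when fewer than half of the patients at risk have had an event by the end of follow-up.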
Results
Basic characteristics of patients
The basic characteristics of the patients are listed in Table 1. In total, 63 patients with advanced NSCLC who had received ICIs in our hospital were randomly divided into the training cohort (n = 44, PR: 15, SD: 7, and PD: 22) and the validation cohort (n = 19, PR: 10, SD: 1, and PD: 8) based on the efficacy evaluation (Supplementary Table 2).
In the training cohort, the tumor pathological type differed among patients with different treatment responses [lung squamous cell cancer vs. lung adenocarcinoma (LUAD), 13 vs. 31, P = 0.034]. In the validation cohort, a difference in the clinical TNM (cTNM) stage was observed (cTNM III vs. cTNM IV, 4 vs. 15, P = 0.041). No differences were observed in age, gender, tumor location, driver gene mutations, smoking history, PD-L1 expression, or combination therapy among the patients (all P > 0.05).
Selection of radiomics and clinical golden features
The radiomics feature selection workflow is shown in Figure 2. Radiomics features were automatically extracted from the VOIs, yielding a total of 1,219 features. Within the training cohort, the golden features, regarded as the most predictive features, were selected by autoML for the subsequent model construction. Among the radiomics features, based on the superior performance of the LightGBM algorithm, log-sigma-4-0mm_glrlm_LowGrayLevelRunEmphasis had the highest mean feature importance; the top 25 golden features are listed in Supplementary Figure 1. The rad scores for patients undergoing ICI treatment were significantly lower in the DCR group than in the PD group in both the training (0.105 ± 0.284 vs. 0.502 ± 0.318, P < 0.001) and the validation cohorts (0.119 ± 0.224 vs. 0.528 ± 0.262, P = 0.002) (Supplementary Figure 2).
Among the clinical features, ten golden features were identified and selected for model building using autoML. Among these, the feature representing the combination with chemotherapy (feature 11) was identified as the most critical (Supplementary Figure 3).
Model construction and performance comparison
Based on the input of golden features with the highest importance, different learning algorithms were selected for establishing each model (Supplementary Figure 4). The ensemble algorithm demonstrated the lowest log-loss value in both the clinical and fusion models, indicating greater accuracy and a superior alignment between the predicted results and actual outcomes. In the radiomics model, the performance of the ensemble algorithm matched that of LightGBM, also suggesting improved accuracy and consistency.
In both the radiomics and fusion models, the micro-AUC and macro-AUC were higher than those in the clinical model across the training and validation cohorts. In terms of accuracy, the fusion model scored the highest in the training cohort (0.84), whereas the radiomics model outperformed the other models in the validation cohort (0.89). In the training cohort, the radiomics and fusion models both performed best in the SD subgroup, with AUCs of 0.96 (95% CI, 0.638–1.000) and 0.98 (95% CI, 0.676–0.996), respectively. In the validation cohort, the AUCs of the radiomics model in all three subgroups (PR, PD, and SD) were higher than those of the clinical and fusion models. Additionally, in the validation cohort, the PR subgroup exhibited better recall values and F1-scores than the SD and PD subgroups in both the clinical and radiomics models, suggesting enhanced predictive performance for this subgroup (Table 2, Figure 3).
Model prediction of progression-free and overall survival
All the enrolled patients were followed up for progression-free survival (PFS) and overall survival (OS), including 30 disease-progressed cases and 8 deaths, with a median follow-up time of 20 months (range: 3–47 months). Based on a nomogram derived from the multivariate Cox regression analysis, patients undergoing ICI treatment were divided into high and low rad-score groups, with a threshold of 0.3 (Figure 4a). Regression analysis confirmed that the rad score was a more accurate predictor of progression risk than clinical factors (HR: 0.25, 95% CI: 0.10–0.63, P = 0.004) (Figure 4b). Although there was no significant difference in OS between the high and low rad-score groups (20.2 vs. 21.8 months, P = 0.056), the median PFS was notably longer in the low-score group, at 16.2 months, compared with 13.4 months in the high-score group (P = 0.009) (Supplementary Figure 5). The above data suggest that patients with low rad scores, as determined by the radiomics model, tend to experience less progression following immunotherapy.
Discussion
In the present study, we developed and validated a radiomics-based model using autoML algorithms to non-invasively assess the efficacy of immunotherapy in patients with inoperable advanced NSCLC. The findings revealed that the model, which incorporates features from CT images, displayed robust capabilities for diagnostics as well as for predicting therapeutic efficacy and disease progression.
In addition to PD-L1 expression, recent studies have shown that ICIs are highly effective in patients with high microsatellite instability or deficient mismatch repair (dMMR). Tumor cells with dMMR characteristics tend to have a higher TMB, which leads to the production of a considerable number of neoantigens. These neoantigens facilitate the recruitment of lymphocytes that become tumor-infiltrating lymphocytes, inhibiting tumor growth and enhancing the efficacy of immunotherapy.17, 18 However, these markers are typically identified through pathological immunohistochemistry or next-generation sequencing analysis, which require invasive tissue sampling and are costly. Therefore, there is a need for non-invasive, cost-effective, and accurate predictive methods using radiomics.
Progress in computerized imaging technology has led to the production of higher-definition images, enhancing radiomics’ ability to extract more intricate features than traditional imaging methods. This advancement supports the performance of high-dimensional quantitative analysis, providing additional insights for clinical decision-making.19 At present, numerous researchers have developed models with refined features that demonstrate high evaluation efficacy in various NSCLC application scenarios. These models have proven effective in predicting lesion benignity and malignancy, lymph node metastasis, driver mutations, and the severity of adverse effects.20-23 For example, Yoon et al.24 found that CT imaging features could non-invasively predict PD-L1 expression, and that validated radiomics models had greater discriminatory power than those generated from clinical features alone in an advanced LUAD cohort. Similarly, Trebeschi et al.25 identified a non-invasive machine learning biomarker capable of differentiating between responders and non-responders to immunotherapy, and this model achieved an AUC value of 0.83 in lung cancer studies.25
In all our models, the predictive performance for the PR subgroup exceeded that for the PD subgroup. These results suggest that our model aided in identifying patients who are likely to benefit from immunotherapy. However, the diagnostic consistency for the SD subgroup in the validation cohort remained uncertain due to the limited sample size. Previous studies typically focused on binary outcomes, such as categorizing responses as effective or ineffective, or as progressive or non-progressive, which often excluded patients in the SD subgroup. The antitumor effect in the SD subgroup is considered ambiguous, leading to no significant differences in OS compared with the PR or PD subgroups. Although fusion models are generally regarded as having superior predictive capabilities, in this study, they only excelled in the SD subgroup compared with both clinical and radiomics models alone. One possible explanation is that the features extracted from the images, when processed by autoML, might yield predictions that contradict clinical features, thereby reducing the predictive accuracy of the fusion model.
In the survival analysis, PFS differed significantly between patients with high and low rad scores (P = 0.009), whereas the difference in OS was not statistically significant (P = 0.056). This lack of significance in OS could be due to all patients being in the advanced stages of the disease (cIII–cIV) and exhibiting either lymph node or distant metastasis, both of which are associated with higher risks. In studies with smaller sample sizes and shorter follow-up times, PFS may be a more suitable endpoint than OS, although OS remains the gold standard for measuring clinical benefit.
Furthermore, a positive result in PFS does not always translate to a benefit in OS. This discrepancy can arise because the toxic side effects of a treatment might cause a statistical bias in the PFS assessment, with drugs that have higher side effects potentially showing a “false” PFS advantage during shorter follow-up periods. In this study, the high rad-score group accounted for more than half of the recurrences (median PFS: 13.8 months), whereas the low rad-score group did not reach the median PFS. Median OS was not achieved in either group. The median follow-up time was 20 months, exceeding the median PFS by 6.2 months, which may also indicate robust results.
With the progression in central processing unit and graphics processing unit technology, deep learning and autoML methods have gained popularity.26 In the present study, various algorithms were sequentially employed to develop clinical, radiomics, and fusion models via autoML. Among these, the ensemble models that integrated multiple classifiers demonstrated superior performance. However, the radiomics model, developed using LightGBM, achieved prediction levels in the training cohort comparable to those of the ensemble model. LightGBM is a framework that implements the gradient-boosting decision tree algorithm. This algorithm is well-regarded in machine learning for its ability to iteratively train weak classifiers to derive an optimal model, notable for its efficient parallel training, improved accuracy, and capability to prevent overfitting.27, 28 In response to the characteristics of the dataset, different machine learning algorithms have demonstrated their respective performance advantages. For instance, Wiesweg et al.29 applied support vector machine modeling to analyze RNA expression from biopsy samples in patients with advanced NSCLC, identifying seven genes predictive of immunotherapy response. Similarly, using a cytokine-based ICI response index, Wei et al.30 employed RF modeling to predict responses to ICIs in patients with NSCLC. In the present study, we harnessed autoML to amalgamate multiple algorithms, developing models that exhibited enhanced predictive efficacy. This approach could significantly aid in predicting the effectiveness and survival outcomes of ICI treatment in patients with advanced NSCLC.
The current study has several limitations. First, as this was a single-center retrospective study with a small sample size in the training cohort, the specificity of the models may be affected, and multicenter clinical data are needed to confirm their robustness. Second, CT images were obtained from two scanners, and the resulting lack of uniformity might have adversely affected radiomics feature extraction. Third, although the MLJAR platform offers capabilities for model interpretation, the interpretability of autoML models decreases as their complexity increases, making it difficult for clinicians to understand and trust the model outputs, which could affect the reliability of model outcomes and the quality of decision-making. Fourth, the assessment of PD-L1 expression was limited by the amount of tissue available from fine-needle biopsy, resulting in some patients not being accurately assessed. Finally, it is crucial in practice to select the most suitable combination of autoML algorithms, tailored to the specific characteristics of the data.
Furthermore, although the primary goal of this study was to provide surgical and oncology specialists with a predictive tool for treatment efficacy in patients with advanced NSCLC, challenges have arisen in accurately identifying lesions on CT images. To address this, Jiang et al.31 developed a multi-scale convolutional NN method that integrates features from different resolutions to segment lung tumors accurately, facilitating the precise and automated tracking of tumor volumes. Integrating similar diagnostic models could enhance the utility of autoML in clinical settings. Moreover, although autoML allows for the training of numerous deep learning models with minimal coding or data input, the performance of these models can vary, and there remains room to improve both efficiency and prediction accuracy. Models that are designed and refined by experts may prove more reliable, and further clarification is needed on their clinical relevance and guidelines for practical diagnosis and treatment.
In conclusion, autoML can accurately predict the efficacy of immunotherapy and the short-term prognosis of patients with inoperable advanced NSCLC by constructing CT-based radiomics models, aiding the clinical evaluation and screening of a broader population and the development of personalized treatment strategies.
Conflict of interest disclosure
Funding
References
Supplementary Materials
Parameters | GE Discovery CT750 HD | Somatom Definition Flash |
Tube voltage (kVp) | 120 | 120 |
Tube current (mAs) | 200 | 110 |
Pitch | 0.984:1 | 1.0 |
Collimation (mm) | 0.625*64 | 0.6*64 |
Rotation time (s/rot) | 0.5 | 0.33 |
SFOV (cm) | 50 | 50 |
Slice thickness of reconstruction (mm) | 1.25 | 1 |
Slice interval of reconstruction (mm) | 1.25 | 1 |
Reconstruction algorithm | STND | Medium sharp |
kVp, kilovoltage peak; mAs, milliampere-seconds; SFOV, scan field of view; STND, standard reconstruction algorithm.
Tumor response | All patients (n = 63) |
CR | 0 (0.0%) |
PR | 25 (39.7%) |
SD | 8 (12.7%) |
PD | 30 (47.6%) |
DCR (CR + PR + SD) | 33 (52.4%) |
CR, complete response; PR, partial response; SD, stable disease; PD, progressive disease; DCR, disease control rate.