ABSTRACT
PURPOSE
To systematically investigate the impact of image preprocessing parameters on the segmentation-based reproducibility of magnetic resonance imaging (MRI) radiomic features.
METHODS
The MRI scans of 50 patients were included from the multi-institutional Brain Tumor Segmentation 2021 public glioma dataset. Whole tumor volumes were manually segmented by two independent readers per patient, drawn from a pool of eight readers. Radiomic features were extracted from two sequences: T2-weighted (T2) and contrast-enhanced T1-weighted (T1ce). Two methods were considered for discretization: bin count (i.e., relative discretization) and bin width (i.e., absolute discretization). Ten discretization (five for each method) and five resampling parameters were varied while other parameters were fixed. The intraclass correlation coefficient (ICC) was used for reliability analysis based on two commonly used cut-off values (0.75 and 0.90).
RESULTS
Image preprocessing parameters had a significant impact on the segmentation-based reproducibility of radiomic features. The bin width method yielded more reproducible features than the bin count method. In discretization experiments using the bin width on both sequences, according to the ICC cut-off values of 0.75 and 0.90, the rate of reproducible features ranged from 70% to 84% and from 35% to 57%, respectively, with an increasing percentage trend as parameter values decreased (from 84 to 5 for T2; 100 to 6 for T1ce). In the resampling experiments, these ranged from 53% to 74% and from 10% to 20%, respectively, with an increasing percentage trend from lower to higher parameter values (physical voxel size; from 1 x 1 x 1 to 2 x 2 x 2 mm3).
CONCLUSION
The segmentation-based reproducibility of radiomic features appears to be substantially influenced by discretization and resampling parameters. Our findings indicate that the bin width method should be used for discretization and that lower bin width values and higher resampling values should be used to obtain more reproducible features.
Main points
• Variations of image preprocessing parameters, regarding discretization and resampling, have a significant impact on the segmentation-based reproducibility of radiomic features.
• The bin width method yields more reproducible features than the bin count method for discretization.
• Using lower bin width values and higher resampling values could help produce more reproducible features.
• The optimal preprocessing parameters should be determined within the radiomic pipeline.
• To allow replication, preprocessing parameters should be transparently reported in radiomic publications due to their importance.
Radiomics is a field of medical image analysis that enables the digital decoding of images into high-throughput quantitative features.1 Medical images may contain hidden patterns, indicating the underlying pathophysiology of the examined tissue. Based on this assumption, radiomic features derived from these images might help characterize tissues and guide clinical decision-making.1, 2 Support for this notion has arisen from numerous studies that have addressed the capability of radiomics in making predictions regarding different clinical endpoints.3 There has been an exponential increase in publications related to radiomics, with a yearly growth rate of 19.6% and a doubling time of 3.9 years.4 However, reproducing and validating published studies is still challenging due to a lack of standardized definitions, parameter settings, and inadequate reporting.5, 6, 7, 8, 9
Before implementing radiomics in clinical practice, it is necessary to have a thorough understanding of the reproducibility of radiomic features. Many previous publications have emphasized the dependency of radiomic features on different factors, such as temporal variability,10, 11 scanning parameters,12, 13, 14 delineation uncertainty,15, 16 reconstruction algorithms,17 preprocessing,8 and organ motion.18 The absolute values and statistical distributions of radiomic features are considerably affected by these determinants, which in turn affects the robustness and generalizability of any subsequent analysis derived from these features. To overcome this divergence, the Image Biomarker Standardization Initiative (IBSI) attempted to standardize the radiomic feature extraction process, focusing on the issues of repeatability, reproducibility, and validation in quantitative image analysis and radiomics.5 According to this initiative, standardized image processing should be performed before radiomic feature extraction.5 Nonetheless, no specific processing parameter settings have been published to date, which underlines the requirement for additional research.8, 19, 20
One of the most important steps in the radiomic pipeline that affects reproducibility is segmentation or delineation.21, 22 For example, a feature might be highly reproducible in a test–retest setting, but there is no guarantee that this feature will be robust after segmentation. Segmentation-based reproducibility analysis is extensively used to reduce the high dimensionality of radiomics data as a data handling step for subsequent predictive modeling procedures.2, 23 However, only a limited number of studies have focused on the impact of preprocessing settings on segmentation-based feature reproducibility.24, 25 Duron et al.24 studied magnetic resonance imaging (MRI)-based radiomic features of lachrymal gland tumors and breast lesions with a focus on discretization techniques. Lu et al.25 investigated positron emission tomography/computed tomography (PET/CT)-based radiomic features in patients with nasopharyngeal carcinoma, again with a focus on discretization. No research has specifically assessed the impact of both image voxel resampling and gray-level discretization on the segmentation-based reproducibility of the radiomic features. However, these two preprocessing methods are frequently encountered in radiomic feature extraction software tools.
The purpose of this study was to systematically investigate the effect of image preprocessing parameters on the segmentation-based reproducibility of radiomic features from MRI and to recommend reasonable parameter settings for achieving highly reproducible features.
Methods
Figure 1 depicts the key study steps to help readers understand the methodology.
Results
Figure 2 presents the distribution of the ICC estimates for the preprocessing experiments, including discretization with bin count, discretization with bin width, and voxel resampling. Detailed descriptive statistics of the ICC estimates for each preprocessing experiment are presented in Table 1. As the bin width was reduced in the experiments, the mean ICC values increased. In experiments involving the bin count, the mean ICC values increased as the bin count increased. Both sets of experiments revealed that an increase in the number of gray levels led to an increase in the mean ICC values and, in turn, the segmentation-based reproducibility of radiomic features. The mean ICC values were statistically significantly different and higher in the bin width group (for T1ce, mean ± SD, 0.855 ± 0.158; for T2, mean ± SD, 0.818 ± 0.169) than in the bin count group (for T1ce, mean ± SD, 0.729 ± 0.196; for T2, mean ± SD, 0.713 ± 0.180) on both the T1ce [t(2764) = −28.2, P < 0.001] and T2 [t(2764) = −22.3, P < 0.001] sequences. For the resampling, the mean ICC values improved as the resampled physical voxel size increased.
Table 2 presents the ANOVA findings for parameter differences across experimental groups. Although the effect sizes were minor (range: 0.002–0.029), all comparisons for all three preprocessing experiments were statistically significant (P < 0.001 for all experiments in both sequences). The statistically significant pairs following the post-hoc Tukey test are summarized in Table 3. Considering all evaluations based on sequence and preprocessing experiments, there were statistically significant differences at least between all minimum and maximum numeric values of the preprocessing parameters (e.g., bin count of 8 vs. 128; resampling 1 x 1 x 1 vs. 2 x 2 x 2 mm3).
Figures 3 and 4 depict the percentages of features with good and excellent reproducibility in the discretization and resampling experiments, based on two typical ICC cut-off values (0.75 and 0.90). In the discretization experiments with bin count on both sequences, taking the ICC cut-offs of 0.75 and 0.90 into account, the rate of reproducible features was 36%–69% and 9%–19%, respectively, with an increasing percentage trend from lower parameter values to higher parameter values. In the discretization experiments with bin width on two sequences, with the ICC cut-off values of 0.75 and 0.90, the rate of reproducible features was 70%–84% and 35%–57%, respectively, with an increasing percentage trend as parameter values decreased. In resampling experiments on both sequences, with the ICC cut-off values of 0.75 and 0.90, the rate of reproducible features was 53%–74% and 10%–20%, respectively, with an increasing percentage trend from lower to higher parameter values.
Given a fixed first-order range in a sequence calculated based on the dataset, the bin width experiments outperformed the respective bin count experiments (e.g., for T1ce, a bin count of 128 vs. a bin width of 6) in terms of the percentages of features with good (ICC ≥0.75) and excellent (ICC ≥0.90) reproducibility in all comparisons, with statistically significant distributional differences (Table 4).
Figures 5 and 6 for the T1ce sequence and Supplementary Figures S1 and S2 for the T2 sequence depict the reproducibility of radiomic features according to the feature classes and image types from which they were extracted. In a qualitative evaluation of these bar charts, no major deviation from the general trend was observed, except for the original image type.
Discussion
In this study, we systematically investigated the influence of image preprocessing parameters (i.e., discretization and resampling) on the segmentation-based reproducibility of MRI radiomic features and found a significant impact. The bin width method yielded more reliable features than the bin count method. Using lower bin width values and higher resampling values produced more reproducible features.
Several studies have evaluated the influence of preprocessing and segmentation independently,34 neglecting their influence on each other to a large extent. To our knowledge, very few studies have focused on the impact of preprocessing settings on segmentation-based reproducibility.24, 25 Additionally, no research has specifically assessed the impact of both image voxel resampling and gray-level discretization on the segmentation-based reproducibility of radiomic features.
Duron et al.24 studied two independent MRI datasets of lachrymal gland tumors and breast lesions from two different centers, with two-dimensional delineations for each dataset. They evaluated six absolute (i.e., fixed bin width method) and eight relative (i.e., bin count method) discretization parameters and studied the distribution and highest number of replicable features for each technique. In addition, they utilized computer-generated delineations that were indicative of inter-observer variability. They observed that the discretization approach had a direct impact on feature repeatability, independent of observers, software, or method of delineation (simulated vs. human). Absolute discretization (i.e., the fixed bin width method) was recommended because it consistently produced significantly more reproducible features than relative discretization. Large bin numbers or narrow bin widths produced the highest number of repeatable features in all experiments. They also underlined that, regardless of the selected method, detailed documentation is vital so that results can be replicated. Although the tumors and range of parameters in our study were completely different from those of Duron et al.,24 we observed similar trends in the discretization experiments, so the two studies support each other. Conversely, the most recent guidelines released by the IBSI,5 and a recent seminal phantom study,8 recommend relative discretization techniques (i.e., the bin count method) across disparate acquisitions. Despite the recommendations, some other studies have shown that the relative discretization method might not be the optimal technique.24
Lu et al.25 investigated the robustness of PET/CT-based radiomic features in terms of segmentation and discretization and conducted experiments to study them in patients with nasopharyngeal carcinomas. In total, 50%–63% of their features had an ICC ≥0.8 for the segmentation experiments, whereas 21%–23% of features showed an ICC ≥0.8 for the discretization experiments. However, only 6 of 57 features (11%) had an ICC ≥0.8 for the simultaneous evaluation of both segmentation and discretization experiments. Although Lu et al.25 used a methodology that was quite different from ours, their study was indeed successful in showing the impact of discretization on the segmentation-based reproducibility of the radiomic features.
Unlike the above-mentioned studies, we additionally experimented with resampling parameters and discovered that increasing resampling size resulted in improved segmentation-based reproducibility rates. This additional finding on resampling is contradictory to the studies on the phantom experiments regarding the reproducibility of radiomic feature values. For instance, in a very recent phantom study, Wichtmann et al.8 recommended that resampled voxels should not be too far from the original voxel size regarding feature reproducibility.
Our experiments and previous studies indicate that both discretization and resampling parameters significantly impact the segmentation-based reproducibility of radiomic features, and the optimal parameters to achieve high reproducibility in feature values and segmentation-based reproducibility seem contradictory. For this reason, care should be taken to find the optimal parameters to achieve both feature value reproducibility and segmentation-wise reproducible features within the radiomic pipeline.
This study has several differences when compared with previous studies. First, the number of features was higher than that of previous studies and was as high as those in radiomics research publications that had a clinical purpose. Second, the analysis was not limited to discretization but included experiments regarding resampling. These two preprocessing options commonly appear in open-source feature extraction software programs. Third, the experiments were performed in a different pathology (i.e., glioma), expanding the knowledge of the impact of preprocessing on segmentation-based reproducibility of radiomic features.
The public annotation dataset of BraTS 2021 was not used in the reproducibility experiments of this study because those data were based on a fusion of resultant annotations from several automated methods, first using the simultaneous truth and performance level estimation algorithm, followed by corrections applied by experts.28 It would be difficult to perform and replicate the reproducibility experiments based on the public dataset, which may also not be representative of radiomics publications in general (not specifically those on gliomas) because those papers assessing segmentation reproducibility generally include at least two individual readers. For this reason, we segmented the dataset included in this study ourselves using the whole tumor volume to truly represent the segmentation-based reproducibility step of the radiomic studies.
Our experiments provided several practical points that might be considered in radiomic pipelines, associated publications, and clinical applications. First, image processing including discretization and voxel resampling has a considerable impact on the segmentation-based reproducibility of radiomic features; this should be considered as a means of improving the reproducibility of radiomic features that will be input to the following modeling stages. Second, the bin width method provided more reliable features than the bin count method in terms of segmentation-based reproducibility. Therefore, the bin width method should be favored in clinical studies. Third, using lower values for the bin width and higher values for the resampling provided more reproducible features. Given that there has been a lack of standardized preprocessing settings for discretization and resampling in the literature, these findings might provide guidance for end-users of the radiomic feature extraction tools. Fourth, because preprocessing methods and their parameters influence the generation of reproducible inputs for modeling, our findings indicate that they must be defined in detail in published articles for radiomics models to be reliable.35 According to a recent study, these essential radiomic parameters have usually been poorly reported in publications.7 The recently published Checklist for Evaluation of Radiomics Research has also drawn attention to the same reporting issues.9
Our findings in this study should be interpreted with the following limitations.
First, the protocol for the acquisition of the BraTS 2021 challenge is not entirely clear. It is necessary to conduct research into the influence of the acquisition protocol (e.g., scanner type or acquisition settings) on image properties to gain a deeper comprehension of the behavior of radiomic features.
Second, our research was limited to a single imaging modality, two sequences, manual three-dimensional segmentation, a single tumor pathology, and gross tumor volume to remain manageable, considering the number of experiments conducted. However, we should acknowledge that every one of the aforementioned limitations may hamper the generalizability of the findings. We could also have added other alternatives to this study; however, that may have unnecessarily increased the complexity and workload, which was already high. This study aimed primarily to bring the attention of the radiomics community to the sensitivity of segmentation-based reproducibility to slight changes in two common preprocessing methods and offer reasonable settings. Alternative factors, such as different tumors, other MRI sequences, and different segmentation techniques, should be investigated as part of ongoing research.
Third, although significant and recommended by the IBSI guidelines,5 the preprocessing techniques utilized in this study were only representative of a subset of the available options. However, the methods we used are available on the user interface of nearly all open-source radiomic feature extraction tools. The issue of standardization in radiomic studies may also involve scanner performance, acquisition protocols, acquisition sequence parameters, and data analysis techniques. However, we believe that the results of our study could be a step toward the standardization of radiomics.
Fourth, in our resampling experiments, the bin count was fixed. In light of the pair-wise comparison experiments that were conducted with the final number of gray levels fixed, we anticipate observing a similar pattern when employing the bin width method. Additionally, when resampling images, we performed downsampling, as there has been no clear evidence on whether upsampling or downsampling methods are preferable.2, 5, 8 However, although we considered the use of upsampling to be counterintuitive due to the addition of new voxels, it should be further explored in future experiments.
Fifth, the optimal settings for image processing to achieve the highest proportion of reproducible features were specific to the configuration used in this study. Our objective was not to identify absolute optimal values for all combinations of preprocessing settings. Consequently, no definitive conclusions should be drawn regarding the absolute best parameters (because, for example, they may be beyond the range of parameters used in the experiments) or the optimal sequence and discriminative performance.
Sixth, we did not test semi-automated or automated procedures in this study. Even with such techniques, a human touch or consensus segmentation is usually needed for correction, necessitating an analysis of feature reproducibility for segmentation, and supporting the need for conducting such a study.
In conclusion, to improve and standardize radiomic applications, every potential dependency of radiomic features on various parts of the radiomic workflow should be considered while developing a clinical or research project. In this study, the effect of image preprocessing parameters on the segmentation-based reproducibility of radiomic features from MRI was investigated systematically. Variations of image processing parameters related to discretization and resampling had a significant impact on the segmentation-based reproducibility of radiomic features within the scope of this study, regardless of MRI sequences. In terms of segmentation-based reproducibility, the bin width method yielded more reliable features than the bin count method. Using lower bin width values and higher resampling values produced more reproducible features. We recommend that these processing parameters be determined within the radiomic pipeline and transparently reported in radiomic publications. We anticipate that the implementation of our recommendations may facilitate the selection of more reproducible features and enhance the translation and generalizability of radiomics analyses. Considering the radiomics reproducibility crisis, extensive reproducibility studies are required before radiomics can be reliably implemented in routine clinical practice.
Dataset
In this study, we used the Brain Tumor Segmentation (BraTS) 2021 public glioma dataset,26, 27, 28 which does not require local ethical approval. The MRI data for the BraTS 2021 challenge were collected using various clinical protocols and scanners from a variety of data-contributing institutions. There were four MRI sequences in the dataset: T1-weighted (T1), T2-weighted (T2), contrast-enhanced T1-weighted (T1ce), and fluid-attenuated inversion recovery (FLAIR). All BraTS MRI scans underwent standardized preprocessing, which included the conversion of Digital Imaging and Communications in Medicine-format files to Neuroimaging Informatics Technology Initiative format, co-registration to the same anatomical template (SRI24),29 isotropic voxel resampling (1 x 1 x 1 mm3), and skull-stripping.30
For this reproducibility study, 50 patients with gliomas were randomly selected. Patient identifiers are provided in Supplementary Table S1. Readers who performed the segmentation used all four sequences. Only two sequences (T2 and T1ce) were used for the preprocessing experiments to assess the dependency of the results on the different sequences; the use of more sequences may have become unfeasible considering the workload and complexity of the study. The T2 sequence was selected to represent the outermost boundary of the tumor, and T1ce was used to evaluate the radiomic features on a different image contrast, considering the relatively homogeneous appearance of glial tumors in T2 compared with T1ce.
Segmentation
The glial tumors were manually segmented using 3D Slicer software v4.11. The pathological high signal intensity that appears in T2 and FLAIR sequences was used to segment the entire tumor volume. Readers were also free to use any of the four sequences available in the dataset to determine tumor borders (T1, T2, FLAIR, and T1ce). Figure 1 also illustrates the segmentation approach.
The segmentation process involved eight readers (three radiologists and five radiology residents), with two readers (one radiology specialist and one radiology resident) for each patient. All of the specialists worked in the neuroradiology division. Two of them had ≥3 years and one had ≥1 year of experience in neuroimaging as a specialist. During the study, all of the residents were in their second or third year in radiology and on their first neuroradiology rotation.
Preprocessing
All images were normalized to a scale of 100 based on the mean and standard deviation (SD) of voxel intensity values. To avoid negative values, the voxel arrays were shifted by 300.
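The normalization step above can be sketched as follows. This is a minimal illustration that assumes a z-score convention (subtract the mean, divide by the SD, multiply by the scale) followed by a constant shift; the function name and the synthetic input volume are hypothetical and do not come from the study's code.

```python
import numpy as np


def normalize_and_shift(voxels, scale=100.0, shift=300.0):
    """Z-score the intensities, multiply by `scale`, then add `shift`."""
    voxels = np.asarray(voxels, dtype=float)
    sd = voxels.std()
    if sd == 0:
        raise ValueError("constant image: SD is zero, cannot normalize")
    return (voxels - voxels.mean()) / sd * scale + shift


# Synthetic intensities standing in for an MRI volume (illustrative only).
rng = np.random.default_rng(0)
image = rng.normal(loc=1200.0, scale=350.0, size=(8, 8, 8))
normalized = normalize_and_shift(image)
```

Under this assumption, the output has a mean of 300 and an SD of 100, so any voxel lying within three SDs below the mean remains non-negative.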
Experiments were conducted by changing the discretization and resampling parameters. For discretization, two methods were considered: bin count (i.e., relative discretization) and bin width (i.e., absolute discretization). The following preprocessing parameters were used for bin count: 8, 16, 32, 64, and 128. For the bin width method, the following preprocessing settings were used for T1ce: 6, 13, 25, 50, and 100; for T2: 5, 11, 21, 42, and 84. The bin width values were determined based on the first-order range in the dataset to get an approximately equal number of gray levels compared with the bin count approach. When experimenting with the above-mentioned two discretization approaches, the resampling parameter was fixed to 1 x 1 x 1 mm3. For resampling, the physical voxel sizes were rescaled to 1 x 1 x 1, 1.25 x 1.25 x 1.25, 1.5 x 1.5 x 1.5, 1.75 x 1.75 x 1.75, and 2 x 2 x 2 mm3. When performing the resampling experiments, the discretization parameter was fixed to a bin count of 32.
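The difference between the two discretization methods, and the way a bin width can be matched to a bin count through the first-order range, can be sketched as below. The formulas follow the commonly used fixed bin width and fixed bin count definitions; the function names, the synthetic region of interest, and the range of 800 are assumptions chosen purely for illustration, not values reported by the study.

```python
import numpy as np


def discretize_bin_width(values, bin_width):
    """Absolute discretization: every bin spans a fixed intensity width."""
    values = np.asarray(values, dtype=float)
    lowest = np.floor(values.min() / bin_width)
    return (np.floor(values / bin_width) - lowest + 1).astype(int)


def discretize_bin_count(values, bin_count):
    """Relative discretization: the ROI range is split into a fixed number of bins."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        return np.ones(values.shape, dtype=int)
    levels = np.floor(bin_count * (values - lo) / (hi - lo)).astype(int) + 1
    return np.minimum(levels, bin_count)  # the maximum value falls into the last bin


def matched_bin_width(first_order_range, bin_count):
    """Bin width that yields roughly `bin_count` gray levels over the given range."""
    return first_order_range / bin_count


# Hypothetical first-order range (assumption, not taken from the study).
assumed_range = 800.0
widths = {n: matched_bin_width(assumed_range, n) for n in (8, 16, 32, 64, 128)}

# Synthetic ROI intensities spanning the assumed range.
roi = np.linspace(0.0, assumed_range, 1001)
levels_width = discretize_bin_width(roi, widths[32])
levels_count = discretize_bin_count(roi, 32)
```

With the bin width method the number of gray levels follows the intensity range of each region of interest, whereas the bin count method always produces the same number of levels regardless of that range.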
Feature extraction
Three-dimensional radiomic features, including shape and texture, were extracted in batch mode using the PyRadiomics open-source software environment (PyRadiomics v3.0.1; NumPy v1.23.5; SimpleITK v2.3.0; PyWavelets v1.4.1; Python 3.10.12).31 The total number of features in each sequence was 1,106. Original, Laplacian of Gaussian (LoG)-filtered, and wavelet-transformed images were used in the feature extraction. The LoG filtering was performed with sigma values of 2, 4, and 6 mm, corresponding to fine, medium, and coarse patterns. The main feature classes were shape, first order, gray-level co-occurrence matrix, gray-level size zone matrix, gray-level run-length matrix, gray-level dependence matrix, and neighboring gray-tone difference matrix.
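The set of experiments per sequence (five bin count, five bin width, and five resampling settings) could be organized as settings dictionaries along the following lines. This is only a sketch: the key names are assumed to mirror PyRadiomics customization settings, the helper functions are hypothetical, and the study's actual parameter files are not reproduced here.

```python
# Setting names are assumed to mirror PyRadiomics customization keys;
# the study's actual parameter files are not reproduced here.
BASE_SETTINGS = {"normalize": True, "normalizeScale": 100, "voxelArrayShift": 300}
BIN_COUNTS = [8, 16, 32, 64, 128]
BIN_WIDTHS = {"T1ce": [6, 13, 25, 50, 100], "T2": [5, 11, 21, 42, 84]}
RESAMPLED_SPACINGS_MM = [1.0, 1.25, 1.5, 1.75, 2.0]
IMAGE_TYPES = {"Original": {}, "LoG": {"sigma": [2.0, 4.0, 6.0]}, "Wavelet": {}}


def isotropic(spacing_mm):
    """Return an isotropic three-axis voxel spacing."""
    return [spacing_mm] * 3


def build_experiments(sequence):
    """Return the 15 settings dictionaries for one sequence ('T1ce' or 'T2')."""
    experiments = []
    # Discretization with bin count; resampling fixed to 1 x 1 x 1 mm3.
    for count in BIN_COUNTS:
        experiments.append({**BASE_SETTINGS, "binCount": count,
                            "resampledPixelSpacing": isotropic(1.0)})
    # Discretization with bin width; resampling fixed to 1 x 1 x 1 mm3.
    for width in BIN_WIDTHS[sequence]:
        experiments.append({**BASE_SETTINGS, "binWidth": width,
                            "resampledPixelSpacing": isotropic(1.0)})
    # Resampling; discretization fixed to a bin count of 32.
    for spacing in RESAMPLED_SPACINGS_MM:
        experiments.append({**BASE_SETTINGS, "binCount": 32,
                            "resampledPixelSpacing": isotropic(spacing)})
    return experiments


t1ce_experiments = build_experiments("T1ce")
t2_experiments = build_experiments("T2")
```

In PyRadiomics, a dictionary of this kind would typically be passed as keyword settings to `RadiomicsFeatureExtractor`, with the image types enabled separately; those calls are left out so that the sketch runs without the library installed.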
Statistical analysis
The R v4.3 (rstatix v0.7.2) and Python v3.7 (pingouin v0.5.2) software packages were utilized to conduct statistical analyses. To measure feature reproducibility, the intraclass correlation coefficient (ICC) was estimated based on two-way random effects, absolute agreement, and single measurement, under the Shrout and Fleiss convention.32 The interpretation scale for the ICC was as follows: ICC <0.50, poor; 0.50≤ ICC <0.75, moderate; 0.75≤ ICC <0.90, good; and ICC ≥0.90, excellent.33 Two thresholds (0.75 and 0.90) were used to report the percentages of reproducible features. The normality of the ICC values was determined using the Shapiro–Wilk test. Depending on the group distributions, paired tests, notably the one-way repeated measures analysis of variance (ANOVA) and the Student's t-test, were used to evaluate statistical differences in continuous variables for all and pair-wise comparisons, respectively. McNemar’s test was utilized to compare the distribution of categorical variables (i.e., reproducible vs. non-reproducible features based on ICC cut-off values). Statistical results were considered significant if P values were ≤0.05. Multiple comparisons were subjected to multiplicity correction using the Tukey test or Bonferroni correction as appropriate. In these comparisons, statistical significance was determined based on adjusted P values for the Tukey test and on unadjusted but corrected P values for the Bonferroni correction.
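The ICC form described above (two-way random effects, absolute agreement, single measurement) can be computed from the mean squares of a two-way layout. The following is a minimal NumPy sketch rather than the pingouin routine used in the study; the function names and the small ratings table are illustrative assumptions.

```python
import numpy as np


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    `ratings` has one row per subject (e.g., patient) and one column per reader.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between readers
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def reproducibility_label(icc):
    """Map an ICC estimate to the interpretation scale used in the text."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"


# Illustrative ratings: 6 subjects (rows) scored by 4 raters (columns).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
example_icc = icc_2_1(example)
example_label = reproducibility_label(example_icc)

# Two readers in perfect agreement on three subjects.
perfect_icc = icc_2_1([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
```

In the setting of this study, each feature would have one such table with 50 rows (patients) and 2 columns (readers), and the share of features with an ICC at or above 0.75 or 0.90 gives the reported percentages.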