Radiomics facilitates the extraction of vast quantities of quantitative data from medical images, which can substantially aid in several diagnostic and prognostic tasks.1 Although numerous studies have demonstrated promising results with this approach, its integration into clinical practice remains limited, necessitating additional validation for clinical application.2 A major barrier to this integration is the lack of standardization of key stages in the complex multi-step radiomic pipeline,3 which could be assessed and enhanced through structured guidelines and quality assessment tools.4-7
In 2017, Lambin et al.8 introduced the radiomics quality score (RQS) as a methodological assessment tool to enhance the quality of radiomics studies. The RQS comprises 16 items that evaluate the entire lifecycle of radiomics research, with a total raw score ranging from −8 to +36. Although the rationale for the scores assigned to each item remains unclear, the radiomics research community has widely adopted this tool since its introduction, leading to numerous systematic reviews.9 The success of the RQS within the research community also signifies a strong desire for standardization in radiomics, despite its limitations.
Recently, new consensus guidelines specific to radiomics research, namely, the CheckList for EvaluAtion of Radiomics Research (CLEAR) and the METhodological RadiomICs Score (METRICS), have been introduced and endorsed by leading imaging societies.6, 7 CLEAR aims to promote transparent reporting practices, whereas METRICS provides a standardized tool for assessing the methodological quality of radiomics research. METRICS includes 30 items spread over five conditions, designed to accommodate almost all potential methodological scenarios in radiomics research, from traditional handcrafted methods to advanced deep-learning computer vision models.6 The development process for METRICS involved a modified Delphi method and a broad international panel to mitigate bias and focus on specific aspects of radiomics research related to medical imaging. The European Society of Medical Imaging Informatics has endorsed the METRICS tool, and its website offers an online calculator for the final quality score, which also considers item conditionality (available online at https://metricsscore.github.io/metrics/METRICS.html).6
Published in 2024,6 METRICS is just beginning its journey, and its differences from the RQS have not yet been fully explored; clarifying them could offer valuable insights for the radiomics community. Therefore, we aimed to compare METRICS and the RQS through hypothetical examples, focusing on the unique or missing items of each quality scoring tool. For this comparison, an ideal hypothetical study was defined as one achieving a score of 100% with one tool and then being assessed with the other tool, and vice versa. For simplicity, all conditions of METRICS were deemed fulfilled (i.e., scored as “yes”) in both comparisons. To establish a baseline, we assumed that a perfect study meets only the minimum requirements of the quality scoring tool in question (either RQS or METRICS) needed to attain the highest possible score. This assumption allowed us to estimate the probable lower boundary of the highest potential score achievable with the alternative tool. Following the conventions in the literature and the recommendations of its developers, the RQS percentage score was calculated by dividing the total points by 36 and multiplying by 100, as illustrated below. We also examined the scaling method used for the RQS in the literature compared with that of METRICS.
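Purely as a worked illustration of this convention (using the 15-point raw total reported for the METRICS-based scenario in the next paragraph, not a formula prescribed verbatim by either tool), the conversion reads:

$$\mathrm{RQS}_{\%}=\frac{\text{total points}}{36}\times 100,\qquad \text{e.g., }\frac{15}{36}\times 100\approx 42\%.$$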
The upper panels of Figure 1 clearly depict a comparison of final quality scores using alternative tools in these hypothetical scenarios. A hypothetical perfect study based on RQS could only achieve a 30% score, which means it lacks up to 70% of the total METRICS percentage score. Conversely, a hypothetical perfect study based on METRICS could reach a 42% score, thus missing 58% of the potential RQS percentage score. Notably, the hypothetical perfect study based on METRICS achieved a higher score in the RQS (42% or 15 total points) compared with the study based on RQS (METRICS: 30%). In the scenario where the perfect study adheres to RQS standards (i.e., RQS: 100%), the requirements for 20 of the 30 items (67%) were not fully met in the METRICS tool. Conversely, in the scenario where METRICS is the standard (i.e., METRICS: 100%), 12 of the 16 (75%) RQS items were not satisfied. Of these, 9 had no direct counterpart in the other tool, whereas the remaining 3 were only partially covered. The lower panels of Figure 1 provide further details about the item-wise comparison in these hypothetical scenarios. Additionally, the items missed in the alternative tools are comprehensively listed in Table 1.
In a perfect study based on RQS, the METRICS evaluation revealed numerous missing items that span almost all sections of the tool, with some sections completely lacking coverage: “study design,” “segmentation,” “image processing and feature extraction,” and “preparation for modeling.” The “study design” section of METRICS places substantial emphasis on transparent reporting practices and encourages adherence to specific guidelines tailored to radiomics, such as CLEAR.7 These METRICS items also highlight crucial aspects of any experimental setup, including the accurate reporting of patient eligibility criteria and reference standards. The “segmentation” section emphasizes the important but often overlooked nuances of data labeling methodology. These include the formal evaluation of fully automatic segmentation (when employed) and the clinical applicability of the segmentation methodology. Specifically, if masks are required for the test set to simulate real-world inference, they should mirror what would reasonably be expected in this context (i.e., produced by a single reader or automated software). “Image processing and feature extraction” considers standardization initiatives such as the Image Biomarker Standardization Initiative, as well as the transparency and appropriateness of settings used in data preprocessing and feature extraction.5 The items in “preparation for modeling” address key sources of bias, such as proper data partitioning to prevent information leakage during model development and the handling of confounders. Importantly, missed items extend beyond these sections. For instance, METRICS emphasizes the importance of model availability in the “open science” section, which is critical for validating proposed approaches with new data, ideally from a diverse source.
In the same vein, METRICS does not address several RQS items. Although theoretically feasible to fulfill, certain RQS items, such as “phantom study,” “multiple time points,” “biological correlates,” and “prospective study,” may be considered too abstract or of too little practical relevance to warrant systematic inclusion in every radiomics study.10 Interestingly, the “prospective study” item was initially considered and voted on during the development of METRICS but failed to reach the consensus threshold for inclusion in the final scoring tool. Likewise, other items were proposed by participants during the METRICS development phase but were excluded from the final tool following the open and anonymous discussions held throughout the Delphi process, indicating a general consensus on their limited utility. Additional METRICS and RQS items not discussed here are listed in Table 1.
Although METRICS presents the final score as a percentage value with linear scaling, the RQS does not advocate a linear method for converting total RQS points to a percentage. A re-analysis of the papers included in the seminal study by Spadarella et al.,9 which covered 44 systematic reviews using the RQS, revealed that 32 used non-linear scaling (i.e., total points/36 × 100) and none used linear scaling (i.e., [total points + 8]/44 × 100). Despite questions about the appropriateness of the non-linear conversion method, this practice follows the developers’ suggestion (i.e., 36 points = 100%).8 This method of calculation does not account for negative values in scaling: both −8 and 0 points correspond to 0%, potentially overestimating the scores of studies with negative RQS totals. It could also create the impression that, without “feature reduction or adjustment for multiple testing” and “validation,” the remaining methodological points count for nothing until an overall positive score is achieved, possibly underestimating the quality of studies on the percentage scale. Figure 2 compares RQS percentage calculations obtained with the widely used non-linear method and with the linear method (upper panel) and illustrates the impact of choosing the former over the latter (lower panel). This simulation demonstrates that the non-linear method tends to underestimate the final RQS percentage, with a mean, standard deviation, and maximum difference of −8.9%, 5.4%, and 18%, respectively. A minimal sketch of the two conversions is given below.
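The following is a minimal, hypothetical sketch (not the simulation code behind Figure 2) that compares the two conversion formulas described above across the full range of raw RQS totals; in the non-linear variant, negative totals are floored at 0%, reflecting the convention that both −8 and 0 points map to 0%:

```python
# Minimal sketch comparing the two RQS percentage conversions discussed above.
# Not the original simulation code; raw totals are simply swept over the RQS range.

def rqs_percent_nonlinear(total_points: float) -> float:
    """Widely used convention: total points / 36 * 100, with negative totals
    floored at 0% (so both -8 and 0 raw points map to 0%)."""
    return max(total_points, 0.0) / 36 * 100

def rqs_percent_linear(total_points: float) -> float:
    """Linear rescaling of the full -8..+36 range onto 0..100%."""
    return (total_points + 8) / 44 * 100

if __name__ == "__main__":
    print(f"{'raw':>4}  {'non-linear %':>12}  {'linear %':>8}  {'difference':>10}")
    for raw in range(-8, 37, 4):
        nl = rqs_percent_nonlinear(raw)
        lin = rqs_percent_linear(raw)
        print(f"{raw:>4}  {nl:>12.1f}  {lin:>8.1f}  {nl - lin:>10.1f}")
```

In this sweep, the non-linear percentage never exceeds the linear one (the two are equal only at −8 and +36 points), which is consistent with the systematic underestimation described above.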
In this brief article, we aimed to draw the scientific community’s attention to the differences between two quality scoring tools for radiomics research, specifically the recently published METRICS and the well-established RQS. Given the absence of an independent reference standard, which would provide invaluable additional insights, we relied on hypothetical perfect studies to evaluate these tools’ relative value and content. Although this approach was hypothetical, it underscored the distinct focus of each tool on different aspects of the radiomic pipeline, given the substantial disparity in relative scores and missed items. Therefore, a direct comparison of the scores from these tools is not feasible, and researchers should consider the unique features of each tool. Based on the insights from this analysis and the emerging limitations regarding the reproducibility and accuracy of the RQS percentage score,9, 10 METRICS may be the preferable choice if only one tool is to be used.
Conflict of interest disclosure
Burak Koçak, MD, and Tugba Akinci D’Antonoli are Section Editors in Diagnostic and Interventional Radiology. They had no involvement in the peer review of this article and had no access to information regarding its peer review. Burak Koçak, Tugba Akinci D’Antonoli, and Renato Cuocolo took part in the development of METRICS.