ABSTRACT
PURPOSE
To comprehensively assess Checklist for Artificial Intelligence in Medical Imaging (CLAIM) adherence in medical imaging artificial intelligence (AI) literature by aggregating data from previous systematic and non-systematic reviews.
METHODS
A systematic search of PubMed, Scopus, and Google Scholar identified reviews using the CLAIM to evaluate medical imaging AI studies. Reviews were analyzed at two levels: review level (33 reviews; 1,458 studies) and study level (421 unique studies from 15 reviews). The CLAIM adherence metrics (scores and compliance rates), baseline characteristics, factors influencing adherence, and critiques of the CLAIM were analyzed.
RESULTS
A review-level analysis of 26 reviews (874 studies) found a weighted mean CLAIM score of 25 [standard deviation (SD): 4] and a median of 26 [interquartile range (IQR): 8; 25th–75th percentiles: 20–28]. In a separate review-level analysis involving 18 reviews (993 studies), the weighted mean CLAIM compliance was 63% (SD: 11%), with a median of 66% (IQR: 4%; 25th–75th percentiles: 63%–67%). A study-level analysis of 421 unique studies published between 1997 and 2024 found a median CLAIM score of 26 (IQR: 6; 25th–75th percentiles: 23–29) and a median compliance of 68% (IQR: 16%; 25th–75th percentiles: 59%–75%). Adherence was independently associated with the journal impact factor quartile, publication year, and specific radiology subfields. After guideline publication, CLAIM compliance improved (P = 0.004). Multiple readers provided an evaluation in 85% (28/33) of reviews, but only 11% (3/28) included a reliability analysis. An item-wise evaluation identified 11 underreported items (missing in ≥50% of studies). Among the 10 identified critiques, the most common were item inapplicability to diverse study types and subjective interpretations of fulfillment.
CONCLUSION
Our two-level analysis revealed considerable reporting gaps, underreported items, factors related to adherence, and common CLAIM critiques, providing actionable insights for researchers and journals to improve transparency, reproducibility, and reporting quality in AI studies.
CLINICAL SIGNIFICANCE
By combining data from systematic and non-systematic reviews on CLAIM adherence, our comprehensive findings may serve as targets to help researchers and journals improve transparency, reproducibility, and reporting quality in AI studies.
Main points
• To our knowledge, no prior research has synthesized data from published reviews on Checklist for Artificial Intelligence in Medical Imaging (CLAIM) adherence, leaving a gap in providing a comprehensive overview independent of disease, technique, or journal.
• Our two-level analysis identified significant reporting gaps in the medical imaging artificial intelligence literature, with a third of CLAIM items omitted, on average.
• Eleven specific CLAIM items were identified as being consistently underreported in the majority of studies, highlighting critical areas for improvement.
• Factors such as the publication year, journal impact quartile, and the radiology subfield influenced CLAIM adherence.
• Reviews assessing CLAIM adherence exhibited variability in their methodologies, with some using scoring and others focusing on compliance, leading to inconsistencies in evaluation and reporting.
With the exponential increase in artificial intelligence (AI) publications related to medical imaging,1 ensuring transparency and reproducibility has become crucial for advancing the field and integrating AI into clinical practice.2-4 To address these needs, various AI-focused reporting guidelines have been introduced,5-7 one of which is the Checklist for Artificial Intelligence in Medical Imaging (CLAIM).8 Published in March 2020, the CLAIM was designed to improve reporting clarity and scientific communication in medical imaging AI.8 Inspired by the Standards for Reporting of Diagnostic Accuracy Studies (STARD) guidelines,9 the original 2020 version of the CLAIM featured a 42-item checklist to help authors and reviewers achieve clear, comprehensive, and reproducible reporting in AI studies. In May 2024, an updated CLAIM was published following a formal Delphi process, refining the checklist to 44 items to address new challenges and developments while retaining the original structure.10 The update included refinements to terminology and revisions to some items. The CLAIM is part of the EQUATOR network, a central hub for reporting guidelines.11
Since its release, the CLAIM has gained widespread attention across multiple medical specialties involving imaging and AI, with over 850 citations in Google Scholar as of January 2025. Despite its popularity, assessments of CLAIM adherence remain highly variable,12-14 often with particular focus on specific diseases,15-18 techniques,19-21 or individual journals.22 A comprehensive assessment of CLAIM adherence across these diverse studies is notably lacking. Such an analysis, previously applied to frameworks such as the Radiomics Quality Score (RQS),23 would reveal the CLAIM’s overall adherence patterns, highlight underreported items, and provide guidance for future revisions beyond the 2024 CLAIM update,10 along with the development of new, alternative AI checklists.
This study aims to comprehensively assess CLAIM adherence in the medical imaging AI literature published to date using a two-level approach: review level and study level. The review-level analysis aggregates data from previous systematic and non-systematic reviews, whereas the study-level analysis examines unique individual papers within these reviews, mostly focusing on checklist items. Furthermore, factors influencing high or low CLAIM adherence are examined at the study level. Finally, critiques of the CLAIM guidelines are systematically analyzed across eligible reviews for both levels.
Methods
Literature search and screening
A literature search was conducted through PubMed, Scopus, and Google Scholar to identify reviews on the application of the CLAIM8 using the syntax “Checklist for Artificial Intelligence in Medical Imaging.” The final search was performed on August 6, 2024. Since the search syntax was simple, we did not use advanced database features to target specific fields (e.g., title, abstract, or keywords). Instead, we used the general search box, which typically searches across all fields in the database entries.
For Google Scholar, the first 100 results were screened based on the filter setting “relevance,” whereas all entries were reviewed in the other two databases. Google Scholar can provide valuable additions to systematic reviews, even when screening is limited to the top 100 results.24 Because its “relevance”-based ranking typically prioritizes the most pertinent articles, this approach was chosen to manage the large volume of results often retrieved from Google Scholar, many of which include duplicates or less relevant entries. Notably, Google Scholar was treated as a supplementary source to mitigate the risk of missing key papers, complementing the more comprehensive searches conducted in PubMed and Scopus, where all entries were reviewed.
Three readers (F.K., A.K., and A.S.; all 3rd- or 4th-year radiology residents) initially screened all records to identify review articles evaluating medical imaging AI studies using the CLAIM (2020 version).8 Records were excluded if they lacked a CLAIM evaluation (2020 version),8 full-text access, or relevance to medical imaging; relied on self-reported data; or had significant overlap with another study. Each reader cross-checked another reader’s results.
Duplicates were removed using Zotero software. The full-text articles and available supplements were downloaded for evaluation by the same three readers, who divided the workload equally. For articles whose full text was unavailable through our institutional libraries, we contacted the authors directly to request access.
Eligibility
After the initial screening, articles were evaluated for eligibility by the same three readers under the supervision of a radiology specialist experienced in informatics and AI (B.K.). For the review-level analysis, reviews with adequate adherence data on the 42-item CLAIM were included; those with incomplete or unclear data were excluded. For the study-level analysis, only reviews with 42-item CLAIM data for each study (i.e., a completed checklist for each study) were included. Duplicate and retracted studies, along with the studies with unclear references to their source articles, were removed. Papers using a modified 42-item CLAIM with subsections that retained the main framework were included in the study-level analysis but excluded from the review-level analysis unless CLAIM adherence could be evaluated at that level.
Analyzing data at the individual study level was crucial to gain item-level insights as well as several other baseline characteristics, as this level of granularity could not have been achieved through a review-level-only analysis. Although we acknowledge the potential limitations of using a highly selected sample, this approach was necessary to address the study’s objectives and provide meaningful insights at the desired level of detail.
Data extraction
For the review-level analysis, data extraction was initially performed by a radiology specialist experienced in informatics and AI (B.K.) and was subsequently confirmed by another radiology specialist (M.K.). Extracted data included the review’s scope, radiology subfield, number of studies (or evaluations) in the reviews, online publication year, number of readers, reader independence, decision-making methods, reproducibility analysis, consideration of non-applicable (n/a) items in the adherence evaluation, CLAIM adherence evaluation method, and source of the CLAIM evaluation.
For the study-level analysis, the three radiology residents independently extracted and cross-checked the data. The cross-checking was performed by having the readers review and validate one another’s work. In cases of disagreement, an experienced reader (B.K.) was consulted to resolve the issue. Extracted information included the journal name, publication year, publication type, journal scope and focus, radiology subfield (expanded from the review-level data), journal’s h5-index (from Google Scholar Metrics), 2023 impact factor quartile (2024 release; Journal Citation Reports, Clarivate Analytics, Web of Science Group), and CLAIM adherence by item.
Full-text articles, including the text, figures, tables, and supplements, were reviewed to identify adherence data, including item-specific CLAIM data, organized according to the original item order, if necessary. For adherence data sourced from the reviews, only studies with a clear source attribution were included. In cases of multiple rater evaluations, consensus data were prioritized; if unavailable, one evaluation (the first) was selected. In the study-level analysis, only one assessment per study was included when multiple pipelines were assessed, whereas all assessments were considered in the review-level analysis; these assessments are referred to as “studies” in this research. For studies using a modified CLAIM with subsections within a 42-item framework, an item was considered reported if ≥50% of its subitems were positively evaluated. Partially reported items were classified as reported, in alignment with the common standard checklist format (i.e., reported, not reported, and not applicable).
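As an illustration of this decision rule, a minimal R sketch is shown below; the function name and subitem codings are hypothetical and are not taken from the authors’ extraction sheets.

```r
# Hypothetical illustration of the >=50% rule for modified CLAIM items with
# subsections: an item counts as "reported" when at least half of its subitems
# were positively evaluated.
item_reported <- function(subitems) {
  mean(subitems) >= 0.5  # subitems: logical vector of subitem evaluations
}

item_reported(c(TRUE, FALSE, TRUE))          # 2/3 positive -> reported
item_reported(c(TRUE, FALSE, FALSE, FALSE))  # 1/4 positive -> not reported
```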
Two radiology specialists with experience in informatics and AI (B.K. and İ.M.) evaluated the review papers in both the review-level and study-level analyses for critiques about the CLAIM. The PDFs were then screened using Google’s NotebookLM tool, with various targeted prompts to identify additional critiques and to minimize the risk of missing important ones. The results from this additional screening were double-checked by both readers, verified against their sources, and integrated with the initial human evaluation findings.
Adherence metrics
This study applied two commonly used CLAIM adherence metrics: the CLAIM score and CLAIM compliance. The CLAIM score represents the total number of reported items, whereas CLAIM compliance is calculated as the percentage of reported items relative to the total applicable CLAIM items.
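As a minimal, hypothetical illustration of how the two metrics relate (the function and item codings below are not the authors’ code):

```r
# CLAIM score = number of reported items;
# CLAIM compliance = reported items as a percentage of applicable items.
claim_metrics <- function(item_status) {  # values: "reported", "not_reported", "na"
  reported   <- sum(item_status == "reported")
  applicable <- sum(item_status != "na")
  list(claim_score      = reported,
       claim_compliance = 100 * reported / applicable)
}

# Example: 42 items with 27 reported, 12 not reported, and 3 not applicable
status <- c(rep("reported", 27), rep("not_reported", 12), rep("na", 3))
claim_metrics(status)  # score = 27; compliance = 100 * 27 / 39 ≈ 69%
```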
For the study-level analysis, these two metrics were calculated directly from the extracted item-level data. In the review-level analysis, mean metrics were used as reported when directly provided; otherwise, they were derived from tables, figures, or supplementary files where possible, converted from the median and interquartile range (IQR) according to the methods proposed by Luo et al.25 and Wan et al.26, or computed as weighted combinations when adherence was presented by category.
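For orientation, the sketch below uses the common large-sample approximations for recovering a mean and SD from a reported median and quartiles; the published formulas of Luo et al.25 and Wan et al.26 additionally adjust for sample size, so this is a simplified illustration only, with hypothetical numbers.

```r
# Simplified large-sample approximations: mean ~ (q1 + median + q3) / 3,
# SD ~ IQR / 1.35; the sample-size-adjusted formulas were used in the study.
approx_mean_sd <- function(q1, med, q3) {
  list(mean = (q1 + med + q3) / 3,
       sd   = (q3 - q1) / 1.35)
}
approx_mean_sd(q1 = 20, med = 26, q3 = 28)

# Weighted combination when adherence is reported by category:
weighted.mean(c(24, 27), w = c(30, 70))  # category means weighted by study counts
```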
Statistical analysis
Statistical analysis was conducted using R (main packages: ggstatsplot and Hmisc) and JASP (version 0.19.1; Apple Silicon). Descriptive statistics, including frequency, percentage, mean, standard deviation (SD), median, IQR, and 25th–75th percentiles, were reported based on variable distribution. In the review-level analysis, adherence metrics were weighted by the number of studies or evaluations using the “Hmisc” R package and presented using both the mean and median without considering statistical normality. For the study-level data, normality was tested with the Shapiro–Wilk test, and the associated statistical results are presented accordingly. In addition, differences in continuous variables between two groups were assessed using the Mann–Whitney U test or Student’s t-test based on distribution. The Kruskal–Wallis test was applied to compare multiple categories, with Dunn’s post-hoc tests and the Bonferroni correction. Correlations were assessed with Spearman’s rho. Univariable and multivariable logistic regression analyses were performed to identify potential factors related to high and low CLAIM adherence metrics, dichotomized at the median. No multiplicity correction was performed in the logistic regression analyses due to the exploratory nature of the study. Statistical significance was set at P < 0.05.
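For concreteness, a minimal sketch of the weighted descriptive statistics and the median-dichotomized logistic regression is shown below; the data frames, column names, and toy values are hypothetical and do not reproduce the study dataset.

```r
library(Hmisc)
set.seed(1)

# Review-level adherence weighted by the number of studies per review (toy data)
reviews <- data.frame(mean_score = c(22, 26, 28), n_studies = c(40, 15, 60))
wtd.mean(reviews$mean_score, weights = reviews$n_studies)
wtd.quantile(reviews$mean_score, weights = reviews$n_studies,
             probs = c(0.25, 0.50, 0.75))

# Study-level: dichotomize compliance at the median, then logistic regression
studies <- data.frame(
  compliance  = runif(100, 40, 90),  # % of applicable items reported
  pub_year    = sample(2018:2024, 100, replace = TRUE),
  if_quartile = factor(sample(c("Q1", "Q2", "Q3", "Q4"), 100, replace = TRUE))
)
studies$high_compliance <- as.integer(studies$compliance >= median(studies$compliance))
fit <- glm(high_compliance ~ pub_year + if_quartile, family = binomial, data = studies)
summary(fit)
```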
Results
Literature search
Figure 1 summarizes the eligibility process. Ultimately, 33 eligible reviews encompassing 1,458 study evaluations were included in the review-level analysis. For the study-level analysis, 15 reviews (13 from the previous set and 2 additional reviews) were included, covering 421 unique eligible studies. In total, 35 reviews met the eligibility criteria across the two levels of analysis (Table 1).12-22,27-50 The final dataset used in this study is publicly available from the Open Science Framework and can be accessed via the following link: https://osf.io/rx67y/
Baseline characteristics of papers eligible for the review-level analysis
The baseline characteristics of the 33 papers included in the review-level analysis are summarized in Table 2.
Multiple readers conducted CLAIM evaluations in 85% of reviews (28/33), with most assessments (79%, 22/28) performed independently and finalized by consensus (82%, 23/28). A reliability analysis was included in only a few multi-reader studies (11%, 3/28). One study reported an intraclass correlation coefficient (ICC) above 0.87 for inter-observer reliability across task categories.46 Another study found an ICC of 0.815 for inter-observer reliability, with varying kappa values for individual items.14 A third study reported an intra-observer repeatability coefficient of 0.22, lower (i.e., better) than those of all but one of the other checklists evaluated.31
Figure 2 highlights the consideration of item applicability in the included reviews, along with the resultant metrics from this study. Regarding CLAIM adherence, 55% (18/33) of reviews considered the applicability of items, allowing for the calculation of a CLAIM compliance metric. For approximately 79% (26/33) of the reviews, appropriate data to calculate CLAIM scores were available, although the origin of the scores varied, with only 36% (12/33) providing direct reports.
Adherence based on the review-level analysis
Among the 26 reviews with available CLAIM scores, encompassing 874 studies, the weighted mean CLAIM score was 25 (SD: 4), and the weighted median was 26 (IQR: 8; 25th–75th percentiles: 20–28). For the 18 reviews providing CLAIM compliance data, covering 993 studies, the weighted mean CLAIM compliance was 63% (SD: 11%), with a weighted median of 66% (IQR: 4%; 25th–75th percentiles: 63%–67%).
Baseline characteristics of papers eligible for the study-level analysis
The baseline characteristics of the papers included in the study-level analysis are summarized in Table 3. Publication dates ranged from 1997 to 2024.
Adherence based on the study-level analysis
In the study-level analysis of 421 unique studies, the median CLAIM score was 26 (IQR: 6; 25th–75th percentiles: 23–29), and the median CLAIM compliance was 68% (IQR: 16%; 25th–75th percentiles: 59%–75%). Notably, 11% of the studies (47/421) had a CLAIM score of <21 (i.e., 50% of 42), whereas 10% (40/421) reported a CLAIM compliance of <50%.
Figure 3 illustrates the median CLAIM scores and compliance by journal and publication volume. Among the top 10 journals by publication volume, Radiology had the highest median CLAIM score and compliance rate.
Table 4 presents the results from the univariable and multivariable logistic regression analyses to identify factors linked to high and low CLAIM adherence. In the univariable analysis, the publication year, specific radiology subfields, journal h5-index, and certain impact factor quartiles were associated with the CLAIM score or compliance. In the multivariable analysis, the publication year and impact factor quartile emerged as independent predictors of the CLAIM score and compliance. Specifically, publishing in a first quartile (Q1) journal independently predicted higher CLAIM scores and compliance, whereas second quartile (Q2) journals were associated with higher CLAIM compliance. Certain radiology subfields were additional independent predictors of the CLAIM score.
Figure 4a, b illustrate the correlation between the publication year and CLAIM adherence. Although the CLAIM score did not significantly correlate with the publication year (rho: 0.076, P = 0.117), CLAIM compliance showed a weak but significant positive correlation (rho: 0.119, P = 0.015). Although the CLAIM score did not significantly differ between the pre- and post-CLAIM guideline publication periods (P = 0.153), CLAIM compliance was higher post-publication (P = 0.004) (Figure 4c, d). However, neither the CLAIM score (rho: −0.027, P = 0.697) nor compliance (rho: −0.062, P = 0.365) was statistically significantly correlated with the publication year after the CLAIM guideline publication in 2020.
The CLAIM scores and compliance varied significantly across radiology subfields (P < 0.001 for both), with post-hoc pairwise comparisons showing that the cardiovascular subfield had consistently distinct results compared with others (Figure 5).
The CLAIM scores and compliance also differed by impact factor quartile (P < 0.001 for CLAIM score; P = 0.002 for CLAIM compliance) (Figure 6). The post-hoc analysis revealed that journals in Q1 and Q2 had significantly higher CLAIM scores than non-Web of Science indexed journals or publication platforms. However, CLAIM compliance did not show significant pairwise differences across quartiles.
Moreover, the CLAIM scores and compliance did not differ statistically significantly among publication types, such as journal articles, pre-prints, and conference papers (P > 0.05).
The item-wise CLAIM adherence is presented in Figure 7. Notably, three items were rated n/a in ≥50% of the papers: item#10 (selection of data subsets, if applicable), item#21 (the level at which partitions are disjoint, e.g., image, study, patient, institution), and item#27 (ensemble techniques, if applicable).
Considering the applicability of the items, the following 11 items were not reported in ≥50% of the papers (i.e., compliance of <50%): item#12 (de-identification methods), item#13 (how missing data were handled), item#19 (intended sample size and how it was determined), item#29 (statistical measures of significance and uncertainty, e.g., confidence intervals), item#31 (methods for explainability or interpretability and how they were validated), item#33 (flow of participants or cases, using a diagram to indicate inclusion and exclusion), item#34 (demographic and clinical characteristics of cases in each partition), item#36 (estimates of diagnostic accuracy and their precision), item#37 (failure analysis of incorrectly classified cases), item#40 (registration number and name of registry), and item#41 (where the full study protocol can be accessed). Figure 8 further highlights the above-mentioned 11 items categorized into three domains: data handling and description, model evaluation and performance, and open science.
The item-wise correlation results between reporting status and publication year are presented in Table 5, both for the entire period (pre- and post-publication of the CLAIM) and for the post-publication period alone. Considering the entire period, a positive weak-to-moderate and statistically significant reporting trend (rho ≥0.2) was observed for item#19 (intended sample size and how it was determined), item#21 (level at which partitions are disjoint), item#31 (methods for explainability or interpretability and how they were validated), item#33 (flow of participants or cases, using a diagram to indicate inclusion and exclusion), and item#42 (sources of funding and other support; role of funders). Moreover, a negative weak-to-moderate reporting trend (rho ≤−0.2) was observed for item#11 (definitions of data elements, with references to common data elements), item#15 (rationale for choosing the reference standard), item#17 (annotation tools), item#18 (measurement of inter- and intra-rater variability), and item#39 (implications for practice, including the intended use and/or clinical role). Considering the post-publication period, a positive weak-to-moderate reporting trend (rho ≥0.2) was observed in item#10 (selection of data subsets), item#19 (intended sample size and how it was determined), and item#33 (flow of participants or cases, using a diagram to indicate inclusion and exclusion). In addition, a negative weak-to-moderate reporting trend (rho ≤−0.2) was observed for item#9 (data pre-processing steps) and item#39 (implications for practice, including the intended use and/or clinical role).
Critiques in reviews eligible for the entire study
In analyzing the 35 reviews that applied the CLAIM, we identified 10 key critiques, which we organized into 7 categories: fulfillment, applicability, feasibility and practicality, structure, interpretation, relative importance, and scoring. The most common critique was the inapplicability of certain items to all study types. Another frequent issue was the subjective nature of deciding whether an item was sufficiently reported. Table 6 presents all the critiques along with their representative source articles.
Discussion
Main findings and related implications
This study comprehensively evaluated CLAIM adherence in the medical imaging AI literature through a two-level approach: review- and study-level analyses. Considering both analyses, on average, one-third of CLAIM items were inadequately reported, indicating room for improvement in adhering to reporting guidelines. Since adherence was independently assessed rather than self-reported, efforts to improve compliance should focus on improving awareness and engagement among researchers in terms of transparent reporting practices through guidelines. Notwithstanding their well-known benefits,51 recent meta-research shows that radiology, nuclear medicine, and medical imaging journals rarely mandate AI-specific guidelines, despite the CLAIM being the most recommended.52, 53 Journals can actively endorse and promote the CLAIM8 and its updates10 to improve reporting quality and transparency while ensuring proper checklist usage with auditing practices.54, 55
Our correlation analysis revealed a very weak but positive trend between CLAIM compliance and publication year. Although compliance was significantly higher in the post-publication period, the year-by-year trend within that period was not statistically significant. Long-term follow-up studies on checklists such as STARD have demonstrated slow but significant improvements in research reporting quality over time.56 Although a similar trend was observed in our analysis, more time and data are needed to better understand this progression and assess the CLAIM’s true impact.
We observed that adherence assessments in reviews often lacked consistency due to the absence of standardized methods. We identified two primary approaches, the CLAIM score and CLAIM compliance (%), which differ in whether item applicability is considered. To improve comparability and fairness in the evaluation of adherence, we strongly recommend prioritizing the CLAIM compliance rate over the CLAIM score in future evaluations. The compliance rate accounts for the applicability of individual items, which can vary between studies, thereby providing a more accurate and equitable assessment. Moreover, this approach could be formally recommended or mandated by the developers in future versions of the CLAIM to ensure consistent and standardized adherence evaluations.
Publication year, impact factor quartile, and radiology subfields were key independent predictors of high or low CLAIM adherence. Studies in higher-impact journals (Q1 and Q2) showed stronger adherence, underscoring their role in setting transparent reporting standards and enabling rigorous peer review. However, it should be acknowledged that high-quality research can also be published in lower-impact journals, and high-impact journals are not immune to poor-quality research. Factors contributing to stronger adherence in higher-impact journals may include stricter editorial and peer-review processes, greater visibility of reporting guidelines in these journals, and, potentially, a higher familiarity of authors with these standards. In this respect, encouraging CLAIM adoption, particularly in lower-impact journals, could help enhance reporting transparency and reproducibility. It is important to note, however, that these observations are based on assumptions and warrant further investigation.
In addition, certain subfields, such as cardiovascular imaging, exhibited unique adherence patterns, reflecting differences in the maturity of AI reporting practices. These findings may indicate the need for specific strategies to improve CLAIM adherence across diverse medical imaging subfields and ensure consistent reporting standards throughout the discipline. Further research may be required to investigate whether unique adherence patterns in certain subfields, such as cardiovascular imaging, could be influenced by the contribution of specific authors or research groups.
Eleven items were underreported in ≥50% of studies: de-identification methods (item#12), missing data handling (item#13), sample size determination (item#19), statistical significance and uncertainty (item#29), explainability methods (item#31), participant flow (item#33), demographic data (item#34), diagnostic accuracy estimates (item#36), failure analysis (item#37), registration details (item#40), and protocol access (item#41). This suggests challenges in fulfilling the CLAIM requirements, possibly due to inadequate knowledge, training, resource limitations, or the perceived irrelevance of certain items for specific study types. Interestingly, several of these items reflect broader challenges in AI research, such as securing adequate sample sizes, addressing uncertainty, enhancing model explainability to avoid the “black-box” problem, and promoting principles of open science, even if not explicitly stated. These 11 items, therefore, warrant particular attention when preparing AI manuscripts to improve the overall reporting transparency and rigor of AI research in medical imaging.
From the 35 eligible reviews, several key critiques were identified, including concerns about the inapplicability of certain items to all study types and the subjective nature of reporting decisions. Although the CLAIM 2024 update has addressed applicability by introducing three checklist options and leaving judgment to the evaluators,10 subjective interpretation remains a significant issue. Notably, our analysis showed that CLAIM evaluations involved multiple readers in 85% of reviews, but only 11% assessed evaluation reliability, revealing a critical gap. Although the reported reproducibility was high, such assessments require improved experimental designs to thoroughly investigate interpretation-related issues, as previously done for the RQS.57 Additionally, leveraging automated tools, such as those powered by large language models used for the RQS,58 may help reduce subjectivity and improve consistency.
Based on the other critiques identified, future versions of the CLAIM can also be improved by simplifying definitions and improving clarity, removing subjective items based on reproducibility studies with rigorous analysis, and providing holistic guidance for interpreting manuscripts alongside their code. Additional improvements could include prioritizing items by assigning weights through evidence-based voting methods and developing user-friendly online tools, similar to the METhodological RadiomICs Score (METRICS),59 for an adherence assessment that considers item applicability. These refinements would help streamline CLAIM evaluations and improve their utility for the medical imaging community.
Previous studies
To the best of our knowledge, no research has yet been conducted to evaluate CLAIM adherence by synthesizing data from both systematic and non-systematic reviews, providing a comprehensive overview of the topic. However, similar efforts have been made in the field of radiomics research,23, 60, 61 particularly with the RQS,62 which is widely regarded as the standard for assessing the methodological quality of radiomics studies, although recent alternatives have emerged.59
In 2023, Spadarella et al.60, who first published their research online in 2022, conducted a review-level analysis of 44 reviews. They reported a median RQS of 21%. Later, in late 2024, Kocak et al.23 deepened the analysis by performing a study-level analysis of 1,574 unique papers from 89 reviews, finding a median RQS of 31%. In 2025, in a coincidental and independent study continuing the earlier work of Spadarella et al.60, Barry et al.61 conducted a multi-level meta-analysis of 3,258 RQS assessments from 130 systematic reviews, reporting an overall mean RQS of 9.4 ± 6.4 (95% confidence interval, 9.1–9.6) [26.1% ± 17.8% (25.3%–26.7%)]. It is important to note, however, that these RQS scores are not directly comparable to CLAIM adherence, as the two tools serve different purposes: the RQS assesses the methodological quality of radiomics research, whereas the CLAIM focuses on the reporting quality of medical imaging AI research.
Furthermore, our results can be compared with those reported in the studies synthesized for this research.12-22,27-50 In the review articles evaluated in the review-level analysis, the raw CLAIM scores ranged from 20 to 40, whereas the CLAIM adherence rates differed widely between 41% and 81%. This considerable variability underscores the inconsistent adherence to the CLAIM observed across the literature, highlighting the critical importance of our study in addressing these gaps.
Strengths and limitations
This study has several strengths with notable implications for evaluating AI reporting quality in medical imaging. First, integrating data from multiple reviews offers a comprehensive assessment, unlike topic-specific studies, and provides a generalizable understanding of reporting practices. Second, our two-step analysis delivers both a broad overview and detailed insights, enabling item-wise evaluation to pinpoint areas needing particular improvement. Third, we identified factors associated with CLAIM adherence, offering actionable insights for enhancing reporting standards. Fourth, we presented two adherence metrics (the CLAIM score and compliance), facilitating comparability with other studies and setting a benchmark for future research. Finally, our analysis of critiques from eligible reviews offers valuable feedback to guide future updates to the CLAIM guidelines beyond 2024 and new alternative AI checklists.10
Our study has several limitations that should be carefully considered when interpreting the results. First, this study was not registered (e.g., in PROSPERO). This decision was due to the unique nature of conducting a collective review of previous reviews of the CLAIM. Given the limited number of studies employing a similar strategy, and despite our group’s experience with other guidelines, the methodology required adaptations based on the challenges and limitations encountered during data collection and analysis. These evolving methodological adjustments made it difficult to provide a fully transparent outline of the approach at the outset. Second, this research was limited to three databases, PubMed, Scopus, and Google Scholar, which we selected based on their broad coverage and relevance to the field, according to our experience. However, we acknowledge that the inclusion of additional databases, such as Embase and Web of Science, could further improve the comprehensiveness of the search. Third, the assessment of reporting quality was based solely on the CLAIM (2020 version). In the future, other AI-specific reporting guidelines, such as CONSORT-AI and TRIPOD-AI, could be considered to provide a more comprehensive evaluation of reporting standards.63 Fourth, many articles were published before the CLAIM guidelines were introduced in 2020. However, the goal of this study was to highlight the overall state of reporting quality in the field, with some analyses covering both pre- and post-guideline periods. Fifth, our analysis focused solely on reporting quality and did not include evaluating the studies’ actual impact, such as citation counts; there may not yet have been sufficient time for recent studies to have accumulated citations for meaningful comparisons. Additionally, our study did not explore other factors that could affect the clinical translation of AI, such as methodological quality. Evaluating these factors may require supplementary tools, such as METRICS.59 Sixth, this study was conducted after the CLAIM 2024 update.10 Although the main framework of the original CLAIM8 was preserved, our findings could have better informed that update had they been available earlier; nevertheless, they can still aid future revisions and new guidelines. Seventh, the results of this study rely on prior systematic and non-systematic reviews as well as the expertise of the evaluators involved in those studies. Potentially limited familiarity with certain aspects of the CLAIM among those evaluators, as well as inconsistencies across reviews, may have influenced the findings of this study. Eighth, due to the lack of a standard checkbox format in the initial CLAIM, consideration of item applicability may vary among reviews, potentially influencing adherence results, although both the CLAIM score and CLAIM compliance were assessed in the two-level analysis. Ninth, extracting data from systematic reviews can be subjective and may vary depending on the readers’ experience. To minimize potential errors, we implemented a rigorous process involving the cross-checking of extracted data and resolving disagreements through consensus or by consulting an experienced reader, when necessary, at different stages of the study. Finally, the number of studies included in the study-level analysis was smaller than the number of studies represented in the review articles analyzed at the review level.
However, to gain item-level insights, it was essential to conduct the analysis at the individual study level, as this granularity could not have been achieved at the review level. The sample size for the study-level analysis was determined merely by the availability of data in the existing literature, which may have introduced some degree of bias. Therefore, the findings should be interpreted with this limitation in mind.
In conclusion, this study provides a comprehensive evaluation of CLAIM adherence in the medical imaging AI literature, revealing significant variability and highlighting areas for improvement. Our two-level analysis, encompassing review- and study-level data, identified substantial reporting gaps, with a third of checklist items often omitted. Factors such as publication year, journal impact quartiles, and subfield-specific differences emerged as key independent predictors of adherence, underscoring the role of high-impact journals and tailored strategies for different subfields. The CLAIM compliance rate was highlighted as a more objective and fairer metric for adherence assessment. Additionally, several important critiques of the CLAIM were identified, providing valuable insights for researchers and developers. We hope these findings serve as actionable guidance for the scientific community to enhance transparency, reproducibility, and reporting quality in AI studies.