ABSTRACT
PURPOSE
To develop the REporting checklist for FoundatIon and large laNguagE models (REFINE), an international reporting guideline for transparent and reproducible reporting of foundation model (FM) and large language model (LLM) studies in medical research, including imaging artificial intelligence (AI) applications.
METHODS
The protocol was prespecified and publicly archived. A modified Delphi process was conducted to establish reporting standards for unimodal and multimodal FM and LLM applications involving text, imaging, and structured data. The steering committee coordinated protocol development, expert recruitment, all Delphi rounds, and the harmonization phase. Decisions were made based on predefined consensus thresholds. In Rounds 1 and 2, structured ratings and free-text feedback informed iterative revisions. In the post-Delphi harmonization phase, terminology was standardized, and detailed reporting instructions were finalized.
RESULTS
The REFINE development group comprised 57 contributors from 17 countries, and 54 panelists from 16 countries completed Rounds 1 and 2. The harmonization phase was completed by three expert panelists and the steering committee. The entire process produced a 44-item, six-section framework with standardized terminology and detailed reporting instructions, supported by an online platform for practical use (https://refinechecklist.github.io/refine/checklist.html).
CONCLUSION
The REFINE provides a comprehensive, consensus-based reporting standard for medical FM and LLM research, including imaging AI studies. The online version facilitates practical implementation.
CLINICAL SIGNIFICANCE
The REFINE enables transparent, comparable, and reproducible reporting of FM and LLM studies, supporting reliable evidence synthesis in medical and imaging-focused AI studies.
Main points
• The REporting checklist for FoundatIon and large laNguagE models (REFINE) is an international Delphi-based reporting guideline for studies that use foundation models (FMs) and large language models (LLMs) in medical research.
• The guideline covers six domains: model specification, prompt design, stochasticity control, dataset integrity, output evaluation, and implementation.
• The REFINE items capture critical risks and dependencies inherent to FMs and LLMs that are not fully addressed in previous reporting frameworks.
• The REFINE is supported by an open, easy-to-use, and multifunctional online platform (https://refinechecklist.github.io/refine/checklist.html).
• Using the REFINE can improve the transparency, reproducibility, and critical appraisal of FM and LLM studies for all key stakeholders, including authors, reviewers, and journal editors.
The rapid integration of foundation models (FMs) and large language models (LLMs) into medicine, ranging from complex diagnostics to patient triage,1, 2 is outpacing the scientific community’s capacity to conduct rigorous evaluation. This mismatch raises concerns that are amplified by the opaque and stochastic behavior of these systems, which limits the applicability of traditional reporting guidelines and deepens the challenge of ensuring reproducibility.
Although several meta-analyses have evaluated LLMs in healthcare, their reliability is limited by fragmented and inconsistent reporting.3-7 The lack of standardized methodologies and reporting practices, combined with the proprietary black-box nature of these systems, makes comparison of findings challenging.3, 7, 8
FMs and LLMs require distinct reporting standards because their behavior depends on factors that are largely not captured in traditional checklists. These include sensitivity to prompting strategies,9-11 training dataset specification (e.g., knowledge cutoffs),12 and the stochastic nature of output generation (e.g., influenced by temperature).13, 14 Furthermore, the scale of these models requires stronger governance regarding intended use, safety, and bias.15
To address these gaps, this paper introduces the REporting checklist for FoundatIon and large laNguagE models (REFINE) in medical research (Figure 1). The REFINE is a consensus-based checklist that provides clear, item-level guidance to support rigorous reporting and critical appraisal of FM- and LLM-based generative artificial intelligence (AI) studies in medical research, including imaging-focused studies.
Methods
Study design
The REFINE was developed using a modified Delphi process. A steering committee (IM, TAD, and BK) developed the protocol and initial set of items, coordinated panel recruitment, and conducted all Delphi rounds and the harmonization phase.
The prespecified protocol, including voting rules, consensus thresholds, and round closure criteria, was deposited on the Open Science Framework before recruitment and was followed without significant deviation; it is publicly accessible.16
Scope definition
The steering group defined the scope to develop reporting standards for FMs and LLMs in medical research. Both unimodal and multimodal applications, including text-only, imaging, and structured data studies, are within the scope. The principal intended users of the REFINE are researchers who design, conduct, report, and assess studies involving these models, including authors, reviewers, and editors across medical fields.
Initial item development
First, a review of the relevant literature, including guidelines and methodological works, was conducted.17-28 Based on this review, an initial item set was drafted, refined for clarity, and organized into distinct sections. This initial item set was used for Round 1.
Panel selection and recruitment
Experts were selected to ensure broad representation across clinical imaging, machine learning, FM and LLM development, medical informatics, methodology, and editorial domains. Invitations were sent directly via email and briefly outlined the aims of the REFINE, the Delphi process, and the co-authorship criteria. Email addresses were used strictly for recruitment and were not linked to survey response data to ensure anonymity.
Anonymity and consent
Each panelist received a unique code to maintain anonymity during voting. These codes enabled tracking of participation while keeping individual responses anonymous. Consent was implied through the entry of the code and the submission of responses. No email addresses were collected. Responses were stored securely and used exclusively for the REFINE project.
Consensus criteria and decision rules
Panelists rated each item as “keep as is,” “keep with modification,” “remove,” or “unsure.” “Unsure” responses did not count toward consensus. Consensus to keep an item required at least 75% of panelists selecting either “keep as is” or “keep with modification.” If one-third or more of these votes indicated “keep with modification,” the item was revised according to panelists’ comments. Consensus to remove an item required at least 75% of panelists selecting “remove.” Items without consensus, as well as those meeting the keep threshold but exceeding the modification threshold, were revised and re-rated in the next round. Items still lacking consensus after Round 2 were removed.
New items were added if proposed by at least two panelists or by one panelist with steering group approval.
Free-text comments were collected for each item, each section, and at the end of Rounds 1 and 2 to inform potential item revisions.
In all other procedural decisions, the steering committee acted by majority vote.
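The item-level decision rules described above can be expressed as a simple classification procedure. The following sketch is purely illustrative and is not the code used in the study; it encodes the prespecified thresholds (75% keep or remove consensus, with one-third of keep votes triggering modification) for a single item:

```python
def classify_item(keep_as_is: int, keep_mod: int, remove: int, unsure: int) -> str:
    """Illustrative sketch of the REFINE Delphi decision rules for one item.

    "Unsure" responses are excluded from the consensus denominator,
    per the prespecified protocol.
    """
    counted = keep_as_is + keep_mod + remove
    if counted == 0:
        return "no consensus"  # only "unsure" votes: carried to the next round
    keep_votes = keep_as_is + keep_mod
    if keep_votes / counted >= 0.75:
        # One-third or more of the keep votes asking for modification
        # triggers revision based on panelists' comments.
        if keep_mod / keep_votes >= 1 / 3:
            return "keep with modification"
        return "keep as is"
    if remove / counted >= 0.75:
        return "remove"
    return "no consensus"  # revised and re-rated in the next round
```

Items classified as "keep with modification" or "no consensus" were revised and re-rated; after Round 2, items still lacking consensus were removed.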
Modified Delphi procedure
Stage 1 (preparation)
The steering group refined the initial items and section structure and tested the survey internally before distributing it to the panelists.
Stage 2 (voting rounds and harmonization phase)
Round 1 (the first formal Delphi round): All items were presented to the entire panel via Google Forms. Panelists provided ratings and free-text comments. The round remained open for 2 weeks, with extensions permitted to maintain adequate participation.
Round 2 (the second formal Delphi round): Items that did not reach consensus, items that reached consensus but required revision based on Round 1 feedback, and any newly proposed items were re-rated. In this round, panelists were also asked to indicate which response options the final checklist should include: i) Yes, No, and N/A or ii) Yes, Partial, No, and N/A. The round remained open for another 2 weeks, with extensions permitted to maintain adequate participation.
Post-Delphi harmonization phase: Following Round 2, the steering committee drafted reporting instructions for each item and invited a small expert group (CB, KB, and RC) from the panel to review them and provide revisions when needed. Under the direction of the steering committee, this group resolved remaining issues, finalized item placement and wording, and established standardized terminology through discussion. This stage produced the final checklist. This phase took place in Google Docs and remained open for 2 weeks.
Statistical analysis
Responses were summarized using descriptive statistics, including proportions meeting the prespecified consensus thresholds. No additional complex statistical analyses were required.
Results
Expert panel characteristics and participation
A total of 55 experts were invited, of whom 54 participated in the Delphi voting rounds, representing 16 countries and multiple disciplines. Including the three steering committee members, the REFINE development group comprised 57 contributors from 17 countries. The combined group composition reflects a high concentration of expertise in radiology-driven AI (68%) and participants predominantly from Germany and the United States (51%), as detailed in Figures 2 and 3.
In Round 1, 54 panelists submitted complete ratings. In Round 2, the same 54 panelists participated. No withdrawals occurred while the rounds were open.
Item evolution
The initial draft included 39 items across five sections. In Round 1, all items met the consensus threshold. Three exceeded the modification threshold and required re-voting; one of these was split into two, yielding four items for re-evaluation. Panel feedback also led to editorial refinements and several new item proposals.
Round 2 evaluated 13 items in total: the four re-evaluation items and nine new proposals. A new section was added, and items were reassigned accordingly. All 13 items achieved consensus, followed by further editorial adjustments and expanded instructional text.
Across the rounds, some consensus items were split into distinct items or combined into a single item to improve clarity.
The harmonization phase finalized the checklist structure, item names and wording, and detailed reporting instructions while maintaining the six-section framework established in Round 2.
Terminology and definitions established and used in the REFINE
To reduce ambiguity in the reporting of FMs and LLMs, the steering committee and the selected expert group established a set of standardized terms during the harmonization phase. These terms describe key stages of model development and evaluation. The standardized terminology is presented in Table 1.
Final REFINE structure
The final REFINE checklist contains 44 items across six sections (model specification, prompt design, stochasticity control, dataset integrity, output evaluation, and implementation). Table 2 provides the complete REFINE checklist. Figure 4 summarizes the consensus statistics for all finalized REFINE items.
Each item includes concise but detailed reporting instructions to support consistent reporting. These instructions clarify intent and provide practical guidance for authors. Table 3 presents the full set of item-level reporting instructions.
The response set used in the final checklist (Yes, Partial, No, and N/A) reflects the preference expressed by the absolute majority of panelists during Round 2.
Web version of the REFINE
A mobile-compatible online version of the REFINE is available at https://refinechecklist.github.io/refine/checklist.html. This version is practical to use and is the recommended format. It integrates the content presented in Tables 2 and 3 by linking each item to its reporting instructions through a tooltip. The online version also provides a real-time summary of completion by section and overall completion. Users can print the checklist to PDF for submission along with their manuscript, export the data as an Excel table for use in systematic reviews, and download the summary statistics image for presentation of their research. Figure 5 illustrates the main functionalities of the web version of the REFINE.
Discussion
Principal findings
In this study, we developed the REFINE, a consensus-based reporting guideline designed to address the opacity and heterogeneity of FMs and LLMs in medical research. Unlike general AI reporting guidelines, the REFINE explicitly targets sources of variability and risks unique to generative AI, spanning model specification, prompt design, stochasticity control, dataset integrity, output evaluation, and implementation. By grounding the checklist in a formal international Delphi consensus process, the REFINE provides a pragmatic standard to improve the quality, consistency, and reproducibility of this rapidly evolving field. Although the consensus panel included strong representation from imaging-related disciplines, the resulting checklist items, particularly those governing prompt engineering, stochasticity control, and dataset contamination, address fundamental properties of FMs and LLMs that apply to text-only, multimodal, and imaging workflows alike.
Relation to existing guidelines
The REFINE is designed to complement established EQUATOR-aligned guidelines. Frameworks such as CLAIM,29, 30 CONSORT-AI,31 TRIPOD-AI,32 and STARD-AI33 provide a robust foundation for study design, participant selection, reference standards, and performance metrics but were developed before the widespread adoption of generative AI. Consequently, they offer limited coverage of several characteristics specific to FMs and LLMs, such as stochasticity and prompt engineering.
Recent efforts have emerged to address this reporting gap.17-19,34-37 The TRIPOD-LLM framework extends TRIPOD-AI using a modular checklist to cover model development and evaluation, specifically within the context of diagnostic and prognostic prediction models.18 Similarly, MI-CLEAR-LLM establishes minimum reporting items for accuracy reports in healthcare, with a specific focus on handling stochasticity, prompt syntax transparency, and model access modes.17, 37 To accommodate varying levels of technical depth, the DEAL checklist introduces dual pathways, one for advanced model development and another for off-the-shelf applications.19
Other initiatives target specific use cases or ethical dimensions. The CHART statement focuses on studies evaluating chatbot health advice, emphasizing query strategies and prompt engineering for clinical advice summarization;34 CANGARU addresses the ethical use and disclosure of generative AI tools within the academic writing and publishing process itself.36 Additionally, CRAFT-MD provides a framework specifically for evaluating conversational reasoning through simulated doctor–patient interactions rather than a general reporting structure for study methodology.35
The REFINE distinguishes itself within this ecosystem by integrating technical reproducibility with broader implementation governance. Although guidelines such as MI-CLEAR-LLM focus on the details of accuracy testing (e.g., temperature settings, prompt syntax) to some extent, the REFINE expands these requirements and extends them across the full study lifecycle, mandating reporting on dataset integrity (e.g., contamination risks, representational bias) and clinical implementation (e.g., workflow integration, failure analysis, and safety protocols). Thus, the REFINE serves as a comprehensive standard for documenting both the generative parameters and the clinical reliability of FM and LLM studies.
The REFINE is also intended to be used alongside other AI reporting tools. For example, a randomized trial involving an LLM would report the trial design using CONSORT-AI and the model methodology using the REFINE.
Contributions of the REFINE
The REFINE introduces critical reporting requirements that address the non-deterministic nature of generative AI. First, it mandates detailed reporting of model specifications. Unlike traditional algorithms, models with similar names may differ considerably due to access configuration, quantization, tooling, and safety alignment layers, all of which determine validity and generalizability.38-43 Second, the REFINE requires explicit documentation of prompt engineering protocols with the same rigor as code in deterministic algorithms, including the specific context provided. Third, it enforces detailed reporting of generation parameters (e.g., temperature, top-p), which can significantly reshape output distributions and are critical for model performance and reproducibility.13, 24, 44, 45 Without these, identical models may produce divergent outputs, rendering a study irreproducible. Fourth, the REFINE addresses dataset integrity by assessing the risk of contamination (i.e., overlap between evaluation datasets and the model’s pretraining corpus), which is a major challenge for fairly evaluating FM and LLM performance.46, 47 Fifth, the REFINE emphasizes structured reporting of interaction style, session memory, tool use, retrieval-augmented generation, and multimodal integration, which are central to modern FM and LLM applications. Finally, the REFINE incorporates implementation-focused items, requiring authors to report monitoring for misuse and failure modes specific to clinical workflows.
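As a concrete illustration of the generation-parameter reporting described above, a study's supplementary material could include a machine-readable record of the settings used. The sketch below is hypothetical: the field names and values are examples of what such a record might contain, not REFINE item identifiers or a required schema.

```python
import json

# Hypothetical record of generation settings for an LLM study;
# field names and values are illustrative examples only.
generation_config = {
    "model": "example-llm-v1",   # model identifier, including version
    "access_mode": "API",        # e.g., hosted API vs. locally run weights
    "temperature": 0.0,          # low values reduce output variability
    "top_p": 1.0,                # nucleus sampling threshold
    "max_output_tokens": 512,
    "seed": 42,                  # if the provider supports seeded decoding
    "runs_per_prompt": 3,        # repeated runs to gauge stochastic variability
}

# Emit the record for inclusion in supplementary material.
print(json.dumps(generation_config, indent=2))
```

Publishing such a record alongside the exact prompts allows readers to reconstruct the generation conditions, which is precisely what the stochasticity-control items of the REFINE target.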
Practical use and implementation
The REFINE serves as a comprehensive, practical tool for multiple stakeholders. For authors, the core checklist acts as a prospective design aid to ensure key elements are considered during study planning, whereas detailed item instructions support manuscript preparation. For reviewers and editors, the REFINE can serve as a structured appraisal tool to systematically evaluate methodological transparency, reducing reliance on individual familiarity with rapidly evolving technical details. It can also help identify specific gaps that limit interpretability or reproducibility.
The REFINE has the potential to be adopted and reinforced at the level of journals, conferences, and professional societies. We propose that journals integrate the REFINE into their author instructions and editorial policies to normalize the use of these standards. Endorsement by major bodies or societies may facilitate broader adoption.
Strengths and limitations
The REFINE has several notable strengths. First, it was developed by an international and multidisciplinary panel, which supports its applicability across settings. Second, the checklist was developed through a predefined and transparent Delphi process with explicit consensus thresholds and decision rules, thereby reducing the risk of bias. Third, the availability of a user-friendly online platform further facilitates practical and consistent use. Fourth, the REFINE is applicable across diverse study designs; the inclusion of an “N/A” option functions as a deliberate filtering mechanism, allowing investigators to exclude non-applicable items without penalizing overall checklist completion.
The REFINE also has several limitations. Although the panel was international and multidisciplinary, its composition may still introduce bias, including a predominance of imaging experts and an underrepresentation of certain geographies, specialties, and stakeholder groups. Consequently, some domain-specific reporting needs, particularly those outside imaging-intensive disciplines or resource-rich healthcare contexts, may not be fully captured. Furthermore, although the checklist was developed via expert consensus, formal pilot testing with external users to validate usability was not conducted before release. In addition, the modified Delphi process, though systematic, remains dependent on subjective judgments. Finally, the REFINE was developed in the context of rapidly evolving FM and LLM technologies, regulatory expectations, and clinical use cases. The checklist, therefore, reflects the best available knowledge but requires adaptation and updates as model capabilities evolve.
Future directions and planned updates
We plan to update the REFINE through a formal re-evaluation of its items every 2 years, guided by feedback from users and the community, developments in related reporting standards, and emerging evidence on FM and LLM deployment in healthcare. In parallel, future work may also explore domain-specific extensions or modular add-ons such as radiology-focused variants, imaging-intensive implementations, text-only clinical documentation modules, and decision-support modules while preserving a common core.
An additional priority is to evaluate the uptake, usability, and impact of the REFINE in practice. This may include surveys or qualitative studies of authors, reviewers, and editors; bibliometric analyses of reporting quality before and after journal endorsement; and targeted audits of FM and LLM studies using the REFINE. These evaluations will help identify challenging items, clarify where further guidance is needed, and determine how the REFINE can best support transparent and high-quality reporting as the field evolves.
Final remarks
The integration of FMs and LLMs into medicine demands reporting standards that match their complexity, risks, and clinical implications. Without rigorous documentation, evidence generated from these systems will remain difficult to trust and reproduce. The REFINE directly addresses this gap by providing a consensus-built framework that clarifies what must be documented. Its adoption offers a practical foundation for transparent, reproducible, and ultimately trustworthy medical AI research.