ABSTRACT
Foundation models (FMs) represent a significant evolution in artificial intelligence (AI), impacting diverse fields. Within radiology, this evolution offers greater adaptability, multimodal integration, and improved generalizability compared with traditional narrow AI. Utilizing large-scale pre-training and efficient fine-tuning, FMs can support diverse applications, including image interpretation, report generation, integrative diagnostics combining imaging with clinical/laboratory data, and synthetic data creation, holding significant promise for advancements in precision medicine. However, clinical translation of FMs faces several substantial challenges. Key concerns include the inherent opacity of model decision-making processes, environmental and social sustainability issues, risks to data privacy, complex ethical considerations, such as bias and fairness, and navigating the uncertainty of regulatory frameworks. Moreover, rigorous validation is essential to address inherent stochasticity and the risk of hallucination. This international collaborative effort provides a comprehensive overview of the fundamentals, applications, opportunities, challenges, and prospects of FMs, aiming to guide their responsible and effective adoption in radiology and healthcare.
Main points
• Foundation models (FMs) are versatile artificial intelligence (AI) systems pre-trained on large, diverse datasets, enabling them to adapt to many tasks with minimal fine-tuning.
• FMs with multimodal capacities offer powerful tools for complex radiological applications, such as report generation and diagnostic decision-making.
• FMs have the potential to democratize AI in healthcare by requiring less local data for fine-tuning, helping under-resourced centers.
• Major challenges to FM use in imaging include stochasticity, hallucinated outputs, transparency, bias, sustainability, and regulations.
Artificial intelligence (AI), particularly deep learning (DL), has demonstrated considerable efficacy in medical image analysis across various imaging modalities.1, 2 Traditionally, however, AI models in healthcare have been mostly developed for narrow tasks that are highly specific and limited. The recent emergence of foundation models (FMs) represents a significant paradigm shift.3, 4 These large DL models exhibit broad adaptability to a wide range of downstream tasks with minimal task-specific modification.5, 6
A notable example of FMs is represented by large language models (LLMs), optimized for language-centric tasks, such as summarization, translation, and answering questions.7 Although LLMs primarily process text, the broader category of FMs can encompass multiple modalities, including text, images, audio, and a diverse spectrum of unstructured data.8, 9 This inherent multimodality aligns well with the diverse data types encountered in modern medicine, such as imaging, clinical narratives, laboratory results, and genomic information.10-12
Although current radiology workflows predominantly utilize task-specific models, the multimodal capabilities of FMs make them particularly promising for this field, offering potential support across various interpretative and non-interpretative scenarios (Figure 1).13, 14 The capabilities of LLMs have already been explored for several radiology-related tasks, including report generation,15 multilingual report translation,16 information extraction from free-text reports,17 and the assessment of domain-specific radiological knowledge.18 Despite growing interest, the use of FMs in radiology is still in the early stages, with ongoing active research and development.13, 14, 19-21
To facilitate the development and potential adoption of FMs, this narrative review synthesizes current knowledge about FMs and aims to provide a comprehensive overview of FMs in the context of radiology. It introduces the fundamental concepts behind FMs, examines their potential applications in radiology, highlights emerging opportunities, outlines key challenges, and suggests future directions in both research and practice.
Fundamental concepts of foundation models
FMs mark a fundamental shift within the conceptual hierarchy of AI (Figure 2), moving beyond conventional, narrowly focused AI systems. They are a class of large-scale AI models developed through training on vast and diverse datasets.3 A defining feature is their pre-trained nature; unlike conventional models engineered for a single, narrow task (e.g., solely lung nodule detection or lung segmentation), FMs serve as versatile base models (created through pre-training), adaptable to numerous downstream applications (through fine-tuning, i.e., continued training on smaller, task-specific datasets).
This inherent adaptability results from several key characteristics (Figure 3). First, the pre-training stage of FMs usually leverages self-supervised learning, allowing the model to learn rich data representations from the data itself (e.g., by solving pretext tasks, such as predicting masked portions of an image or a text) using unstructured, unlabeled, or weakly labeled data (Figure 4).22, 23 This contrasts sharply with conventional methods, which typically require significant amounts of high-quality (manually) labeled data for each distinct task, a major bottleneck due to cost and expert time. Although FMs still require labeled data for fine-tuning, the reliance on specific data for each application can be substantially reduced compared with conventional AI methods (Figure 5).
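To make the pretext-task idea concrete, the following toy sketch (plain Python; the report text, `MASK` token, and function name are invented for illustration, and real FMs mask image patches or subword tokens at vastly larger scale) shows how training pairs can be derived from unlabeled data alone, with the prediction targets taken directly from the data itself:

```python
import random

MASK = "[MASK]"

def make_masked_example(tokens, mask_frac=0.15, seed=0):
    """Build a self-supervised training pair from raw, unlabeled tokens.

    The targets are simply the hidden tokens themselves, so no manual
    annotation is needed: the data provides its own supervision.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_frac))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # target comes from the data itself
        masked[pos] = MASK
    return masked, targets

# Toy "report" standing in for unlabeled pre-training data
report = "no focal consolidation pleural effusion or pneumothorax".split()
masked, targets = make_masked_example(report)
```

During pre-training, the model would be optimized to recover `targets` from `masked`; the same principle applies when masked image patches, rather than words, must be reconstructed.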
Second, the reduced need for labeled data allows FM development on a large scale, referring to model size, computational resources, and dataset size.24 This scale enables the models to learn more generalizable and robust representations that, in turn, support the scalability of the FMs themselves (larger models become feasible with more training data, and more generalizable representations apply to more possible downstream applications). This differentiates FMs from conventional models, which often exhibit limited generalizability beyond the precise conditions (e.g., patient populations or tasks) for which they were trained. Additionally, scaling models has led to the emergence of functionalities beyond the explicit training objectives,3 such as instruction following, capabilities that were unprecedented (or at least hardly detectable) in smaller-scale models.25-27
Finally, self-supervised learning and the scale of FMs equip them with strong transfer learning capabilities.28-31 The general knowledge acquired during the resource-intensive pre-training phase can be effectively utilized for new, specific tasks through minimal fine-tuning. This facilitates few-shot learning (where only a small number of task-specific examples are provided) and zero-shot learning (using no examples),32-34 where models adapt with substantially less specific data than conventional approaches demand. For instance, an FM pre-trained via self-supervised learning on large chest X-ray datasets may be fine-tuned for rib fracture detection using only dozens of cases, whereas a conventional model may require thousands to reach comparable performance.
Developing multimodal foundation models
FMs first took shape in natural language processing (NLP) in the form of LLMs, such as Generative Pre-trained Transformer (GPT)-4 (OpenAI) and Claude (Anthropic). Although FMs can be unimodal, focusing exclusively on one data type, such as text (in the case of LLMs) or images,35, 36 a development of particular importance for radiology is their potential to be multimodal by being able to process and integrate diverse data types, including images [e.g., X-rays,37-40 computed tomography (CT),41, 42 and magnetic resonance imaging (MRI)43], text (e.g., reports and other electronic health record documents), and potentially many more (Figure 1).
Key concepts and modules of FMs concerning radiological applications are presented in Figure 6. Although architectures vary, the transformer design is a frequently used backbone.44 Its central feature, the attention mechanism, allows it to focus on specific elements of the input sequence. This enables the model to capture long-range dependencies and contextual relationships within data effectively, which gave rise to its initial success in NLP and subsequent adaptation for vision and multimodal scenarios.45, 46 A key concept in handling diverse inputs is the use of modality-specific encoders.47 These components compress high-dimensional inputs (such as CT scans or text reports) into lower-dimensional embeddings (i.e., numerical vector representations), capturing essential features (e.g., tissue density, anatomical structures, radiological terms). Common encoder architectures include vision transformers and convolutional neural networks for images, and transformers for processing text data. To enable the model to understand relationships across different data types, techniques such as contrastive learning are often employed during pre-training (Figure 7). For instance, the model learns that a specific chest X-ray and its corresponding report describe the same case. Model weights are adjusted so that the embeddings for a matching image–report pair are pulled closer together in a conceptual “shared space,” whereas embeddings for unrelated pairs (e.g., the same chest X-ray paired with a report from a different patient) are pushed further apart to learn meaningful cross-modal associations. The Contrastive Language–Image Pre-Training (CLIP) model is a dual neural network trained on a variety of image and text data pairs and is an early example of FMs created this way.45
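A minimal sketch of this contrastive objective, assuming toy two-dimensional embeddings (plain Python; real models use high-dimensional learned embeddings, large batches, and learnable temperature), shows how matched image-report pairs are rewarded relative to mismatched ones:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(img_embs, txt_embs, temperature=0.1):
    """CLIP-style loss: the i-th image should match the i-th report.

    For each image, a cross-entropy term pushes the matched report's
    similarity up and all mismatched reports' similarities down.
    """
    n = len(img_embs)
    loss = 0.0
    for i in range(n):
        sims = [cosine(img_embs[i], txt_embs[j]) / temperature
                for j in range(n)]
        log_z = math.log(sum(math.exp(s) for s in sims))
        loss += -(sims[i] - log_z)  # cross-entropy toward matched pair i
    return loss / n
```

When image and report embeddings are aligned (matching pairs most similar), this loss is near zero; shuffling the pairings increases it, which is exactly the training signal that pulls matched pairs together in the shared space.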
After the individual encoders have processed their respective inputs, fusion modules are used to combine this information, which can happen in several ways (Figure 8). Mechanisms such as cross-attention are particularly powerful here, allowing the model to dynamically weigh the relevance of different parts of one modality based on the content of another; for example, attending to specific words in a report when analyzing a corresponding slice in a CT scan.
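The cross-attention mechanism can be sketched as follows (a simplified, single-head version in plain Python with toy vectors; real implementations add learned query/key/value projections and multiple heads):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Each query (e.g., a report-token embedding) attends over keys/values
    from another modality (e.g., CT-patch embeddings).

    Output per query: an attention-weighted average of the values, where
    weights reflect query-key similarity (scaled dot-product attention).
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

A query that strongly matches one key receives an output dominated by that key's value, which is how a report token can "look at" the most relevant image patch.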
Finally, decoders transform these fused representations into desired outputs, which could range from generating text (e.g., report summaries) and predicting classes or outcomes to segmenting relevant image regions, depending on the specific application. Adapting existing FMs to a specific task can be achieved through full fine-tuning (updating all model parameters) or less computationally expensive parameter-efficient fine-tuning techniques that update only a small number of parameters (Figure 9).48
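One widely used parameter-efficient technique, low-rank adaptation (LoRA), keeps the pre-trained weight matrix frozen and trains only a small low-rank update. A minimal sketch (plain Python with toy dimensions; real implementations apply this inside attention layers of large models) illustrates the idea:

```python
def matmul(A, B):
    """Plain-Python matrix multiplication."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_forward(x, W, A, B, scale=1.0):
    """y = x @ (W + scale * A @ B).

    W (d_in x d_out) stays frozen; only the small factors A (d_in x r)
    and B (r x d_out) are updated, so the trainable parameter count is
    r * (d_in + d_out) instead of d_in * d_out for small rank r.
    """
    base = matmul(x, W)                 # frozen pre-trained path
    delta = matmul(matmul(x, A), B)     # trainable low-rank update
    return [[b + scale * d for b, d in zip(br, dr)]
            for br, dr in zip(base, delta)]
```

With `B` initialized to zeros, the adapted model starts out identical to the pre-trained one, and fine-tuning then nudges only the low-rank factors, which is what makes the approach far cheaper than full fine-tuning.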
These characteristics make FMs versatile, adaptable, and data-efficient AI models that can integrate multimodal data and capture long-range dependencies within high-dimensional data and across different modalities that may elude narrower models. This uniquely positions FMs to tackle complex problems from the field of radiology by providing potentially richer, more contextualized insights that reflect clinical reality better than conventional AI models.47
Applications in radiology
Creating a radiology-specific FM from scratch could be highly cost-intensive, as radiology consists of a wide range of imaging modalities, including X-rays, ultrasound, nuclear imaging, and MRI, that have significant variations in their underlying technologies and data characteristics.49 Nevertheless, recent advances have shown promising pathways to adapt or fine-tune general-purpose models for domain-specific tasks and modalities, leading to a growing number of successful applications in radiology.
One core application is medical image segmentation, which aims to delineate regions of interest, such as lesions or organs, automatically. DL models, especially those using the nnU-Net architecture, have shown high accuracy in normal anatomical segmentation tasks.50-52 However, the challenge arises when it comes to pathologies, where a separate model must be trained for each pathology: for example, a model trained for liver tumor segmentation cannot be directly applied to lung tumor or prostate cancer segmentation tasks.
In April 2023, the Segment Anything Model (SAM) was introduced, demonstrating the potential for a single model to handle various segmentation tasks across different domains without needing retraining or fine-tuning.53 Despite this, SAM’s performance on complex medical segmentation tasks, such as those involving the pancreas and spine, has not been satisfactory.51, 54, 55 Vision FMs, such as SAM, can serve as starting points, which are then adapted into modality/task-specific FMs (e.g., Medical SAM 2) optimized by leveraging the unique characteristics of medical imaging modalities.37, 56, 57
Another notable development is UniverSeg, a single task-agnostic model trained using a large and diverse set of open-access medical datasets.58 This model can generalize to anatomies and segmentation tasks not seen during training. Notably, UniverSeg significantly outperformed existing few-shot methods across all held-out datasets. However, it is important to note that the model was only applied to two-dimensional data and single-label segmentation, and its performance on three-dimensional medical image data remains unclear.
FMs have also shown promise for lesion identification and characterization in different clinical scenarios.19, 59 For example, CXRBase was developed using a large collection of unlabeled chest X-ray images through self-supervised learning.59 Self-supervised pre-training was applied sequentially, first to natural images from ImageNet-1k and then to chest X-ray images from various public datasets, encompassing a total of 1.1 million chest X-ray images. The model demonstrated good performance across multiple datasets from different centers for diagnosing diseases such as coronavirus disease 2019, pneumonia, and tuberculosis.
Radiology report generation and comprehension represent further promising areas for multimodal FMs. In the task of generating radiology reports, these models can identify abnormalities within images from various modalities while incorporating the patient’s medical history and clinical examination findings.60 By integrating both text and images, these models can generate precise radiology reports, help standardize report quality by detecting inconsistencies or omissions, and subsequently reduce the workload for radiologists.61 Additionally, these models can generate reports in multiple languages and adjust the complexity of the language to suit the target audience, providing detailed content for specialists and simplified versions for general practitioners.16, 62
For the comprehension task, physicians can also use multimodal models to enhance case comprehension by engaging in text-based dialogues that focus on specific image sections, allowing for detailed descriptions of those areas.63 Furthermore, the reports generated by FMs can offer preliminary diagnoses, supporting radiologists and clinicians in their decision-making process.64 These models can potentially propose treatment options or recommend additional diagnostic tests, enhancing the overall clinical workflow for personalized medicine.65
Opportunities for radiology
Building on the capabilities outlined above, the adoption of FMs in radiology presents several strategic opportunities (Figure 10).
A key opportunity lies in the ability to fine-tune pre-trained FMs with smaller, local datasets.66 This can reduce the inequalities related to the availability of data,67 democratizing access to these applications for healthcare systems with limited data or access to infrastructure. The lower reliance on large datasets can enable use in centers whose limited funding or small population coverage prevents the collection of large numbers of cases.
Fine-tuning the models with local datasets can mitigate biases related to underrepresented population characteristics or local peculiarities of equipment and radiological protocols. At the same time, leveraging their pre-training on large datasets, FMs trained on diverse populations can provide more equitable care recommendations, reducing diagnostic errors in underrepresented groups such as children, ethnic minorities, or patients with rare conditions.68
To further address data imbalance, techniques such as synthetic data generation can be used. FMs have the potential to create synthetic medical images, such as CT scans, MRIs, and X-rays, that resemble real-world data.69, 70 These artificially generated datasets can serve as valuable supplements to existing image collections, particularly when access to patient data is restricted due to privacy issues or limited availability. By generating variations of medical images, these models can help address imbalances in datasets, effectively representing a broader spectrum of pathologies.
The ability to build upon pre-trained backbones has the potential to shorten the innovation-to-implementation cycle significantly. Researchers can build on top of models trained to capture broad medical imaging features and clinical context, rather than creating new models from scratch.71 Reducing the duration of the innovation-to-implementation cycle can accelerate the development of novel applications, simplify cross-institutional collaborations, and allow innovations developed in academic settings to be rapidly tested and adapted in hospitals, startups, or public health agencies.
FMs also offer unique educational benefits. Automatically annotating synthetic or real images with detailed descriptions, such as the identification of lesions, tumors, or anatomical landmarks, can help radiology residents and healthcare professionals quickly understand complex images. Furthermore, FMs can enhance educational content by not only annotating images but also offering detailed explanations of pathologies, their clinical importance, and treatment options, creating an engaging and interactive learning experience.21
Moreover, patients will be empowered as their health data, such as their medical history and imaging results, could be translated into personalized, simplified explanations of their condition, treatment options, and possible outcomes with FMs, thus helping patients gain a clearer understanding of their specific health situation.72, 73
FMs introduce a new paradigm in precision medicine by enabling the integration of diverse data types, such as the combination of diagnostic images, omics, clinical, and laboratory data.10 Radiologists interpret images in light of clinical information and questions, and combine or harmonize these different types of information without much effort in their day-to-day jobs. However, existing AI applications have generally been less accurate than radiologists wherever such data harmonization was required. With the advance of FMs, there has been a significant leap towards the combination of multimodal data, enabling more accurate prognostication, risk stratification, and treatment planning.74
Beyond data integration, FMs also have the capacity to support complex diagnostic reasoning in uncertain or ambiguous clinical situations. In real clinical practice, radiologists often deal with cases that are not clear-cut, where the diagnosis is not obvious and decisions have to be made with incomplete information, taking follow-up data into account. FMs can perform a variety of tasks that help in these situations, offering insights that take the full clinical context into account rather than just giving a simple yes/no answer.47 Such models can be trained on a variety of tasks representing real diagnostic scenarios, in which the radiologist is presented with imaging examinations depicting a series of pathological conditions that require complex reasoning.
To provide a perspective on emerging opportunities, a comparative overview of conventional AI (single or multimodal), multimodal FMs, radiologist interpretation, and a combined radiologist-FM approach is presented in Table 1.
Challenges and risks for radiology
As discussed so far, although FMs hold promise for transforming radiology, they also introduce multifaceted challenges. Radiologists must remain aware of these issues and proactively address them to ensure the safe, ethical, and effective implementation of FMs in radiology (Figure 11).
One of the main concerns is the stochastic nature of these models, whereby their outputs may vary each time they are executed; another is that they can generate plausible-sounding yet incorrect or entirely fabricated information (a phenomenon known as “hallucination”).7 Beyond these inherent issues, FMs present several broader challenges, especially in radiology and healthcare in general. These include challenges related to sustainability, transparency, ethics, cybersecurity, privacy, standardization, and validation.
Another concern is sustainability and environmental impact, as the development and deployment of FMs are highly resource-intensive. They demand vast computational power, energy, and even water. For example, generating a single image with a generative AI model can consume the equivalent of half a smartphone’s battery charge,75 while producing 10–50 medium-length chatbot responses may require up to half a liter of fresh water (Figure 12).76 Since radiology already depends on energy-intensive imaging equipment, implementing these models in radiology could further exacerbate environmental burdens.77, 78 Addressing these issues and promoting sustainable practices are essential to reducing the environmental impact of FMs in radiology.
FMs often operate as “black boxes,” producing outputs without providing clear explanations of their reasoning processes.79 In radiology, transparency and explainability are critical. Diagnostic decisions made by radiologists must be evidence-based to guide treatment plans and ensure patient safety. If a radiologist cannot justify a diagnosis aided by these tools, their trust in this decision may be undermined. Implementing models with reasoning capabilities, such as OpenAI’s o1 series, or adopting frameworks that are designed to facilitate reasoning may help improve trust in model outputs.80
Moreover, FMs are prone to perpetuating or even amplifying biases present in the training data, which can contribute to healthcare disparities.67 In addition, unequal access to such technologies may further disadvantage under-resourced institutions.67 The legal framework surrounding AI in radiology is also still evolving, and questions about liability, especially in cases where diagnostic errors result from following or ignoring AI-aided recommendations, remain unresolved.81 Clear ethical guidelines and legal standards are needed to navigate these challenges responsibly.
Training FMs requires large datasets that may contain sensitive patient information. This raises substantial privacy concerns; therefore, ensuring rigorous data anonymization practices during the model training, as well as not using patient data directly as input during model implementation, is essential.79 FMs may also pose cybersecurity threats, and these tools could be exploited by malicious actors to extract sensitive patient data through techniques such as jailbreaking or to manipulate model output through techniques such as backdoor attacks.82, 83 Ensuring robust security protocols and continuous monitoring is essential to safeguard patient data and maintain trust when implementing these tools in radiology.
Evaluating the performance of FMs presents another set of challenges. Traditional metrics, such as accuracy or F1 score, may be inadequate for assessing generative outputs or for evaluating the quality of a model’s generations in the absence of a reference standard.47 Moreover, regulatory guidelines for ensuring the clinical safety and efficacy of these models are still in their infancy, and the fact that each country or region creates its own frameworks makes it harder to disseminate these tools (e.g., the AI Act across Europe and the Food and Drug Administration medical device law across the USA).84 Rigorous validation processes and international regulatory alignment are necessary to overcome this hurdle. Beyond the aforementioned challenges, foundation models introduce additional risks. For example, over-reliance on AI tools may lead to the deskilling of radiologists, weakening their ability to critically assess AI-aided recommendations.85 Ongoing education and training for radiologists are essential to mitigate deskilling and ensure appropriate use of these technologies.
Prospects
As previously discussed, attention-based FMs have represented a technological leap in AI capabilities, with a wide range of potential applications in medical imaging. However, it should be noted that AI research is in continuous development, and even as LLMs and FMs are just starting to be employed in the radiology domain, novel technologies are already aiming to complement or substitute current architectures and improve upon their performance, alignment, and other limitations.
Among DL developments, state space models and recurrent neural networks currently represent promising architectures in the context of FMs, although the latter is a relatively mature technology.86-88 Both methods, with different specific implementations, including hybrid approaches,89 incorporate recursive computation and representation of longer data sequences compared with “pure” transformer-based models. Furthermore, as is often the case in AI, these represent, at least in part, implementations of concepts initially theorized decades ago,90 which have found new applications due to an increase in computational power and data availability.
Independent of the chosen neural network architecture, in the future, radiologists should expect (and increase their demand for) greater use of open-source software in the setting of generative AI and FMs. At the moment, this domain is largely dominated by proprietary (i.e., closed) technologies, which obfuscate the data used to train these tools, the network’s architecture, and the specific weights stored within the trained model. As previously mentioned, this lack of transparency limits the implementation of FMs in healthcare and runs contrary to the principles contained within the European Union’s AI regulatory framework.91 Nevertheless, high-performance and large-scale FMs that are also open-source are already available, with Meta’s LLaMA being the most well-known. On the other hand, the increase in transparency afforded by open-source software comes with different considerations and potential tradeoffs. Although open-sourcing the model itself does not inherently compromise the privacy of the original training data (which are usually kept separate) or the input data, ensuring model security, preventing model misuse, and mitigating potential risks such as cyberattacks require careful governance; this may represent a significant issue in sensitive contexts, such as medical imaging. Furthermore, although open source does not represent an outright impediment to patenting, it does make it more challenging to protect the technology behind a medical device and to allow a company to extract the economic value necessary to justify the large-scale investments required to develop such devices and the models running in the backend. This tension between private companies and public interest is not new to healthcare and has long been debated in, for example, the setting of pharmaceutics.92-94
AI and FMs can certainly draw on these lessons to help establish an appropriate ethical and regulatory framework as these technologies increase their footprint in medical imaging, rather than attempting to reinvent the wheel. A clear sign of the relevance of these considerations is the EU Commission’s recently announced intention to withdraw the proposed AI Liability Directive in its 2025 work program, demonstrating the regulator’s difficulties in balancing patient protection with incentives for innovation.95, 96
AI and FMs will almost certainly impact healthcare, especially medical imaging,97 in the future. In this setting, radiologists will need to be ready to increase their involvement in multidisciplinary teams. Deployment (and development) of FMs will require the expansion of the expertise requirements in imaging departments and closer collaboration with information technology, data science, and machine learning operations professionals. It could also be argued that the current vision in this profession regarding the implementation of this type of AI is still limited and mostly based on “adding” FMs onto the current clinical workflow.98 However, it is also possible that this may not be the best strategy to implement this technology and may lead to unmet expectations and low impact on patient outcomes.99, 100 Rather, the time may soon come to face the reality that FMs will require a radical rethinking of parts of the medical imaging practice: for example, regarding the scale of service delivery and the role of the radiologist.101
Final thoughts
FMs represent a potential paradigm shift in AI, offering broad adaptability, multimodal integration, and improved generalizability across a wide range of tasks. In radiology, FMs have an immense potential to enable applications spanning image analysis, report generation, and integrative diagnostics across heterogeneous data sources. However, realizing this potential requires addressing key challenges, including issues of transparency, sustainability, data privacy, regulatory complexity, and ethical implementation. The inherent stochasticity and risk of bias in these models necessitate rigorous validation and continuous monitoring. Successful integration will require not only technical advancement but also adaptive clinical workflows and absolute transparency, potentially facilitated through open-source frameworks. Radiologists (along with other stakeholders) must play a central role in guiding the responsible development and deployment of FMs to ensure they augment, rather than undermine, the quality, safety, and equity of patient care. To this end, the authors of this international collaborative effort provide the radiology community with a set of practical recommendations based on the content extensively discussed in this work, to facilitate the better integration of FMs into clinical practice (Table 2). The authors hope that both the review and the accompanying recommendations will serve as a solid foundation for radiologists in adapting to rapidly evolving AI technologies, specifically FMs.