Dear Editor,
We thank the authors for their thoughtful and well-articulated comments on our study.
We agree that large language model (LLM) performance is inherently multidimensional and that temporal variability and response stochasticity are important considerations. As outlined in our Discussion, these aspects, including the use of a single response per query and the absence of repeated sampling, were explicitly acknowledged as limitations of our study.1
Our research was intentionally designed to provide a standardized baseline comparison across models. The single-response-per-query approach was adopted to ensure methodological consistency and comparability while recognizing that it does not capture the full variability of LLM outputs. In this context, the points raised by the authors are valid and consistent with the methodological considerations outlined in our manuscript.
We also concur that reference accuracy represents only one component of overall LLM performance. However, we believe it remains a particularly critical component in radiology, where clinical and academic practice depends on accurate and verifiable sources.2,3 From this perspective, our focused evaluation addresses a fundamental aspect of LLM reliability.
The authors’ emphasis on hallucination is particularly relevant. Our findings are consistent with prior studies demonstrating that fabricated or inaccurate references remain a persistent limitation across current LLMs, reinforcing the need for careful validation and human oversight.4-6
We agree that future research incorporating repeated sampling and broader performance metrics will further enhance the understanding of LLM behavior. Until such work is available, we believe our study provides a necessary and timely benchmark in this domain.
We thank the authors again for their valuable contribution to this discussion.