Dear Editor,
We would like to thank the correspondent for their thoughtful and constructive comments on our article evaluating the diagnostic sensitivity of a multimodal large language model (MLLM), ChatGPT-4V, for detecting intracranial hemorrhage on non-contrast cranial computed tomography.1 We appreciate the opportunity to clarify the rationale behind our experimental design and to outline directions for future work.
First, we agree that subtle and borderline hemorrhagic findings represent a well-known diagnostic gray zone, even for human readers, and that discordance may partly reflect intrinsic interpretive uncertainty rather than a purely model-specific limitation.1 In our dataset, ChatGPT-4V’s performance was clearly influenced by lesion conspicuity; larger hemorrhage diameters were associated with higher correct classification rates, particularly for epidural and subdural hematomas.2 This finding is consistent with the broader literature showing that MLLM performance with direct image inputs remains variable across tasks and settings and may be limited in real-world radiologic interpretation.3, 4
Second, we fully concur that clinical context can materially shape diagnostic reasoning.1 Our study intentionally adopted an image-only framework to quantify baseline model behavior under controlled conditions and to isolate the effect of prompt structure. Specifically, after an initial open-ended prompt (Q2), we introduced a minimal, targeted clue (“There is bleeding…”) (Q3) to test whether structured guidance influences performance.2 The substantial improvement observed with this guided prompt supports the correspondent’s emphasis on input conditions and prompt engineering.2 It also aligns with published radiology-focused research indicating that prompt optimization (including structured prompting and few-shot approaches) can meaningfully influence LLM outputs and utility.5
Third, regarding the reliance on one or two preselected slices and the absence of dynamic window/level adjustments, we agree that this differs from routine radiologic workflow, in which multi-slice review and interactive windowing are integral, especially for subtle hemorrhage and artifact discrimination.1 As described in our Methods section, we provided the model with representative two-dimensional slices to approximate a best-case static-input scenario.2 We acknowledge that a workflow-faithful evaluation would ideally allow multi-slice correlation (or full-series review) and window/level control. These priorities are also reflected in broader multimodal GPT-4V radiology evaluations that highlight sensitivity to input presentation and context handling.6
Finally, we strongly support the safety considerations highlighted by the correspondent.1 In our conclusion, we emphasized that the model is not suitable for autonomous radiologic interpretation and should be considered, at most, a supervised adjunct within human-in-the-loop paradigms.2 This caution is consistent with the emerging radiology literature emphasizing that MLLMs relying on direct image input have not yet reached a level of performance appropriate for unsupervised clinical deployment.3, 4, 6
We thank the correspondent again for their insightful remarks, which closely align with the key implications of our findings and help frame a clear agenda for clinically meaningful and safe evaluation of multimodal language–vision models in radiology.


