Detailed Information

Cited 0 times in Web of Science; cited 1 time in Scopus

Diagnostic Accuracy and Clinical Value of a Domain-specific Multimodal Generative AI Model for Chest Radiograph Report Generation

Authors
Hong, Eun Kyoung; Ham, Jiyeon; Roh, Byungseok; Gu, Jawook; Park, Beomhee; Kang, Sunghun; You, Kihyun; Eom, Jihwan; Bae, Byeonguk; Jo, Jae-Bock; Song, Ok Kyu; Bae, Woong; Lee, Ro Woon; Suh, Chong Hyun; Park, Chan Ho; Choi, Seong Jun; Park, Jai Soung; Park, Jae-Hyeong; Jeon, Hyun Jeong; Hong, Jeong-Ho; Cho, Dosang; Choi, Han Seok; Kim, Tae Hee
Issue Date
Mar-2025
Publisher
Radiological Society of North America
Keywords
Pandas Version 2.1.1; Scipy Version 1.11.3; Algorithm; Area Under The Curve; Article; Artificial Intelligence; Atelectasis; Computer Assisted Tomography; Controlled Study; Cross Validation; Deep Learning; Diagnostic Accuracy; Diagnostic Test Accuracy Study; Follow Up; Fracture; Human; Hyperinflation; Image Quality; Lung Edema; Lung Lesion; Machine Learning; Multicenter Study; Outcome Assessment; Pleura Effusion; Pneumothorax; Prediction; Predictive Value; Radiologist; Receiver Operating Characteristic; Retrospective Study; Sensitivity And Specificity; Subcutaneous Emphysema; Thorax Radiography; Training; Adult; Aged; Clinical Trial; Computer Assisted Diagnosis; Female; Male; Middle Aged; Procedures; Reproducibility; Humans; Radiographic Image Interpretation, Computer-assisted; Radiography, Thoracic; Reproducibility Of Results; Retrospective Studies
Citation
Radiology, v.314, no.3
Indexed
SCIE
SCOPUS
Journal Title
Radiology
Volume
314
Number
3
URI
https://scholarworks.dongguk.edu/handle/sw.dongguk/58229
DOI
10.1148/radiol.241476
ISSN
0033-8419
1527-1315
Abstract
Background: Generative artificial intelligence (AI) is anticipated to alter radiology workflows, requiring a clinical value assessment for frequent examinations such as chest radiograph interpretation.

Purpose: To develop and evaluate the diagnostic accuracy and clinical value of a domain-specific multimodal generative AI model for providing preliminary interpretations of chest radiographs.

Materials and Methods: For training, consecutive radiograph-report pairs from frontal chest radiography were retrospectively collected from 42 hospitals (2005-2023). The trained domain-specific AI model generated radiology reports for the radiographs. The test set included public datasets (PadChest, Open-i, VinDr-CXR, and MIMIC-CXR-JPG) and radiographs excluded from training. The sensitivity and specificity of the model-generated reports for 13 radiographic findings, compared with radiologist annotations (reference standard), were calculated (with 95% CIs). Four radiologists evaluated the subjective quality of the reports in terms of acceptability, agreement score, quality score, and comparative ranking of reports from (a) the domain-specific AI model, (b) radiologists, and (c) a general-purpose large language model (GPT-4Vision). Acceptability was defined as whether the radiologist would endorse the report as their own without changes. Agreement scores from 1 (clinically significant discrepancy) to 5 (complete agreement) were assigned using RADPEER; quality scores were on a 5-point Likert scale from 1 (very poor) to 5 (excellent).

Results: A total of 8 838 719 radiograph-report pairs (training) and 2145 radiographs (testing) were included (anonymized with respect to sex and gender). Reports generated by the domain-specific AI model demonstrated high sensitivity for detecting two critical radiographic findings: 95.3% (181 of 190) for pneumothorax and 92.6% (138 of 149) for subcutaneous emphysema.
Acceptance rate, evaluated by four radiologists, was 70.5% (6047 of 8580), 73.3% (6288 of 8580), and 29.6% (2536 of 8580) for model-generated, radiologist, and GPT-4Vision reports, respectively. Agreement scores were highest for the model-generated reports (median = 4 [IQR, 3-5]) and lowest for GPT-4Vision reports (median = 1 [IQR, 1-3]; P < .001). Quality scores were also highest for the model-generated reports (median = 4 [IQR, 3-5]) and lowest for the GPT-4Vision reports (median = 2 [IQR, 1-3]; P < .001). In the ranking analysis, model-generated reports were most frequently ranked highest (60.0%; 5146 of 8580), and GPT-4Vision reports were most frequently ranked lowest (73.6%; 6312 of 8580).

Conclusion: A domain-specific multimodal generative AI model demonstrated potential for high diagnostic accuracy and clinical value in providing preliminary interpretations of chest radiographs for radiologists.

© RSNA, 2025. Supplemental material is available for this article. See also the editorial by Little in this issue.
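The abstract reports per-finding sensitivity with 95% CIs but does not state the interval method used. As a minimal sketch (assuming a Wilson score interval, which the paper may or may not have used), the reported pneumothorax sensitivity can be reproduced from the published counts:

```python
from math import sqrt

def sensitivity_with_wilson_ci(tp: int, fn: int, z: float = 1.96):
    """Sensitivity (recall) with an approximate 95% Wilson score interval.

    tp: true positives; fn: false negatives, for one radiographic finding.
    Returns (point estimate, (lower bound, upper bound)).
    """
    n = tp + fn                      # all positive cases of the finding
    p = tp / n                       # point estimate of sensitivity
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return p, (center - half, center + half)

# Counts reported in the abstract: 181 of 190 pneumothorax cases detected.
sens, (lo, hi) = sensitivity_with_wilson_ci(tp=181, fn=190 - 181)
print(f"sensitivity = {sens:.1%}, 95% CI ({lo:.1%}, {hi:.1%})")
# → sensitivity = 95.3%, 95% CI (91.2%, 97.5%)
```

The same helper applies to the subcutaneous emphysema figure (138 of 149, i.e. 92.6%); the CI bounds shown are a consequence of the assumed Wilson method, not values taken from the paper.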
Files in This Item
There are no files associated with this item.
Appears in
Collections
Graduate School > Department of Medicine > 1. Journal Articles
Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.
