Enhancing safety of vision-language reasoning through model-to-model deliberation

Full metadata record
dc.contributor.author: Kim, Sungwoo
dc.contributor.author: Lee, Yongjin
dc.contributor.author: Sung, Yunsick
dc.date.accessioned: 2025-10-20T05:00:14Z
dc.date.available: 2025-10-20T05:00:14Z
dc.date.issued: 2025-10
dc.identifier.issn: 2199-4536
dc.identifier.issn: 2198-6053
dc.identifier.uri: https://scholarworks.dongguk.edu/handle/sw.dongguk/61861
dc.description.abstract: Traditional vision-language models demonstrate strong performance in tasks such as image captioning and visual question answering, but they remain limited by issues such as hallucination, lack of self-correction, and shallow reasoning. These shortcomings compromise the safety, robustness, and consistency of their reasoning, particularly in ambiguous or high-stakes scenarios. In this paper, we propose three complementary frameworks aimed at enabling more trustworthy visual reasoning through structured deliberation. The first is the self-reflective reasoning single-agent framework, which facilitates iterative self-revision without requiring external supervision. The second is the structured debate agent framework, in which turn-based rebuttals between agents promote contrastive, multi-perspective refinement. The third is the progressive two-stage debate agent framework, which enables efficient yet accurate decision-making through model-to-model deliberation between smaller and larger agents. Experiments on the COCO dataset demonstrate that all three frameworks significantly enhance reasoning performance, achieving up to a 5.4% improvement in Intersection over Union (IoU) and over a 40% reduction in localization error compared to a single-pass baseline. Further evaluation across robustness (IoU), safety (self-revision rate, SRR), and consistency (consistency score, CS) confirms the effectiveness of multi-round, self-corrective, and multi-agent reasoning strategies. These results establish a practical path toward safer, more robust, and more interpretable vision-language models through lightweight, deliberative inference frameworks.
dc.language: English
dc.language.iso: ENG
dc.publisher: Springer Nature Switzerland
dc.title: Enhancing safety of vision-language reasoning through model-to-model deliberation
dc.type: Article
dc.publisher.location: Switzerland
dc.identifier.doi: 10.1007/s40747-025-02093-3
dc.identifier.scopusid: 2-s2.0-105018721331
dc.identifier.wosid: 001590825400006
dc.identifier.bibliographicCitation: Complex & Intelligent Systems, v.11, no.11
dc.citation.title: Complex & Intelligent Systems
dc.citation.volume: 11
dc.citation.number: 11
dc.type.docType: Article
dc.description.isOpenAccess: Y
dc.description.journalRegisteredClass: scie
dc.description.journalRegisteredClass: scopus
dc.relation.journalResearchArea: Computer Science
dc.relation.journalWebOfScienceCategory: Computer Science, Artificial Intelligence
dc.subject.keywordAuthor: Vision language model (VLM)
dc.subject.keywordAuthor: Visual question answering (VQA)
dc.subject.keywordAuthor: Vision reasoning
dc.subject.keywordAuthor: Object detection
dc.subject.keywordAuthor: Debate
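The abstract reports robustness as Intersection over Union (IoU) on COCO localization. As a generic illustration of that metric (not the paper's implementation), IoU for two axis-aligned boxes in (x1, y1, x2, y2) form can be sketched as:

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the overlapping rectangle, if any.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A higher IoU between a predicted and a ground-truth box indicates tighter localization, which is how the reported 5.4% improvement is measured.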
Files in This Item
There are no files associated with this item.
Appears in Collections
ETC > 1. Journal Articles

Related Researcher

Sung, Yunsick
College of Advanced Convergence Engineering (Department of Computer Science and Artificial Intelligence)
