Detailed Information


Enhancing safety of vision-language reasoning through model-to-model deliberation (open access)

Authors
Kim, Sungwoo; Lee, Yongjin; Sung, Yunsick
Issue Date
Oct-2025
Publisher
Springer Nature Switzerland
Keywords
Vision language model (VLM); Visual question answering (VQA); Vision reasoning; Object detection; Debate
Citation
Complex & Intelligent Systems, v.11, no.11
Indexed
SCIE
SCOPUS
Journal Title
Complex & Intelligent Systems
Volume
11
Number
11
URI
https://scholarworks.dongguk.edu/handle/sw.dongguk/61861
DOI
10.1007/s40747-025-02093-3
ISSN
2199-4536
2198-6053
Abstract
Traditional vision-language models demonstrate strong performance in tasks such as image captioning and visual question answering, but they remain limited by issues such as hallucination, lack of self-correction, and shallow reasoning. These shortcomings compromise the safety, robustness, and consistency of their reasoning, particularly in ambiguous or high-stakes scenarios. In this paper, we propose three complementary frameworks aimed at enabling more trustworthy visual reasoning through structured deliberation. The first is the self-reflective reasoning single-agent framework, which facilitates iterative self-revision without requiring external supervision. The second is the structured debate agent framework, in which turn-based rebuttals between agents promote contrastive, multi-perspective refinement. The third is the progressive two-stage debate agent framework, which enables efficient yet accurate decision-making through model-to-model deliberation between smaller and larger agents. Experiments on the COCO dataset demonstrate that all three frameworks significantly enhance reasoning performance, achieving up to a 5.4% improvement in Intersection over Union (IoU) and over a 40% reduction in localization error compared to a single-pass baseline. Further evaluation across robustness (IoU), safety (self-revision rate, SRR), and consistency (consistency score, CS) confirms the effectiveness of multi-round, self-corrective, and multi-agent reasoning strategies. These results establish a practical path toward safer, more robust, and more interpretable vision-language models through lightweight, deliberative inference frameworks.
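The robustness metric reported in the abstract, Intersection over Union (IoU), can be illustrated with a minimal sketch. This is not the authors' code; the axis-aligned box format (x1, y1, x2, y2) is an assumption for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A higher IoU between a predicted box and the ground-truth box indicates better localization, which is how a 5.4% IoU improvement translates into the reduced localization error the abstract reports.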
Appears in Collections
ETC > 1. Journal Articles



Related Researcher

Sung, Yunsick
College of Advanced Convergence Engineering (Department of Computer Science and Artificial Intelligence)
