Enhancing safety of vision-language reasoning through model-to-model deliberation
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Kim, Sungwoo | - |
| dc.contributor.author | Lee, Yongjin | - |
| dc.contributor.author | Sung, Yunsick | - |
| dc.date.accessioned | 2025-10-20T05:00:14Z | - |
| dc.date.available | 2025-10-20T05:00:14Z | - |
| dc.date.issued | 2025-10 | - |
| dc.identifier.issn | 2199-4536 | - |
| dc.identifier.issn | 2198-6053 | - |
| dc.identifier.uri | https://scholarworks.dongguk.edu/handle/sw.dongguk/61861 | - |
| dc.description.abstract | Traditional vision-language models demonstrate strong performance in tasks such as image captioning and visual question answering, but they remain limited by issues such as hallucination, lack of self-correction, and shallow reasoning. These shortcomings compromise the safety, robustness, and consistency of their reasoning, particularly in ambiguous or high-stakes scenarios. In this paper, we propose three complementary frameworks aimed at enabling more trustworthy visual reasoning through structured deliberation. The first is the self-reflective reasoning single-agent framework, which facilitates iterative self-revision without requiring external supervision. The second is the structured debate agent framework, in which turn-based rebuttals between agents promote contrastive, multi-perspective refinement. The third is the progressive two-stage debate agent framework, which enables efficient yet accurate decision-making through model-to-model deliberation between smaller and larger agents. Experiments on the COCO dataset demonstrate that all three frameworks significantly enhance reasoning performance, achieving up to a 5.4% improvement in Intersection over Union (IoU) and over a 40% reduction in localization error compared to a single-pass baseline. Further evaluation across robustness (IoU), safety (self-revision rate, SRR), and consistency (consistency score, CS) confirms the effectiveness of multi-round, self-corrective, and multi-agent reasoning strategies. These results establish a practical path toward safer, more robust, and more interpretable vision-language models through lightweight, deliberative inference frameworks. | - |
| dc.language | English | - |
| dc.language.iso | ENG | - |
| dc.publisher | Springer Nature Switzerland | - |
| dc.title | Enhancing safety of vision-language reasoning through model-to-model deliberation | - |
| dc.type | Article | - |
| dc.publisher.location | Switzerland | - |
| dc.identifier.doi | 10.1007/s40747-025-02093-3 | - |
| dc.identifier.scopusid | 2-s2.0-105018721331 | - |
| dc.identifier.wosid | 001590825400006 | - |
| dc.identifier.bibliographicCitation | Complex & Intelligent Systems, v.11, no.11 | - |
| dc.citation.title | Complex & Intelligent Systems | - |
| dc.citation.volume | 11 | - |
| dc.citation.number | 11 | - |
| dc.type.docType | Article | - |
| dc.description.isOpenAccess | Y | - |
| dc.description.journalRegisteredClass | scie | - |
| dc.description.journalRegisteredClass | scopus | - |
| dc.relation.journalResearchArea | Computer Science | - |
| dc.relation.journalWebOfScienceCategory | Computer Science, Artificial Intelligence | - |
| dc.subject.keywordAuthor | Vision language model (VLM) | - |
| dc.subject.keywordAuthor | Visual question answering (VQA) | - |
| dc.subject.keywordAuthor | Vision reasoning | - |
| dc.subject.keywordAuthor | Object detection | - |
| dc.subject.keywordAuthor | Debate | - |
Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.
Certain data included herein are derived from the © Web of Science of Clarivate Analytics. All rights reserved.
You may not copy or re-distribute this material in whole or in part without the prior written consent of Clarivate Analytics.
