Detailed Information


Enhanced BLIP-2 Optimization Using LoRA for Generating Dashcam Captions (Open Access)

Authors
Cho, Minjun; Kim, Sungwoo; Choi, Dooho; Sung, Yunsick
Issue Date
Mar-2025
Publisher
MDPI
Keywords
autonomous driving; adaptation optimization; domain adaptation; vision caption; vision language model (VLM)
Citation
Applied Sciences, v.15, no.7, pp. 1-19
Pages
19
Indexed
SCIE
SCOPUS
Journal Title
Applied Sciences
Volume
15
Number
7
Start Page
1
End Page
19
URI
https://scholarworks.dongguk.edu/handle/sw.dongguk/58231
DOI
10.3390/app15073712
ISSN
2076-3417
Abstract
Autonomous driving technology has advanced significantly. However, accurately generating captions for driving environment scenes remains challenging, as these scenes involve dynamic elements such as vehicles, traffic signals, road conditions, weather, and the time of day. Capturing these elements accurately is important for improving situational awareness in autonomous systems. Driving environment scene captioning is an important part of generating driving scenarios and enhancing the interpretability of autonomous systems. However, traditional vision-language models struggle with domain adaptation, since autonomous driving datasets with detailed captions of dashcam-recorded scenes are limited and such models cannot effectively capture diverse driving environment factors. In this paper, we propose an enhanced method based on Bootstrapping Language-Image Pre-training with frozen vision encoders and large language models (BLIP-2) to optimize domain adaptation by improving scene captioning in autonomous driving environments. The method comprises two steps: (1) transforming structured dataset labels into descriptive natural-language captions using a large language model (LLM), and (2) optimizing the Q-Former in a BLIP-2 module with low-rank adaptation (LoRA) to achieve efficient domain adaptation. The structured dataset labels are originally stored in JSON format, where driving environment scene factors are encoded as key-value pairs representing attributes such as object type, position, and state. On the Large-Scale Diverse Driving Video Database (BDD-100K), our method significantly improves performance, achieving BLEU-4, CIDEr, and SPICE scores each approximately 1.5 times those of the baseline BLIP-2. These higher scores demonstrate the effectiveness of LoRA-based optimization and, hence, its suitability for autonomous driving applications.
Our method effectively enhances accuracy, contextual relevance, and interpretability, contributing to improved scene understanding in autonomous driving systems.
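The two-step pipeline described in the abstract can be sketched in simplified form. In this sketch, the JSON label schema and field names (`road`, `time`, `weather`, `objects`) are illustrative assumptions rather than the actual BDD-100K format, and a rule-based renderer stands in for the LLM caption-generation step; the second function implements the standard LoRA update, W' = W + (alpha/r)·BA, which is the core idea behind adapting the frozen Q-Former weights.

```python
import numpy as np

def labels_to_caption(labels: dict) -> str:
    """Step 1 (simplified): render structured key-value scene labels into a
    natural-language caption. The paper uses an LLM for this step; a
    rule-based renderer with hypothetical field names stands in here."""
    objects = ", ".join(labels.get("objects", []))
    return (f"A dashcam scene on a {labels['road']} during {labels['time']} "
            f"in {labels['weather']} weather, showing {objects}.")

def lora_update(W: np.ndarray, A: np.ndarray, B: np.ndarray,
                alpha: float, r: int) -> np.ndarray:
    """Step 2 (core idea): LoRA adapts a frozen weight matrix W by adding a
    trainable low-rank product, W' = W + (alpha / r) * B @ A, so only
    B (d x r) and A (r x k) receive gradients during fine-tuning."""
    assert A.shape[0] == r and B.shape[1] == r
    return W + (alpha / r) * (B @ A)

# Example: a frozen 4x4 weight matrix adapted with rank r = 2.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
A = rng.standard_normal((2, 4))   # LoRA's A is initialized randomly
B = np.zeros((4, 2))              # B starts at zero, so W' == W initially
W_adapted = lora_update(W, A, B, alpha=16.0, r=2)

caption = labels_to_caption({
    "road": "highway", "time": "night", "weather": "rainy",
    "objects": ["a truck", "a traffic signal"],
})
```

Because `B` is zero-initialized, the adapted weights equal the frozen weights before training begins, which is what makes LoRA a safe, low-parameter fine-tuning strategy for a pretrained module such as the Q-Former.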
Files in This Item
There are no files associated with this item.
Appears in
Collections
ETC > 1. Journal Articles


Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.

Related Researcher

Sung, Yunsick
College of Advanced Convergence Engineering (Department of Computer Science and Artificial Intelligence)
