Enhanced BLIP-2 Optimization Using LoRA for Generating Dashcam Captions
- Authors
- Cho, Minjun; Kim, Sungwoo; Choi, Dooho; Sung, Yunsick
- Issue Date
- Mar-2025
- Publisher
- MDPI
- Keywords
- autonomous driving; adaptation optimization; domain adaptation; vision caption; vision language model (VLM)
- Citation
- Applied Sciences, v.15, no.7, pp. 1-19
- Pages
- 19
- Indexed
- SCIE; SCOPUS
- Journal Title
- Applied Sciences
- Volume
- 15
- Number
- 7
- Start Page
- 1
- End Page
- 19
- URI
- https://scholarworks.dongguk.edu/handle/sw.dongguk/58231
- DOI
- 10.3390/app15073712
- ISSN
- 2076-3417
2076-3417
- Abstract
- Autonomous driving technology has advanced significantly. However, it is challenging to accurately generate captions for driving environment scenes, which involve dynamic elements such as vehicles, traffic signals, road conditions, weather, and the time of day. Capturing these elements accurately is important for improving situational awareness in autonomous systems. Driving environment scene captioning is an important part of generating driving scenarios and enhancing the interpretability of autonomous systems. However, traditional vision-language models struggle with domain adaptation, since autonomous driving datasets with detailed captions of dashcam-recorded scenes are scarce and such models cannot effectively capture diverse driving environment factors. In this paper, we propose an enhanced method based on Bootstrapping Language-Image Pre-training with frozen vision encoders and large language models (BLIP-2) to optimize domain adaptation by improving scene captioning in autonomous driving environments. It comprises two steps: (1) transforming structured dataset labels into descriptive natural-language captions using a large language model (LLM), and (2) optimizing the Q-Former in a BLIP-2 module with low-rank adaptation (LoRA) to achieve efficient domain adaptation. The structured dataset labels are originally stored in JSON format, where driving environment scene factors are encoded as key-value pairs representing attributes such as object type, position, and state. Using the Large-Scale Diverse Driving Video Database (BDD-100K), our method significantly improves performance, achieving BLEU-4, CIDEr, and SPICE scores each approximately 1.5 times the corresponding baseline BLIP-2 score. These higher scores demonstrate the effectiveness of LoRA-based optimization and, hence, its suitability for autonomous driving applications. Our method effectively enhances accuracy, contextual relevance, and interpretability, contributing to improved scene understanding in autonomous driving systems.
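The two steps described in the abstract can be sketched in code. The snippet below is a minimal illustration, not the authors' implementation: it assumes the Hugging Face transformers checkpoint Salesforce/blip2-opt-2.7b and the peft library, and the label_to_prompt helper, the example label schema, and the LoRA hyperparameters (rank, alpha, dropout) are illustrative assumptions rather than the paper's reported settings.

```python
import json
import torch
from transformers import Blip2ForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Step (1): flatten JSON key-value scene labels into a prompt that an
# LLM can rewrite as a descriptive caption. The label schema here is
# hypothetical, for illustration only.
def label_to_prompt(label: dict) -> str:
    facts = "; ".join(f"{k}: {v}" for k, v in label.items())
    return f"Rewrite these dashcam scene facts as one fluent caption: {facts}."

example_label = json.loads(
    '{"object": "car", "position": "left lane", "state": "braking",'
    ' "weather": "rainy", "time_of_day": "night"}'
)
print(label_to_prompt(example_label))

# Step (2): attach LoRA adapters to the Q-Former of a BLIP-2 checkpoint.
# In the Hugging Face implementation, the Q-Former's attention projections
# are named "query"/"value", so these target_modules leave the frozen
# vision encoder and LLM untouched, consistent with the BLIP-2 design.
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
)
lora_config = LoraConfig(
    r=8,                               # assumed rank, not the paper's value
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["query", "value"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()     # only the LoRA parameters train
```

Under these assumptions, the adapted model would then be fine-tuned on the image-caption pairs produced in step (1); after training, merging the adapters (model.merge_and_unload()) yields a plain BLIP-2 for inference.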
- Files in This Item
- There are no files associated with this item.
- Appears in Collections
- ETC > 1. Journal Articles
