Enhanced BLIP-2 Optimization Using LoRA for Generating Dashcam Captions
- Authors
- Cho, Minjun; Kim, Sungwoo; Choi, Dooho; Sung, Yunsick
- Issue Date
- Mar-2025
- Publisher
- MDPI
- Keywords
- autonomous driving; adaptation optimization; domain adaptation; vision caption; vision language model (VLM)
- Citation
- Applied Sciences, v.15, no.7, pp. 1-19
- Pages
- 19
- Indexed
- SCIE; SCOPUS
- Journal Title
- Applied Sciences
- Volume
- 15
- Number
- 7
- Start Page
- 1
- End Page
- 19
- URI
- https://scholarworks.dongguk.edu/handle/sw.dongguk/58231
- DOI
- 10.3390/app15073712
- ISSN
- 2076-3417
2076-3417
- Abstract
- Autonomous driving technology has advanced significantly. However, it is challenging to accurately generate captions for driving environment scenes, which involve dynamic elements such as vehicles, traffic signals, road conditions, weather, and the time of day. Capturing these elements accurately is important for improving situational awareness in autonomous systems. Driving environment scene captioning is an important part of generating driving scenarios and enhancing the interpretability of autonomous systems. However, traditional vision-language models struggle with domain adaptation, since autonomous driving datasets with detailed captions of dashcam-recorded scenes are scarce and such models cannot effectively capture diverse driving environment factors. In this paper, we propose an enhanced method based on Bootstrapping Language-Image Pre-training with frozen vision encoders and large language models (BLIP-2) to optimize domain adaptation by improving scene captioning in autonomous driving environments. It comprises two steps: (1) transforming structured dataset labels into descriptive natural-language captions using a large language model (LLM), and (2) optimizing the Q-Former in a BLIP-2 module with low-rank adaptation (LoRA) to achieve efficient domain adaptation. The structured dataset labels are originally stored in JSON format, where driving environment scene factors are encoded as key-value pairs representing attributes such as object type, position, and state. Using the Large-Scale Diverse Driving Video Database (BDD-100K), our method significantly improves performance, achieving BLEU-4, CIDEr, and SPICE scores each approximately 1.5 times the corresponding baseline BLIP-2 score. These higher scores demonstrate the effectiveness of LoRA-based optimization and, hence, its suitability for autonomous driving applications. Our method effectively enhances accuracy, contextual relevance, and interpretability, contributing to improved scene understanding in autonomous driving systems.
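The two steps described in the abstract can be sketched in code. The snippet below is a minimal illustration, not the authors' implementation: it assumes the Hugging Face transformers checkpoint Salesforce/blip2-opt-2.7b and the peft library, and the label_to_prompt helper, the example label schema, and the LoRA hyperparameters (rank, alpha, dropout) are illustrative assumptions rather than the paper's reported settings.

```python
import json
import torch
from transformers import Blip2ForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Step (1): flatten JSON key-value scene labels into a prompt that an
# LLM can rewrite as a descriptive caption. The label schema here is
# hypothetical, for illustration only.
def label_to_prompt(label: dict) -> str:
    facts = "; ".join(f"{k}: {v}" for k, v in label.items())
    return f"Rewrite these dashcam scene facts as one fluent caption: {facts}."

example_label = json.loads(
    '{"object": "car", "position": "left lane", "state": "braking",'
    ' "weather": "rainy", "time_of_day": "night"}'
)
print(label_to_prompt(example_label))

# Step (2): attach LoRA adapters to the Q-Former of a BLIP-2 checkpoint.
# In the Hugging Face implementation, the Q-Former's attention projections
# are named "query"/"value", so these target_modules leave the frozen
# vision encoder and LLM untouched, consistent with the BLIP-2 design.
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
)
lora_config = LoraConfig(
    r=8,                               # assumed rank, not the paper's value
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["query", "value"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()     # only the LoRA parameters train
```

Under these assumptions, the adapted model would then be fine-tuned on the image-caption pairs produced in step (1); after training, merging the adapters (model.merge_and_unload()) yields a plain BLIP-2 for inference.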
- Files in This Item
- There are no files associated with this item.
- Appears in Collections
- ETC > 1. Journal Articles
