Multimodal Food Image Classification with Large Language Models (Open Access)
- Authors
- Kim, Jun-Hwa; Kim, Nam-Ho; Jo, Donghyeok; Won, Chee Sun
- Issue Date
- Nov-2024
- Publisher
- MDPI
- Keywords
- food image classification; fine-grained visual classification; multimodal image feature; large language model; deep learning
- Citation
- Electronics, v.13, no.22, pp 1 - 10
- Pages
- 10
- Indexed
- SCIE; SCOPUS
- Journal Title
- Electronics
- Volume
- 13
- Number
- 22
- Start Page
- 1
- End Page
- 10
- URI
- https://scholarworks.dongguk.edu/handle/sw.dongguk/56353
- DOI
- 10.3390/electronics13224552
- ISSN
- 2079-9292
- Abstract
- In this study, we leverage advancements in large language models (LLMs) for fine-grained food image classification. We achieve this by integrating textual features extracted from images using an LLM into a multimodal learning framework. Specifically, semantic textual descriptions generated by the LLM are encoded and combined with image features obtained from a transformer-based architecture to improve food image classification. Our approach employs a cross-attention mechanism to effectively fuse visual and textual modalities, enhancing the model's ability to extract discriminative features beyond what can be achieved with visual features alone.
- Appears in Collections
- College of Engineering > Department of Electronics and Electrical Engineering > 1. Journal Articles
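
The abstract describes fusing visual and textual features with a cross-attention mechanism. As a rough illustration of that general idea only, the following dependency-free sketch lets image tokens attend over text tokens. It is not the authors' implementation: the function names, the toy feature values, the single attention head, the omission of learned query/key/value projections, and the residual addition are all assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_tokens, text_tokens):
    """Single-head cross-attention: image tokens query text tokens.

    Hypothetical simplification: learned query/key/value projections are
    omitted (identity), so both inputs must share one feature size.
    """
    dim = len(image_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused, weights = [], []
    for q in image_tokens:
        # Scaled dot-product score between this image token and each text token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in text_tokens]
        attn = softmax(scores)
        # Attention-weighted sum of the text tokens, used here as the values.
        context = [
            sum(w * v[j] for w, v in zip(attn, text_tokens)) for j in range(dim)
        ]
        # Residual addition keeps the original visual feature in the output.
        fused.append([qi + ci for qi, ci in zip(q, context)])
        weights.append(attn)
    return fused, weights


# Toy inputs (made up): 2 image patch features and 3 text token features.
image_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
text_tokens = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]
fused, weights = cross_attention(image_tokens, text_tokens)
```

Each row of `weights` shows how strongly one image token attends to each text token, and `fused` holds the image features enriched with the attended text context. In a trained model, the encoders and projection matrices would be learned rather than fixed as they are here.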
