Abstract
In this study, we leverage advancements in large language models (LLMs) for fine-grained food image classification. We achieve this by integrating textual features extracted from images using an LLM into a multimodal learning framework. Specifically, semantic textual descriptions generated by the LLM are encoded and combined with image features obtained from a transformer-based architecture to improve food image classification. Our approach employs a cross-attention mechanism to effectively fuse visual and textual modalities, enhancing the model's ability to extract discriminative features beyond what can be achieved with visual features alone.
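The cross-attention fusion described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes image tokens act as queries and text tokens as keys and values, uses identity projections in place of learned weight matrices, and adds a residual connection. The function names (`cross_attention`, `softmax`, `dot`) and the toy token values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(image_tokens, text_tokens):
    """Single-head cross-attention: each image token attends over all text tokens.

    Assumption: queries come from the image, keys/values from the text.
    Identity projections stand in for the learned W_q, W_k, W_v of a real model.
    Returns the fused tokens and the attention weights per image token.
    """
    d = len(text_tokens[0])
    scale = 1.0 / math.sqrt(d)
    fused, all_weights = [], []
    for q in image_tokens:
        # Scaled dot-product scores between this image token and every text token.
        scores = [dot(q, k) * scale for k in text_tokens]
        weights = softmax(scores)
        # Weighted sum of the text tokens (the values).
        attended = [
            sum(w * v[j] for w, v in zip(weights, text_tokens)) for j in range(d)
        ]
        # Residual connection: keep the visual feature and add the text context.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
        all_weights.append(weights)
    return fused, all_weights


# Toy example: 2 image tokens and 3 text tokens, all of dimension 4.
image_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
text_tokens = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
fused, weights = cross_attention(image_tokens, text_tokens)
```

In this toy setup, each image token places the most weight on the text token it aligns with, and the fused output keeps the image token's shape. A trained model would typically use multiple heads, learned projections, and encoder outputs rather than hand-written vectors.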
Keywords
food image classification; fine-grained visual classification; multimodal image feature; large language model; deep learning
- Title: Multimodal Food Image Classification with Large Language Models
- Authors: Kim, Jun-Hwa; Kim, Nam-Ho; Jo, Donghyeok; Won, Chee Sun
- Publication date: 2024-11
- Type: Article
- Journal: Electronics
- Volume: 13
- Issue: 22
- Pages: 1-10