Multimodal Food Image Classification with Large Language Models

  • Kim, Jun-Hwa
  • Kim, Nam-Ho
  • Jo, Donghyeok
  • Won, Chee Sun
Citations

Web of Science: 2
Scopus: 10

Abstract

In this study, we leverage advancements in large language models (LLMs) for fine-grained food image classification. We achieve this by integrating textual features extracted from images using an LLM into a multimodal learning framework. Specifically, semantic textual descriptions generated by the LLM are encoded and combined with image features obtained from a transformer-based architecture to improve food image classification. Our approach employs a cross-attention mechanism to effectively fuse visual and textual modalities, enhancing the model's ability to extract discriminative features beyond what can be achieved with visual features alone.
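
The cross-attention fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, that image tokens act as queries over text tokens (the abstract does not state the direction), a residual connection, and random placeholder embeddings and weights. All names (`cross_attention`, `image_tokens`, `text_tokens`, `w_q`, `w_k`, `w_v`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention in which image tokens query text tokens.

    Assumption: the query direction and the residual connection are
    illustrative choices, not details taken from the paper.
    Returns the fused tokens and one row of attention weights per image token.
    """
    queries = [matvec(w_q, tok) for tok in image_tokens]
    keys = [matvec(w_k, tok) for tok in text_tokens]
    values = [matvec(w_v, tok) for tok in text_tokens]
    scale = math.sqrt(len(keys[0]))

    fused, weights = [], []
    for img, q in zip(image_tokens, queries):
        # Scaled dot-product score between this image token and every text token.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        attn = softmax(scores)
        # Attention-weighted sum of the text value vectors.
        context = [
            sum(a * v[d] for a, v in zip(attn, values))
            for d in range(len(values[0]))
        ]
        # Residual connection: add the attended text context to the image token.
        fused.append([x + c for x, c in zip(img, context)])
        weights.append(attn)
    return fused, weights


def random_matrix(rows, cols, rng):
    """Placeholder values standing in for learned weights or encoder outputs."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 4
image_tokens = random_matrix(3, dim, rng)  # 3 hypothetical image patch embeddings
text_tokens = random_matrix(5, dim, rng)   # 5 hypothetical text token embeddings
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))

fused, weights = cross_attention(image_tokens, text_tokens, w_q, w_k, w_v)
```

Each row of `weights` is a probability distribution over the text tokens, so every fused image token is the original visual feature plus a text-derived context vector. In a classifier, the fused tokens would typically be pooled and passed to a classification head.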

Keywords

food image classification; fine-grained visual classification; multimodal image feature; large language model; deep learning
DOI
10.3390/electronics13224552
Publication date
2024-11
Type
Article
Journal
Electronics
Volume
13
Issue
22
Pages
1-10