Multimodal Food Image Classification with Large Language Models

Kim, Jun-Hwa; Kim, Nam-Ho; Jo, Donghyeok; Won, Chee Sun

doi:10.3390/electronics13224552


Citations (Web of Science): 2

Citations (Scopus): 10

Abstract

In this study, we leverage advancements in large language models (LLMs) for fine-grained food image classification. We achieve this by integrating textual features extracted from images using an LLM into a multimodal learning framework. Specifically, semantic textual descriptions generated by the LLM are encoded and combined with image features obtained from a transformer-based architecture to improve food image classification. Our approach employs a cross-attention mechanism to effectively fuse visual and textual modalities, enhancing the model's ability to extract discriminative features beyond what can be achieved with visual features alone.
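
The cross-attention fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not state which modality supplies the queries, so the choice here (image tokens as queries, text tokens as keys and values), the single attention head, the residual connection, the mean pooling, and all dimensions and random weights are assumptions made purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(img_tokens, txt_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: image tokens attend over text tokens.

    Assumption: image features act as queries and text features as
    keys/values; the source abstract does not specify the direction.
    """
    q = img_tokens @ w_q                      # (n_img, d)
    k = txt_tokens @ w_k                      # (n_txt, d)
    v = txt_tokens @ w_v                      # (n_txt, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_img, n_txt)
    attn = softmax(scores, axis=-1)           # each row sums to 1
    fused = img_tokens + attn @ v             # residual add, needs d == d_img
    return fused, attn


# Hypothetical sizes: 4 image tokens, 3 text tokens, 5 food classes.
rng = np.random.default_rng(0)
n_img, n_txt, d_img, d_txt, num_classes = 4, 3, 8, 6, 5

img_tokens = rng.normal(size=(n_img, d_img))  # stand-in for transformer image features
txt_tokens = rng.normal(size=(n_txt, d_txt))  # stand-in for encoded LLM descriptions

# Random projections stand in for learned weights.
w_q = rng.normal(size=(d_img, d_img)) * 0.1
w_k = rng.normal(size=(d_txt, d_img)) * 0.1
w_v = rng.normal(size=(d_txt, d_img)) * 0.1
w_cls = rng.normal(size=(d_img, num_classes)) * 0.1

fused, attn = cross_attention_fuse(img_tokens, txt_tokens, w_q, w_k, w_v)
logits = fused.mean(axis=0) @ w_cls           # pool tokens, then classify
print(fused.shape, attn.shape, logits.shape)
```

In this sketch each image token is updated with a weighted mix of the text-description features, so the pooled representation carries information from both modalities before classification.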

Keywords

food image classification; fine-grained visual classification; multimodal image feature; large language model; deep learning

Publication date: 2024-11

Type: Article

Journal: Electronics

Volume: 13

Issue: 22

Pages: 1-10