Detailed Information

Audio-Visual Action Recognition Using Transformer Fusion Network (Open Access)

Authors
Kim, Jun-Hwa; Won, Chee Sun
Issue Date
Feb-2024
Publisher
MDPI
Keywords
action recognition; multi-modal; deep learning; video
Citation
Applied Sciences, v.14, no.3, pp. 1-13
Pages
13
Indexed
SCIE
SCOPUS
Journal Title
Applied Sciences
Volume
14
Number
3
Start Page
1
End Page
13
URI
https://scholarworks.dongguk.edu/handle/sw.dongguk/20028
DOI
10.3390/app14031190
ISSN
2076-3417
Abstract
Our approach to action recognition is grounded in the intrinsic coexistence and complementary nature of audio and visual information in videos. Going beyond the traditional emphasis on visual features, we propose a transformer-based network that takes both audio and visual data as inputs. The network is designed to accept and process spatial, temporal, and audio modalities. Features from each modality are extracted with a single Swin Transformer, originally devised for still images. The extracted spatial, temporal, and audio features are then combined by a novel modal fusion module (MFM). Our transformer-based network effectively fuses these three modalities, yielding a robust solution for action recognition.
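
The following PyTorch sketch is only a rough illustration of the pipeline the abstract describes (three modality inputs, one shared Swin Transformer feature extractor, and a fusion step before classification). It is an assumption-laden sketch, not the authors' implementation: torchvision's swin_t stands in for the Swin Transformer backbone, and a plain nn.TransformerEncoder over the three modality embeddings stands in for the paper's modal fusion module (MFM), whose internals are not given in the abstract.

```python
# Minimal sketch of the audio-visual fusion idea described in the abstract.
# Assumptions (not from the paper): torchvision's swin_t is used as the shared
# backbone, and the fusion module is a generic transformer encoder over the
# three modality tokens; the actual MFM may differ substantially.
import torch
import torch.nn as nn
from torchvision.models import swin_t


class AudioVisualFusionNet(nn.Module):
    def __init__(self, num_classes: int = 400, embed_dim: int = 768):
        super().__init__()
        # A single shared Swin Transformer extracts features for all modalities,
        # mirroring the "single Swin Transformer" mentioned in the abstract.
        self.backbone = swin_t(weights=None)
        self.backbone.head = nn.Identity()  # keep the 768-d pooled features

        # Hypothetical fusion stage: a small transformer encoder over the
        # spatial, temporal, and audio embeddings.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, spatial, temporal, audio):
        # Each input is an image-like tensor of shape (B, 3, 224, 224), e.g. an
        # RGB key frame, a temporal (motion) representation, and an audio
        # spectrogram replicated to three channels.
        feats = torch.stack(
            [self.backbone(spatial), self.backbone(temporal), self.backbone(audio)],
            dim=1,
        )  # (B, 3, embed_dim)
        fused = self.fusion(feats).mean(dim=1)  # pool over the three modality tokens
        return self.classifier(fused)


if __name__ == "__main__":
    model = AudioVisualFusionNet(num_classes=10)
    x = torch.randn(2, 3, 224, 224)
    logits = model(x, x.clone(), x.clone())
    print(logits.shape)  # torch.Size([2, 10])
```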
Appears in Collections
College of Engineering > Department of Electronics and Electrical Engineering > 1. Journal Articles

Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.
