Multi-View Masked Autoencoder for General Image Representation

Ji, Seungbin; Han, Sangkwon; Rhee, Jongtae

Multi-View Masked Autoencoder for General Image Representation (open access)

Authors: Ji, Seungbin; Han, Sangkwon; Rhee, Jongtae

Issue Date: Nov-2023

Publisher: MDPI

Keywords: contrastive learning; deep learning; image representation learning; masked image modeling; self-supervised learning

Citation: Applied Sciences, v.13, no.22, pp 1 - 15

Pages: 15

Indexed: SCIE; SCOPUS

Journal Title: Applied Sciences

Volume: 13

Number: 22

Start Page: 1

End Page: 15

URI: https://scholarworks.dongguk.edu/handle/sw.dongguk/21918

DOI: 10.3390/app132212413

ISSN: 2076-3417

Abstract: Self-supervised learning learns general representations from unlabeled data. Masked image modeling (MIM), a generative self-supervised learning method, has drawn attention for its state-of-the-art performance on various downstream tasks, but its token-level approach leads to poor linear separability. In this paper, we propose a contrastive learning-based multi-view masked autoencoder for MIM that exploits an image-level approach by learning common features from two different augmented views. The contrastive loss strengthens MIM by capturing long-range global patterns. Our framework adopts a simple encoder-decoder architecture and learns rich and general representations through a simple process: (1) Two different views are generated from an input image with random masking, and a contrastive loss learns the semantic distance between the representations the encoder generates for them. A high mask ratio of 80% works as strong augmentation and alleviates the representation collapse problem. (2) With a reconstruction loss, the decoder learns to reconstruct the original image from the masked image. We assessed our framework through several experiments on benchmark datasets for image classification, object detection, and semantic segmentation. We achieved 84.3% fine-tuning accuracy on ImageNet-1K classification and 76.7% in linear probing, exceeding previous studies, and showed promising results on other downstream tasks. The experimental results demonstrate that applying contrastive loss to masked image modeling yields rich and general image representations.
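
The two-step process in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not give the exact loss formulas, so this assumes an InfoNCE-style contrastive loss, a mean-squared reconstruction error computed on masked patches only (the usual masked-autoencoder convention), and a hypothetical `weight` that balances the two terms. The function names, the temperature of 0.2, and the 196-patch grid are likewise illustrative choices.

```python
import math
import random


def random_mask(num_patches, mask_ratio=0.8, rng=random):
    """Split patch indices into (visible, masked) for one view."""
    num_masked = int(round(num_patches * mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(z1, z2, temperature=0.2):
    """Assumed contrastive loss: z1[i] and z2[i] embed two views of image i.

    Every other pairing in the batch is treated as a negative.
    """
    total = 0.0
    for i, anchor in enumerate(z1):
        logits = [cosine(anchor, z) / temperature for z in z2]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / len(z1)


def masked_mse(pred, target, masked):
    """Assumed reconstruction loss: squared error on masked patches only."""
    errors = [(p - t) ** 2 for i in masked for p, t in zip(pred[i], target[i])]
    return sum(errors) / len(errors)


def total_loss(z1, z2, pred, target, masked, weight=1.0):
    """Reconstruction term plus a weighted contrastive term (weight is hypothetical)."""
    return masked_mse(pred, target, masked) + weight * info_nce(z1, z2)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two independently masked views of the same 14 x 14 patch grid.
    view_a = random_mask(196, 0.8, rng)
    view_b = random_mask(196, 0.8, rng)
    print(len(view_a[0]), len(view_a[1]))  # 39 157
```

With an 80% ratio, each view keeps only 39 of 196 patches, so two independently masked views of one image share few visible patches. That is the sense in which masking alone can act as a strong augmentation for the contrastive term.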

Appears in Collections: College of Engineering > Department of Industrial and Systems Engineering > 1. Journal Articles
