Enhancing Audio-Visual Association with Self-Supervised Curriculum Learning

Paper Review/Video Representations learning 2022. 5. 4. 12:54

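This is one of the papers I am considering as a base for my work.

It deals with audio-visual representation learning. I want to do self-supervised learning using the audio-visual association within a video, and I need concrete methods for that, so I have been looking for them.

The paper was accepted at AAAI 2021. Unfortunately, the code does not seem to be publicly available.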

1. Introduction

The paper says that the co-occurrence of acoustic signals and visual appearance can be a latent cue in human experience.

For example, on hearing the sound of a bouncing ball, a person can match it to a basketball game scenario among numerous visual scene candidates. For a machine model, this inherent and pervasive correspondence raises the possibility of acquiring a similar human-like ability by investigating audio-visual representation learning and discovering complex correlations between different audio-visual messages. In contrast to expensive human annotation, audio-visual messages can serve as a pervasive supervisory signal for self-supervised learning. They also make it possible to co-train multi-modality networks on large-scale unlabeled data.

Audio-visual representation learning can be divided into two types.

1. Audio-Visual Correspondence (AVC)

2. Audio-Visual Synchronization (AVS)

Both types mainly work on a verification task: predicting whether an input pair of audio and video clips matches or not. Positive audio and video pairs are usually sampled from the same video.

The main difference between the two is how the negative audio and video pairs are constructed.

AVC - negative pairs are taken from different videos

AVS - negative pairs are taken from the same video, as misalignments between the audio and the video
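To make the AVC/AVS distinction concrete, here is a minimal sketch of how the pairs could be sampled. The function name `sample_pairs` and the (video_id, clip_index) representation are my own assumptions for illustration, not something taken from the paper.

```python
import random


def sample_pairs(num_videos, clips_per_video, mode, rng):
    """Return (positive, negative) pairs.

    Each pair is (visual_clip, audio_clip), and each clip is a
    (video_id, clip_index) tuple. Hypothetical helper for illustration.
    """
    vid = rng.randrange(num_videos)
    t = rng.randrange(clips_per_video)
    # Positive pair: audio and visual clip from the same video and time.
    positive = ((vid, t), (vid, t))

    if mode == "AVC":
        # AVC negative: the audio comes from a different video.
        other = rng.choice([v for v in range(num_videos) if v != vid])
        negative = ((vid, t), (other, rng.randrange(clips_per_video)))
    elif mode == "AVS":
        # AVS negative: same video, but the audio is temporally misaligned.
        shifted = rng.choice([s for s in range(clips_per_video) if s != t])
        negative = ((vid, t), (vid, shifted))
    else:
        raise ValueError("mode must be 'AVC' or 'AVS'")
    return positive, negative


rng = random.Random(0)
print(sample_pairs(num_videos=5, clips_per_video=4, mode="AVC", rng=rng))
print(sample_pairs(num_videos=5, clips_per_video=4, mode="AVS", rng=rng))
```

In both modes the positive pair is identical; only the source of the negative audio changes.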

Existing methods such as cross-modal knowledge transfer or two-stream audio-visual models consider the information shared between the two modalities for learning semantic representations, but they ignore the important cues that come from considering multiple audio and video pairs at the same time. In addition, they make little use of the useful information within a single modality for modeling that modality's data distribution.

To address these problems, the paper learns the correspondence between audio and visual with teacher-student learning.

By having the teacher network train the student network, semantic audio and visual representations can be obtained from unlabeled videos, and contrastive learning is used for this. There is also a difference here from earlier, similar teacher-student knowledge transfer approaches.

Earlier approaches mainly focused on mimicking the teacher network's intermediate representations or logits in a pairwise manner. However, the knowledge in the audio and video derived from the same video reflects only a single aspect of the complete knowledge summarized in a complex network. With contrastive learning, richer structured knowledge can be captured from the self-supervised prediction. This scheme not only helps the student model learn how audio and visual objects occur concurrently, but also reveals why noisy, mismatched audio and video pairs differ from matched ones. -> Isn't this part too vague? It seems to just say that the teacher-student network is good.

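Given the heterogeneous complexity of audio-visual scenes, directly passing information from one network to another can degrade the contrastive learning process.

The paper then describes the model, which I will cover in detail later. Personally, I think the most notable feature is the curriculum learning in which the teacher and the student swap roles during training.

These are the main contributions stated in the paper.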

• We propose a self-supervised audio-visual modality transfer framework termed SSCL to explore more coherent knowledge from a teacher network to a student network, where contrastive learning is leveraged to capture the correspondence between audio and visual information.

• We develop a two-stage curriculum learning process to reason about multiple single-modality instances and distill cross-modal correction information. This process not only improves the overall distillation performance but also regularizes the teacher and student model to generalize on noisy and complex scenarios.

• We further apply the learned audio-visual representations to a variety of audio and visual downstream tasks. The extensive experiments verify the powerful audio-visual representations learned by our SSCL method, leading to the remarkable improvement of the performance on the downstream tasks compared with previous approaches

I always consider a paper's contributions important, so I brought all of them over.

2. Related Work

Self-Supervised Representation Learning of Audio-Visual Data

This part explains the performance benefit of using visual and audio together, and notes that self-supervision makes it possible to use large amounts of unlabeled data. The models covered rely on pretext tasks such as colorization and rotation prediction, or are based on contrastive loss functions.

Three papers are described as recent work:

Co-training of audio and video representations from self-supervised temporal synchronization (NeurIPS 2018)

Self-supervised learning by cross-modal audio-video clustering (NeurIPS 2020)

Music Gesture for Visual Sound Separation (CVPR 2020)

These use the co-occurrence of audio waves and visual objects, and their downstream applications include sound classification, separation, localization, visual representation, and synchronization.

In particular, audio or visual information is useful as supervision for pre-training a visual or audio model.

The papers mentioned are:

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features (ECCV 2018), in which an early-fusion multi-sensory network learns to predict whether video frames and audio are temporally aligned, and

On Attention Modules for Audio-Visual Synchronization (CVPR 2019), which combines a two-stream network architecture with an attention mechanism to localize the sound source.

Cross-Modal Learning and Distillation

Modalities inherent in video, such as optical flow, visual, and audio, are used as supervisory signals in representation learning.

The clustering paper above is explained further: it models audiovisual correspondence without teacher supervision by predicting whether the sound and the frame come from the same video.

For the seminar, briefly explaining about 5~6 papers should be enough for this part.

3. Methodology

Overview of Our SSCL Approach

A video dataset of N samples is defined, and the unlabelled video clips are processed by a visual encoder and an audio encoder to produce paired visual and audio representations.

The goal is to train the visual and audio encoders effectively, so that f_i^v and f_j^a serve as the uni-modal representations.

4. Experiments

Experimental Setup

The evaluation follows the common practice of self-supervised learning: the pre-trained model is transferred and evaluated on downstream tasks.

Visual representation - action recognition

Audio representation - sound recognition

Pre-training Dataset

For audio-visual pre-training, the standard Kinetics-400 dataset is used as an unlabeled benchmark to pre-train the model.

The Kinetics-400 dataset consists of 306,000 video clips collected from YouTube, and covers 400 human action classes that include human-object interactions as well as human-human interactions.

Video and Audio Encoder

The models used to extract the visual and audio features are:

Video encoder - S3D

Audio encoder - 10-layer ResNet

Each encoder's features are projected by two fully connected layers of size 512-D and embedded into 128 dimensions, normalized with L2 normalization. The 128-D embedding is used for the contrastive loss. This setup is said to be similar to the paper Contrastive bidirectional transformer for temporal representation learning.
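As a rough sketch of the projection described above, the snippet below maps a backbone feature through a 512-D hidden layer to an L2-normalized 128-D embedding. The backbone dimension (`BACKBONE_DIM`), the ReLU between the two layers, and the random initialization are my assumptions; the post only states two fully connected layers, 512-D, and a normalized 128-D output.

```python
import math
import random

BACKBONE_DIM = 256  # hypothetical encoder output size, chosen only for the demo
HIDDEN_DIM = 512    # size of the fully connected layer mentioned in the post
EMBED_DIM = 128     # final embedding size used for the contrastive loss


def make_layer(in_dim, out_dim, rng):
    """Create a randomly initialized (weight, bias) pair."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(x, layer):
    weight, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def l2_normalize(x, eps=1e-12):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / max(norm, eps) for v in x]


def project(feature, layer1, layer2):
    """Two fully connected layers followed by L2 normalization."""
    hidden = [max(0.0, h) for h in linear(feature, layer1)]  # ReLU (assumed)
    return l2_normalize(linear(hidden, layer2))


rng = random.Random(0)
layer1 = make_layer(BACKBONE_DIM, HIDDEN_DIM, rng)
layer2 = make_layer(HIDDEN_DIM, EMBED_DIM, rng)
feature = [rng.gauss(0.0, 1.0) for _ in range(BACKBONE_DIM)]
embedding = project(feature, layer1, layer2)
print(len(embedding), math.sqrt(sum(v * v for v in embedding)))
```

Because the output has unit length, the dot product between two embeddings is their cosine similarity, which is what a temperature-scaled contrastive loss typically works with.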

Training Details

To extract the visual features, 16 frames are sampled from each video clip with a sliding step of 4 (about 3 seconds), and the frames are resized to a resolution of 112*112. To extract the audio features, 2 seconds of audio are sampled at random and a log spectrogram of size 128x128 (128 time steps with 128 frequency bands) is computed. The model is trained with SGD using a linear warm-up scheme and an initial learning rate of 0.03. The SGD weight decay is 10^-5 and the momentum is 0.9.
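The optimizer settings above can be sketched as follows. The learning rate, momentum, and weight decay are the values quoted in the post; the warm-up length (`WARMUP_STEPS`) and the constant rate after warm-up are my assumptions, since the post does not give them.

```python
BASE_LR = 0.03
MOMENTUM = 0.9
WEIGHT_DECAY = 1e-5
WARMUP_STEPS = 1000  # hypothetical; the warm-up length is not given in the post


def warmup_lr(step, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS):
    """Linearly ramp the learning rate up to base_lr, then hold it (assumed)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


def sgd_step(params, grads, velocity, lr,
             momentum=MOMENTUM, weight_decay=WEIGHT_DECAY):
    """One SGD update with momentum and L2 weight decay on flat lists."""
    new_params, new_velocity = [], []
    for p, g, v in zip(params, grads, velocity):
        g = g + weight_decay * p  # weight decay folded into the gradient
        v = momentum * v + g      # momentum buffer
        new_velocity.append(v)
        new_params.append(p - lr * v)
    return new_params, new_velocity


params, velocity = [1.0, -2.0], [0.0, 0.0]
for step in range(3):
    grads = [2.0 * p for p in params]  # gradient of sum(p^2), a toy objective
    params, velocity = sgd_step(params, grads, velocity, warmup_lr(step))
print(params)
```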

Training runs for about 200 epochs, and the batch size is set to 128 for experiments on 8 GPUs.

The number of negative pairs K is set to 16,384 and the temperature parameter is set to 0.07.
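To show where K and the temperature enter, here is a minimal sketch of a standard InfoNCE-style contrastive loss. I am assuming this general form; the post does not reproduce the paper's exact loss, and the tiny vectors below stand in for the 128-D embeddings and the 16,384 negatives.

```python
import math


def info_nce_loss(query, positive, negatives, temperature=0.07):
    """-log( exp(q.k+/t) / (exp(q.k+/t) + sum over negatives exp(q.k-/t)) )."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Index 0 holds the positive logit, the rest are the K negative logits.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]

    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


query = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
print(info_nce_loss(query, [1.0, 0.0], negatives))  # matched pair: small loss
print(info_nce_loss(query, [0.0, 1.0], negatives))  # mismatched pair: larger loss
```

A small temperature such as 0.07 sharpens the softmax, so a positive that is only slightly more similar than the negatives still dominates the loss.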

Downstream Tasks

visual representation f^v -> action recognition on the UCF-101 and HMDB-51 datasets

audio representation f^a -> sound classification on the ESC-50 and DCASE datasets

Evaluation of Audio-Visual Representation

Action Recognition

UCF-101 - 13K videos from 101 action classes

HMDB-51 - 7K videos from 51 action classes

Fine-tuning is done on these two datasets.

Once the audio-visual pre-training process is complete, the visual model (S3D) is initialized with the learned parameters, and the last classification layer is randomly initialized for action recognition. In addition, the model is fine-tuned with various input configurations to examine the effect of temporal and spatial resolution.

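Because the experimental settings used for this task vary widely, it is hard to put every method under exactly the same conditions, so for a meaningful comparison the paper runs a range of experiments and reports them in its results table.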

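The paper makes about five points regarding the results.

1) All models pre-trained on a larger unlabelled dataset show a large accuracy gain on small-dataset classification compared with baseline models that are fully trained from scratch.

-> This means that a meaningful pretext task produces an effective initialization for the ConvNet, which shows up as a performance boost.

2) Existing methods use complex visual encoders, which the paper says is not effective and is computationally intractable.

It says this gets worse when deploying the model in real-world scenarios.

3) Compared with those self-supervision methods, our method yields better results.

-> Looking at the parameters and FLOPs, the model achieves good performance with a small video backbone, and the paper argues that this shows how effective the proposed model is.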

4) The accuracy gained from stage-1 to stage-2 demonstrates the effect of curriculum learning, and the paper explains that cross-modal training provides a strong self-supervision signal because of the concurrent nature of acoustic and visual messages.

Sound Recognition

The audio encoder (2D-ResNet10) is kept fixed except for the last classification layer, and the quality of the audio representation is tested on two sound classification datasets.

ESC-50 - 2000 audio clips from 50 balanced environment sound classes

DCASE - 10 balanced scene sound classes, 2000 audio clips

The input audio is a waveform randomly sampled within 1 second at a 24 kHz sampling rate, and a spectrogram of size 128*128 (time and frequency bands) is extracted from the sampled waveform.

This figure shows the overall results.

1) Even with linear probing, the audio representations from the fixed filters outperform fully training a ConvNet from random initialization.

2) As in the visual case, the audio representations also outperform previous work.

Intuitively, the elaborate cross-modal knowledge transfer works efficiently, and audio-visual correspondence helps produce better uni-modal representations.
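For reference, linear probing as used in this evaluation can be sketched like this: the encoder is frozen and only a linear classifier on top of its features is trained. The toy `frozen_encoder`, the two-class data, and the plain logistic-regression update are all stand-ins I made up for illustration; the paper uses the pre-trained 2D-ResNet10 and real audio datasets.

```python
import math


def frozen_encoder(x):
    """Stand-in for the pre-trained audio encoder; it is never updated."""
    return [x[0] + x[1], x[0] - x[1]]


def train_linear_probe(inputs, labels, epochs=200, lr=0.5):
    """Train only a linear (logistic) classifier on frozen features."""
    feats = [frozen_encoder(x) for x in inputs]  # computed once, encoder fixed
    w, b = [0.0] * len(feats[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # sigmoid output minus label
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b


def predict(x, w, b):
    f = frozen_encoder(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


inputs = [(1.0, 0.0), (0.8, 0.1), (0.0, 1.0), (0.1, 0.9)]
labels = [1, 1, 0, 0]
w, b = train_linear_probe(inputs, labels)
print([predict(x, w, b) for x in inputs])
```

Since only `w` and `b` change, the accuracy of the probe reflects how linearly separable the frozen features already are, which is why it is used as a measure of representation quality.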

Further Analysis

Stage-1 of the curriculum learning for the teacher network is based on the assumption that noisy and irrelevant information between the audio and visual modalities can affect the transfer process.

An experiment is run to verify this hypothesis, and its content is shown in the figure below.

Looking at (a) there, the following results are obtained.

1) Directly optimizing the audio and visual knowledge transfer is a bad choice because of the noise.

2) The two stages of knowledge transfer between audio and visual typically have better results

3) The transfer process is completed after stage-II results in better representations than stage-I due to the knowledge transfer between audio and visual.

-> Going all the way through stage 2 gives better results than stopping at stage 1, thanks to the knowledge transfer between audio and visual.

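From these results, one can conclude that a lack of within-modality calibration, which is what provides the self-instance discriminative property, is harmful for the transfer. The reason given is that good visual representations can reflect visual feature similarities, and beyond that they also influence how much noisy pair features hurt the transfer.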

It shows that stage 2, in which the student and the teacher swap roles, is effective for producing a better model.
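Putting the two stages together, the curriculum can be written down as a simple schedule. This is only my reading of the description in this post: stage 1 calibrates the teacher within its own modality, and stage 2 transfers across modalities and then swaps the roles. Taking visual as the first teacher, the order inside stage 2, and the objective labels are my assumptions.

```python
def curriculum_schedule(first_teacher="visual"):
    """Return the assumed training phases as a list of dicts."""
    if first_teacher not in ("visual", "audio"):
        raise ValueError("first_teacher must be 'visual' or 'audio'")
    other = "audio" if first_teacher == "visual" else "visual"
    return [
        # Stage 1: within-modality calibration of the teacher only.
        {"stage": 1, "teacher": first_teacher, "student": None,
         "objective": "within-modality contrastive"},
        # Stage 2a: the calibrated teacher guides the other modality.
        {"stage": 2, "teacher": first_teacher, "student": other,
         "objective": "cross-modal contrastive"},
        # Stage 2b: roles are swapped, so the former student now teaches.
        {"stage": 2, "teacher": other, "student": first_teacher,
         "objective": "cross-modal contrastive"},
    ]


for phase in curriculum_schedule():
    print(phase)
```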

Analyses on Video Pre-training Strategy

To examine the effect of different pretext tasks in stage-1 of the curriculum learning, the paper compares against 3DRotNet and Clip-order, self-supervised methods that are widely used in video representation learning.

This is an analysis of the pre-training method: everything is set up identically and only the pretext task differs. As training proceeds, the contrastive learning used in the paper gives better results, and the paper attributes this to the large-scale negative samples and the effective training scheme.

Conclusion

This paper explores, in a self-supervised manner, the close correlation between acoustic signals and visual appearance.

It presents a cross-modal knowledge transfer framework that uses contrastive learning within a teacher-student network paradigm. The two-stage self-supervised curriculum learning scheme is the paper's core method, and the experiments and results show that the knowledge shared between the audio and visual modalities can act as a supervisory signal.
