A Comprehensive Study of Deep Video Action Recognition

SooHyun2i 2022. 1. 6. 05:09

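This is a survey paper on video action recognition.

The paper was accepted to CVPR 2020. For my research I have settled on video as the domain and plan to study the action recognition task. Since I do not know the area well yet, I am starting with a survey paper, reading and summarizing it to learn the core concepts and get a sense of where the trends are heading.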

Abstract

Video action recognition is one of the representative tasks for video understanding.

Current challenges include modeling long-range temporal information in videos, high computation costs, and incomparable results caused by variance in datasets and evaluation protocols.

This paper provides a comprehensive survey of 200 deep learning based papers on video action recognition.

It covers a wide range, from early attempts at applying deep learning, to two-stream networks, 3D convolutional kernels, and the recent compute-efficient models.

Finally, it discusses open problems and highlights opportunities in video action recognition to encourage new research ideas.

1. Introduction

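Among video understanding tasks, the most important one is understanding human actions.

It can be used in many places, such as behavior analysis, video retrieval, human-robot interaction, gaming and entertainment, and autonomous driving. The task of recognizing human actions in a video is called video action recognition.

The figure above visualizes several video frames together with their action labels, which are typical everyday human activities such as shaking hands and riding a bike.

There is also a figure summarizing the popular action recognition datasets, which is Figure 2 below.

Thanks to the availability of large-scale datasets and the rapid progress of deep learning, deep learning based models for recognizing video actions are also growing quickly.

This figure shows recent representative work in chronological order.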

DeepVideo was the first model to try applying CNNs to video. From there, three trends emerged.

The first trend is the two-stream network, which adds a second path that learns the temporal information in a video by training a CNN on the optical flow stream. This idea inspired later work such as TDD, LRCN, Fusion, and TSN.

The second trend is using 3D convolutional kernels to model the temporal information in video. Examples are I3D, R3D, S3D, Non-local, and SlowFast.
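To make the idea of a 3D kernel concrete, here is a minimal sketch in plain Python. It assumes a single-channel clip, no padding, and stride 1; the function name `conv3d_valid` and the toy clip are made up for illustration and are not taken from I3D or any of the models above. The point is only that the kernel slides over time as well as over height and width, so it can respond to change between frames.

```python
def conv3d_valid(video, kernel):
    """Naive single-channel 3D convolution (cross-correlation), no padding, stride 1.

    video:  nested list indexed [t][y][x]
    kernel: nested list indexed [dt][dy][dx]
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kT, kH, kW = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kT + 1):
        plane = []
        for y in range(H - kH + 1):
            row = []
            for x in range(W - kW + 1):
                acc = 0.0
                # Sum over the temporal extent as well as the spatial extent.
                for dt in range(kT):
                    for dy in range(kH):
                        for dx in range(kW):
                            acc += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy clip: 4 frames of 5x5 pixels, every pixel in frame t has value t.
clip = [[[float(t)] * 5 for _ in range(5)] for t in range(4)]
# A 2x1x1 temporal-difference kernel: next frame minus current frame.
temporal_diff = [[[-1.0]], [[1.0]]]
response = conv3d_valid(clip, temporal_diff)
print(len(response), len(response[0]), len(response[0][0]))  # 3 5 5
```

A 2D kernel applied frame by frame would never see the brightness change between frames, while this kernel responds to it directly. Real models use optimized library operators rather than loops like these.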

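Finally, the third trend focuses on computational efficiency, so that models can scale to larger datasets and be adopted in real applications. Examples are Hidden TSN, TSM, X3D, and TVN.

In this paper, the authors:

Review 200 deep learning papers on video action recognition. They walk through the work chronologically and explain the popular papers in more detail.
Benchmark widely adopted methods on the same datasets in terms of both accuracy and efficiency.
Describe in detail the challenges, open problems, and opportunities in the field to facilitate future research.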

2. Datasets and Challenges

2.1. Datasets

Video action recognition needs large-scale annotated datasets to train models effectively. For this task, a dataset is often built with the following process.

Define an action list, by combining labels from previous action recognition datasets and adding new categories depending on the use case
Obtain videos from various sources, such as YouTube and movies, by matching the video title/subtitle to the action list
Provide temporal annotations manually to indicate the start and end position of the action
Clean up the dataset by de-duplication and filtering out noisy classes/samples
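The last cleanup step can be sketched in a few lines of Python. This is only an illustration under my own assumptions: samples are `(video_id, label)` pairs, duplicates are detected by identical video IDs, and a "noisy" class is approximated as one with too few samples. The name `clean_dataset` and the threshold `min_per_class` are hypothetical, and real pipelines usually also involve manual review.

```python
from collections import Counter


def clean_dataset(samples, min_per_class=2):
    """De-duplicate by video ID, then drop classes with too few samples.

    samples: list of (video_id, label) pairs.
    """
    seen = set()
    unique = []
    for video_id, label in samples:
        if video_id in seen:
            continue  # same video collected twice
        seen.add(video_id)
        unique.append((video_id, label))

    counts = Counter(label for _, label in unique)
    # Assumption: a class with very few samples is treated as noisy.
    return [(v, l) for v, l in unique if counts[l] >= min_per_class]


raw = [
    ("a1", "biking"), ("a2", "biking"), ("a1", "biking"),
    ("b1", "handshake"), ("b2", "handshake"),
    ("c1", "yawning"),
]
cleaned = clean_dataset(raw)
print(cleaned)
```

Here the duplicate `a1` entry and the single-sample class `yawning` are removed, leaving four samples.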

This is the list of popular datasets used for video action recognition.

2.2. Challenges

There are several major challenges in developing effective video action recognition algorithms.

From the dataset perspective:

First, defining the label space for training action recognition models is non-trivial.

This is because human actions are a complex concept, and their hierarchy is not well defined.

Second, annotating videos for action recognition is laborious and ambiguous.

Annotation is hard because you have to watch every video frame, and the exact start and end of an action are ambiguous as well.

Third, some popular benchmark datasets (e.g., Kinetics family) only release the video links for users to download and not the actual video, which leads to a situation that methods are evaluated on different data.

According to the paper, this makes it impossible to compare methods fairly and draw insights.

From the modeling perspective:

First, videos capturing human actions have both strong intra- and inter-class variations.

People can perform the same action at different speeds and from different viewpoints, and some actions share similar movement patterns that are hard to tell apart.

Second, recognizing human actions requires simultaneous understanding of both short-term action-specific motion information and long-range temporal information

This may call for a sophisticated model that can handle these different perspectives, rather than a single convolutional neural network.

Third, the computational cost is high for both training and inference, hindering both the development and deployment of action recognition models

3. An Odyssey of Using Deep Learning for Video Action Recognition

3.1. From hand-crafted features to CNNs

Hand-crafted features have a heavy computational cost and are hard to scale and deploy.

So researchers started applying CNNs to video problems. DeepVideo proposed using a 2D CNN model on each video frame, and investigated several temporal connectivity patterns, such as late, early, and slow fusion, for learning spatio-temporal features for video action recognition. Although the model made meaningful progress with ideas that later proved useful, such as a multi-resolution network, its performance was 20% lower than the hand-crafted IDT features.

DeepVideo also found that a network fed individual video frames performs equally well when the input is changed to a stack of frames. This observation indicates that the learned spatio-temporal features did not capture motion well.

This helps explain why, unlike in other computer vision tasks, CNN models could not beat traditional hand-crafted features in the video domain.

3.2. Two-stream networks

Intuitively, video needs motion information, so finding a way to describe the temporal relationship between frames is essential for improving CNN-based video action recognition.

Optical flow is an effective motion representation for describing object/scene movement.

To be precise, it is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and the scene.

The figure above shows several visualizations of optical flow.

As the figure shows, optical flow can accurately describe each motion pattern.
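As a side note that is not in the survey itself, the usual starting point for estimating optical flow is the brightness constancy assumption: a pixel keeps its intensity as it moves between consecutive frames. Here $I(x, y, t)$ is the image intensity and $(u, v)$ is the flow vector at that pixel (symbols chosen by me for illustration).

```latex
% Brightness constancy: the same point has the same intensity in the next frame.
I(x, y, t) = I(x + u,\; y + v,\; t + 1)

% A first-order Taylor expansion of the right-hand side gives the
% optical flow constraint equation, with partial derivatives I_x, I_y, I_t:
I_x\, u + I_y\, v + I_t = 0
```

This is one equation with two unknowns per pixel, so practical methods add extra assumptions such as smoothness of the flow field.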

The advantage of using optical flow is that it provides information orthogonal to the RGB image.

So the paper Two-Stream Convolutional Networks for Action Recognition in Videos. In Advances in Neural Information Processing Systems (NeurIPS), 2014 proposes a network with a spatial stream and a temporal stream, as in the figure above. The spatial stream takes raw video frames as input to capture visual appearance information. The temporal stream takes a stack of optical flow images as input to capture motion information between video frames.
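Here is a small sketch of how the temporal stream input can be assembled. It assumes each flow field is a pair of horizontal and vertical displacement maps `(u, v)`, and that `L` consecutive flow fields are stacked channel-wise into `2L` channels. The function name `stack_flow`, the tiny image size, and `L = 10` are my own choices for illustration.

```python
def stack_flow(flows):
    """Stack L optical flow fields into 2L input channels.

    flows: list of (u, v) pairs, each an H x W nested list.
    Returns a list of 2L channels ordered u1, v1, u2, v2, ...
    """
    channels = []
    for u, v in flows:
        channels.append(u)  # horizontal displacement
        channels.append(v)  # vertical displacement
    return channels


H, W, L = 4, 6, 10  # hypothetical sizes for the example
flows = [
    ([[0.0] * W for _ in range(H)], [[0.0] * W for _ in range(H)])
    for _ in range(L)
]
temporal_input = stack_flow(flows)
shape = (len(temporal_input), len(temporal_input[0]), len(temporal_input[0][0]))
print(shape)  # (20, 4, 6)
```

The spatial stream, by contrast, would see a single RGB frame with 3 channels, which is why the two streams need first layers of different input width.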

By adding the temporal stream, a CNN-based approach reached performance similar to the earlier hand-crafted features such as IDT. Two important observations came out of this. First, motion information is important for video action recognition. Second, it is still hard for CNNs to learn temporal information directly from raw video frames. Pre-computing optical flow as the motion representation is an effective way for deep learning to show its power.

Because this narrowed the gap between hand-crafted features and deep learning approaches, many follow-up papers on two-stream networks appeared and greatly advanced video action recognition.

3.2.1 Using deeper network architectures

The two-stream network used a relatively shallow network architecture, so a natural next step was to make the network deeper. However, Limin Wang, Zhe Wang, Yuanjun Xiong, and Yu Qiao. CUHK and SIAT Submission for THUMOS15 Action Recognition Challenge. THUMOS’15 Action Recognition Challenge, 2015 found that simply using a deeper network does not give better results, because of overfitting on small video datasets. To prevent overfitting, they introduce a set of good practices: cross-modality initialization, synchronized batch normalization, corner cropping and multi-scale cropping data augmentation, and a large dropout ratio. With these practices, they were able to train a two-stream network with the VGG16 model on UCF101.
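Of these practices, corner cropping and multi-scale cropping are easy to sketch. The code below only computes crop boxes and candidate crop sizes. The frame size 340x256, the base size 256, and the scale ratios are values I picked for the example and should be treated as assumptions, as should the function names.

```python
def corner_crop_boxes(frame_w, frame_h, crop_w, crop_h):
    """Return (left, top, right, bottom) boxes for the 4 corners and the center."""
    max_x = frame_w - crop_w
    max_y = frame_h - crop_h
    offsets = [
        (0, 0),                    # top-left
        (max_x, 0),                # top-right
        (0, max_y),                # bottom-left
        (max_x, max_y),            # bottom-right
        (max_x // 2, max_y // 2),  # center
    ]
    return [(x, y, x + crop_w, y + crop_h) for x, y in offsets]


def multi_scale_sizes(base, scales=(1.0, 0.875, 0.75, 0.66)):
    """Candidate crop side lengths; one is sampled per training example."""
    return [int(base * s) for s in scales]


boxes = corner_crop_boxes(340, 256, 224, 224)
sizes = multi_scale_sizes(256)
print(boxes)
print(sizes)
```

In training, one would randomly pick a box and a size, crop, and then resize the crop to the fixed network input size. Sampling corners in addition to the center keeps the network from attending only to the middle of the frame.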

These practices are still in use today. TSN later explored network architectures such as VGG16, ResNet, and Inception, and showed that deeper networks generally achieve higher accuracy for video action recognition.

3.2.2 Two-stream fusion

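Since a two-stream network has two streams, the outputs of the two networks have to be merged to get the final prediction. This stage is usually called the spatial-temporal fusion step. The simplest approach is late fusion, which takes a weighted average of the predictions from the two streams. It was widely used, but researchers did not claim it was the optimal way. They thought that earlier interaction between the two networks could help both streams during model training, and this is called early fusion.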
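Late fusion is simple enough to write down directly. The sketch below assumes each stream outputs class logits, converts them to probabilities with softmax, and takes a weighted average. The weights `1.0` and `1.5` and the example logits are arbitrary values for illustration, not numbers from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(spatial_logits, temporal_logits, w_spatial=1.0, w_temporal=1.5):
    """Weighted average of the two streams' class probabilities."""
    ps = softmax(spatial_logits)
    pt = softmax(temporal_logits)
    total = w_spatial + w_temporal
    return [(w_spatial * a + w_temporal * b) / total for a, b in zip(ps, pt)]


# The spatial stream prefers class 0, the temporal stream prefers class 1.
fused = late_fusion([2.0, 1.0, 0.1], [0.5, 2.5, 0.3])
predicted = max(range(len(fused)), key=fused.__getitem__)
print(predicted)  # 1
```

Note that the two streams never exchange information before this final step, which is exactly the limitation that early fusion tries to address.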

The paper Convolutional Two-Stream Network Fusion for Video Action Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016 shows that early fusion helps both streams learn richer features and improves performance over late fusion. A follow-up paper applied the idea with ResNet, and building on that, another paper additionally proposed a multiplicative gating function for residual networks. All of these aim to learn better spatio-temporal features.

3.2.3 Recurrent neural networks

Since a video is essentially a temporal sequence, researchers explored applying RNNs, especially LSTMs.

LRCN and Beyond-Short-Snippets are the first papers to apply LSTM to video action recognition in the two-stream network setting.

5. Discussion and Future Work

5.1. Analysis and insights

What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018 analyzed the effect of temporal information on video understanding. It gives an answer to the question of how important motion is in a video for recognizing the action.

5.2. Data augmentation

Useful. The paper mentions SimCLR. Worth using a lot.

5.3. Video domain adaptation
Accuracy on standard datasets keeps going up, but the ability of current video models to generalize across datasets or domains is less explored.

However, the existing literature focuses on small-scale video DA with only a few overlapping categories, which does not reflect the real domain discrepancy and can lead to biased conclusions. Min-Hung Chen, Zsolt Kira, Ghassan AlRegib, Jaekwon Yoo, Ruxin Chen, and Jian Zheng. Temporal Attentive Alignment for Large-Scale Video Domain Adaptation introduces two large-scale datasets to investigate video DA and finds that aligning temporal dynamics is particularly useful. A different paper adopts co-attention to address the temporal misalignment problem.

Video DA is a promising direction, especially for researchers with limited computing resources.

5.4. Neural architecture search

5.5. Efficient model development

Accuracy is good, but actually deploying deep learning methods for video understanding in real applications is hard.

The major challenges are: (1) most methods are developed in offline settings, which means the input is a short video clip, not a video stream in an online setting; (2) most methods do not meet the real-time requirement; (3) incompatibility of 3D convolutions or other non-standard operators on non-GPU devices (e.g., edge devices).
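To illustrate the gap between offline clips and an online stream, here is a rough sketch of a sliding-window buffer that turns incoming frames into overlapping clips, plus a simple latency check. The class `StreamingClipBuffer`, the function `meets_real_time`, and the values `clip_len=8`, `stride=4`, and 30 fps are all hypothetical choices for the example, not something proposed in the paper.

```python
from collections import deque


class StreamingClipBuffer:
    """Keeps the latest clip_len frames and emits a clip every stride frames."""

    def __init__(self, clip_len=8, stride=4):
        self.clip_len = clip_len
        self.stride = stride
        self.frames = deque(maxlen=clip_len)  # old frames fall off automatically
        self.since_last = 0

    def push(self, frame):
        """Add one frame; return a clip when a new prediction is due, else None."""
        self.frames.append(frame)
        self.since_last += 1
        if len(self.frames) == self.clip_len and self.since_last >= self.stride:
            self.since_last = 0
            return list(self.frames)
        return None


def meets_real_time(latency_s, stride, fps):
    """A model keeps up if one inference finishes before the next clip is due."""
    return latency_s <= stride / fps


buffer = StreamingClipBuffer(clip_len=8, stride=4)
# Integers stand in for frames arriving one at a time.
clips = [c for c in (buffer.push(i) for i in range(16)) if c is not None]
print(len(clips))  # 3
print(meets_real_time(0.1, stride=4, fps=30))  # True
```

With these settings a prediction is due every 4 frames, so at 30 fps the model has about 0.133 seconds per clip. A model that takes longer than that falls behind the stream, which is challenge (2) in concrete terms.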

5.6. New datasets

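For datasets, long-term temporal modeling is important. New datasets are needed to push performance further.

Most datasets these days are collected from YouTube. But YouTube recently started blocking mass downloading from a single IP, so many researchers may struggle to even get the data to work in this field, and some videos are unavailable because of region limitations or privacy issues. For example, the Kinetics400 dataset has 300K videos, but when you actually crawl it you can only get about 280K. The problem is that it is impossible to compare methods fairly when they are trained and evaluated on different data.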
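The numbers above make the problem easy to quantify. The sketch below computes the coverage of a crawl and the subset of videos that two groups both managed to download, which is the only data on which their results are directly comparable. The function names and the toy video IDs are made up for illustration.

```python
def coverage(available, official_total):
    """Fraction of the official dataset that could actually be downloaded."""
    return available / official_total


def common_subset(crawl_a, crawl_b):
    """Video IDs that both crawls contain, sorted for reproducibility."""
    return sorted(set(crawl_a) & set(crawl_b))


ratio = coverage(280_000, 300_000)
print(f"{ratio:.1%}")  # 93.3%

# Two groups crawl the same link list at different times and lose different videos.
group_a = ["v1", "v2", "v3", "v5"]
group_b = ["v1", "v3", "v4", "v5"]
shared = common_subset(group_a, group_b)
print(shared)  # ['v1', 'v3', 'v5']
```

Even when two crawls have similar coverage, they may be missing different videos, so the shared subset can be noticeably smaller than either crawl.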

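5.7. Video adversarial attack

5.8. Zero-shot action recognition

The goal of ZSL is to transfer learned knowledge in order to classify categories that have never been seen before.

Most methods follow a standard framework: first extract visual features from the video with a pre-trained network, then train a joint model that maps the visual embedding into a semantic embedding space.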
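The inference side of this framework can be sketched as a projection followed by a nearest-neighbor search. The code assumes the visual feature has already been extracted, the joint model is a single linear map `W`, and each unseen class has a semantic embedding such as a word vector. All names and numbers here are toy values chosen for illustration.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def zero_shot_predict(visual_feature, W, class_embeddings):
    """Project the visual feature into semantic space, return the closest class."""
    projected = matvec(W, visual_feature)
    return max(class_embeddings, key=lambda name: cosine(projected, class_embeddings[name]))


# Hypothetical learned map from a 3-d visual space to a 2-d semantic space.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
# Semantic embeddings of classes that were never seen during training.
unseen = {"surfing": [1.0, 0.1], "juggling": [0.1, 1.0]}
label = zero_shot_predict([0.2, 0.9, 0.5], W, unseen)
print(label)  # juggling
```

No video of an unseen class is needed at training time. Only its semantic embedding is required at test time, which is what makes the setting zero-shot.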
