2022-TPAMI A Survey on Deep Learning Technique for Video Segmentation

motivation

The paper focuses on two basic lines of research: generic object segmentation (of unknown categories)
in videos, and video semantic segmentation. The authors also maintain a GitHub project
(https://github.com/tfzhou/VS-Survey) for tracking developments in the field.

method

  1. Video segmentation tasks are mainly categorized into the following taxonomy:
  • AVOS (automatic video object segmentation)
  • SVOS (semi-automatic video object segmentation)
  • IVOS (interactive video object segmentation)
  • LVOS (language-guided video object segmentation)
  • VSS (video semantic segmentation)
  • VIS (video instance segmentation, also termed “video multi-object segmentation”)
  • VPS (video panoptic segmentation)
  2. The differences between VOS and VSS:
  • VOS often focuses on human-created media, which typically exhibit large camera motion, deformation,
    and appearance changes.
  • VSS instead often targets applications such as autonomous driving, which require a good
    trade-off between accuracy and latency, accurate detection of small objects, model parallelization, and cross-domain
    generalization ability.
  3. Landmark efforts in video segmentation:
  • 3.1 deep learning based VOS methods

    • 3.1.1 AVOS

      • Deep learning module-based methods
      • Pixel instance embedding-based methods
      • End-to-end methods with short-term information encoding (mainstream):
        (1) RNNs;
        (2) two-stream networks over the raw image and optical flow (see the sketch after this list).
      • End-to-end methods with long-term information encoding
      • Un-/weakly-supervised methods (only a handful)
      • Instance-level AVOS methods
        Overall, current instance-level AVOS models follow the classic tracking-by-detection paradigm, involving
        several ad-hoc designs.
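
      As a concrete illustration of the two-stream idea mentioned above, here is a minimal PyTorch-style sketch (the module names, layer sizes, and fusion scheme are illustrative assumptions, not the design of any specific paper): an appearance stream encodes the RGB frame, a motion stream encodes the optical flow field, and the fused features feed a head that predicts per-pixel fore-/background logits.

```python
import torch
import torch.nn as nn

class TwoStreamAVOS(nn.Module):
    """Illustrative two-stream AVOS model: RGB appearance stream + optical-flow motion stream."""
    def __init__(self, feat_dim=64):
        super().__init__()
        # Appearance stream: encodes the raw RGB frame (3 channels).
        self.appearance = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # Motion stream: encodes the 2-channel optical flow field.
        self.motion = nn.Sequential(
            nn.Conv2d(2, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # Fusion + segmentation head: one foreground logit per pixel.
        self.head = nn.Conv2d(2 * feat_dim, 1, 1)

    def forward(self, frame, flow):
        fused = torch.cat([self.appearance(frame), self.motion(flow)], dim=1)
        return self.head(fused)  # (B, 1, H, W) foreground logits

# Usage: one RGB frame plus its precomputed optical flow to the previous frame.
model = TwoStreamAVOS()
mask_logits = model(torch.randn(1, 3, 128, 128), torch.randn(1, 2, 128, 128))
# Threshold torch.sigmoid(mask_logits) to obtain the binary object mask.
```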

    • 3.1.2 SVOS
      Deep learning-based SVOS methods mainly focus on the first-frame mask propagation setting.

      • Online fine-tuning-based methods
        (1) Step 1: offline pretraining to learn general object features;
        (2) Step 2: online fine-tuning to adapt to the specific target object.
      • Propagation-based methods
        These methods easily suffer from error accumulation caused by occlusion and drift during mask propagation.
      • Matching-based methods
        Perhaps the most promising SVOS solution: each pixel is labelled according to its similarity to the target object in an embedding space.
        The STM (space-time memory) model has been the star of this line of work (see the sketch after this list).
      • Box-initialization-based methods
      • Un-/weakly-supervised methods
      • Other specific methods
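
      To make the matching idea concrete, the following sketch shows the core space-time memory read in the spirit of STM (the tensor shapes, scaling, and variable names are assumptions for illustration, not the exact published implementation): keys of the query frame attend over keys of memorized past frames, and the retrieved memory values, which carry the target-mask information, are concatenated with the query values before decoding.

```python
import torch
import torch.nn.functional as F

def memory_read(query_key, query_value, mem_key, mem_value):
    """Illustrative space-time memory read.

    query_key:   (B, Ck, H*W)    keys of the current (query) frame
    query_value: (B, Cv, H*W)    values of the current frame
    mem_key:     (B, Ck, T*H*W)  keys of T memorized past frames
    mem_value:   (B, Cv, T*H*W)  values of memorized frames (encode their masks)
    """
    # Affinity between every query pixel and every memory pixel.
    affinity = torch.einsum('bck,bcm->bkm', query_key, mem_key)       # (B, H*W, T*H*W)
    affinity = F.softmax(affinity / query_key.shape[1] ** 0.5, dim=-1)
    # Each query pixel retrieves a weighted sum of memory values.
    read = torch.einsum('bkm,bcm->bck', affinity, mem_value)          # (B, Cv, H*W)
    # Concatenate the retrieved memory with the query's own value for the decoder.
    return torch.cat([read, query_value], dim=1)                      # (B, 2*Cv, H*W)

# Usage with toy shapes: 1 memorized frame, a 16x16 feature map.
B, Ck, Cv, HW, T = 1, 8, 16, 16 * 16, 1
out = memory_read(torch.randn(B, Ck, HW), torch.randn(B, Cv, HW),
                  torch.randn(B, Ck, T * HW), torch.randn(B, Cv, T * HW))
```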

    • 3.1.3 IVOS (interactive video object segmentation)

    • 3.1.4 LVOS (language-guided video object segmentation)

      • Dynamic convolution-based methods (see the sketch after this list)
      • Capsule routing-based methods
      • Attention-based methods
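
      The dynamic-convolution idea for LVOS can be sketched as follows (a minimal illustration under assumed shapes; the text encoder, kernel size, and layer names are placeholders): a sentence embedding is mapped to the weights of a convolution kernel, which is then applied to the frame's visual feature map so that the response map highlights the referred object.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageDynamicConv(nn.Module):
    """Illustrative dynamic convolution for language-guided video object segmentation."""
    def __init__(self, text_dim=300, vis_dim=64):
        super().__init__()
        self.vis_dim = vis_dim
        # Map the pooled sentence embedding to the weights of a 1x1 conv kernel.
        self.kernel_gen = nn.Linear(text_dim, vis_dim)

    def forward(self, vis_feat, text_emb):
        # vis_feat: (B, vis_dim, H, W) frame features; text_emb: (B, text_dim)
        B = vis_feat.shape[0]
        kernels = self.kernel_gen(text_emb).view(B, 1, self.vis_dim, 1, 1)
        # Convolve each sample's features with its own language-generated kernel.
        responses = [F.conv2d(vis_feat[i:i + 1], kernels[i]) for i in range(B)]
        return torch.cat(responses, dim=0)  # (B, 1, H, W) response map for the referred object

# Usage: visual features from a frame encoder and a pooled sentence embedding.
model = LanguageDynamicConv()
response = model(torch.randn(2, 64, 32, 32), torch.randn(2, 300))
```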

  • 3.2 Deep learning-based VSS methods

    • 3.2.1 (Instance-agnostic) video semantic segmentation
    • 3.2.2 Video Instance Segmentation (VIS)
    • 3.2.3 Video Panoptic Segmentation (VPS)

  4. Datasets

  • 4.1 AVOS/SVOS/IVOS Datasets
    • YouTube-Objects
      It is a large dataset of 1,407 video shots collected from 155 web videos belonging to 10 object categories (e.g., dog, cat, plane). VOS models typically test their
      generalization ability on a subset of 126 shots with 20,647 frames, which provides coarse pixel-level fore-/background annotations in every 10th frame.
    • FBMS59
      It consists of 59 video sequences with 13,860 frames in total. However, only 720 frames are annotated for fore-/background separation. The dataset is split into 29 and
      30 sequences for training and evaluation, respectively.
    • DAVIS16
      It has 50 videos (30 for the train set and 20 for the val set) with 3,455 frames in total. For each frame, in addition to a high-quality fore-/background segmentation annotation,
      a set of attributes (e.g., deformation, occlusion, motion blur) is also provided to highlight the main challenges.
    • DAVIS17
      It contains 150 videos, i.e., 60/30/30/30 videos for train/val/test-dev/test-challenge sets. Its train and val sets are extended from the respective sets in DAVIS16. There
      are 10,459 frames in total. DAVIS17 provides instance-level annotations to support SVOS. The DAVIS18 challenge then provides scribble annotations to support IVOS. Moreover, as
      the original annotations of DAVIS17 are biased towards the SVOS scenario, the DAVIS19 challenge re-annotates the val and test-dev sets of DAVIS17 to support AVOS.
    • YouTube-VOS
      It contains 150 videos, i.e., 60/30/30/30 videos for train/val/test-dev/test-challenge sets. Its train and val sets are extended from the respective sets in DAVIS16. There
      are 10,459 frames in total. DAVIS17 provides instance-level annotations to support SVOS. Then, DAVIS18 challenge provides scribble annotations to support IVOS. Moreover, as
      the original annotations of DAVIS17 are biased towards the SVOS scenario, DAVIS19 challenge re-annotates val and test-dev sets of DAVIS17 to support AVOS.
    • Remark
      YouTube-Objects, FBMS59, and DAVIS16 are used for instance-agnostic AVOS and SVOS evaluation. DAVIS17 is unique in providing comprehensive annotations for instance-level
      AVOS, SVOS, and IVOS, but its scale is relatively small. YouTube-VOS is the largest one but only supports SVOS benchmarking. Some other VOS datasets also exist,
      such as SegTrackV1 and SegTrackV2, but they have been used less recently due to their limited scale and difficulty.
  • 4.2 LVOS Datasets

    • A2D Sentence
    • J-HMDB Sentence
    • DAVIS17-RVOS
    • Refer-YouTube-VOS
  • 4.3 VSS Datasets

    • CamVid
      It is composed of 4 urban scene videos with 11-class pixelwise annotations; every 30th frame of each video is annotated. The annotated frames are usually grouped
      into 467/100/233 for train/val/test.
    • CityScapes
      It is a large-scale VSS dataset for street views. It has 2,975/500/1,525 snippets for train/val/test, captured at 17 FPS. Each snippet contains 30 frames,
      and only the 20th frame is densely labelled with 19 semantic classes. 20,000 coarsely annotated frames are also provided.
    • NYUDv2
      It contains 518 indoor RGB-D videos with high-quality ground truths (every 10th video frame is labelled). 795 training frames and 654 testing frames
      are rectified and annotated with 40-class semantic labels.
    • VSPW
      It is a recently proposed large-scale VSS dataset. It addresses video scene parsing in the wild by considering diverse scenarios. It consists of 3,536 videos, and provides
      pixel-level annotations for 124 categories at 15 FPS. The train/val/test sets contain 2,806/343/387 videos with 198,244/24,502/28,887 frames, respectively.
    • YouTube-VIS
      It is built upon YouTube-VOS with instance-level annotations. Its newest 2021 version has 3,859 videos (2,985/421/453 for train/val/test) with 40 semantic categories. It provides 232K high-quality annotations
      for 8,171 unique video instances.
    • KITTI MOTS
      It extends the 21 training sequences of the KITTI tracking dataset with VIS annotations – 12 sequences for training and 9 for validation. The dataset
      contains 8,008 frames with a resolution of 375×1242, 26,899 annotated cars, and 11,420 annotated pedestrians.
    • MOTSChallenge
      It annotates 4 of 7 training sequences of MOTChallenge2017. It has 2,862 frames with 26,894 annotated pedestrians and presents many occlusion cases.
    • BDD100K
      It is a large-scale dataset with 100K driving videos (40 seconds at 30 FPS each) that supports various tasks, including VSS and VIS. For VSS, 7,000/1,000/2,000
      frames are densely labelled with 40 semantic classes for train/val/test. For VIS, 90 videos with 8 semantic categories are annotated with 129K instance masks – 60 training
      videos, 10 validation videos, and 20 testing videos.
    • OVIS
      It is a new, challenging VIS dataset in which object occlusions frequently occur. It has 901 videos and 296K high-quality instance masks for 25 semantic categories. It is split
      into 607 training, 140 validation and 154 test videos.
    • VIPER-VPS
      It re-organizes VIPER into the video panoptic format. VIPER, extracted from the GTA-V game engine, has annotations of semantic and instance segmentations for 10 thing and 13 stuff classes on 254K frames of
      ego-centric driving scenes at 1080×1920 resolution.
    • Cityscapes-VPS
      It is built upon CityScapes. Dense panoptic annotations for 8 thing and 11 stuff classes are provided for every fifth frame of the 500 snippets in the Cityscapes val set,
      together with temporally consistent instance ids for the thing objects, leading to 3,000 annotated frames in total. These videos are split into 400/100 for train/val.
    • Remark
      CamVid, CityScapes, NYUDv2, and VSPW are built for VSS benchmarking. YouTube-VIS, OVIS, KITTI MOTS, and MOTSChallenge are VIS datasets, but the diversity of the last two is limited. BDD100K has both VSS and
      VIS annotations. VIPER-VPS and Cityscapes-VPS support VPS evaluation, but VIPER-VPS is a synthetic dataset.
