Video Instance Segmentation via Spatio-temporal Embedding and Clustering

Arefi, Farnoosh; Kasaei, Shohreh

Arefi, Farnoosh | 2024

Type of Document: Ph.D. Dissertation
Language: Farsi
Document No: 57823 (19)
University: Sharif University of Technology
Department: Computer Engineering
Advisor(s): Kasaei, Shohreh
Abstract:
Video instance segmentation is a relatively recent computer vision task that requires segmenting, classifying, and tracking object instances across video frames. It has practical applications in autonomous vehicles, surveillance systems, production lines, and medical video analysis. There are two general approaches to the task: object-oriented and pixel-oriented. In the object-oriented approach, instances are first detected at the image level and then segmented and tracked to link them across frames. In the pixel-oriented approach, all spatio-temporal information is used simultaneously, and a space-time mask is predicted gradually, starting from the pixel level. Previous methods have often sought to improve performance through fully-supervised training and the costly representational power of transformer models. Given the cost and limitations of fully-supervised labeling and the growing importance of scalable, economical solutions, there is an increasing shift toward weakly-supervised methods and toward improving the loss function to reduce representational cost. The method proposed in this research follows the pixel-oriented approach and focuses on modeling the consistency of embedding vectors under weakly-supervised training. It introduces a novel bottom-up, discriminative mechanism for embedding vector consistency. The consistency model has two aspects: instance deformation and embedding vector discrimination. To model deformation, each instance's deformation over time is described using prior information and a description of the instance's key regions.
Additionally, to model the discrimination aspect, a semi-supervised criterion based on a support vector is designed to measure how well embedding vectors are discriminated within clusters. Using this criterion as feedback ultimately drives the network parameters to converge toward producing meaningful instance-level embeddings. Dividing the consistency problem into smaller sub-problems and optimizing them dynamically has led to improved segmentation performance over time, while making the problem interpretable and controllable across different datasets. On the standard YouTube-VIS-2019 dataset, the proposed method achieved AP scores of 48.0, 49.8, and 56.7 with the R50, R101, and SwinL backbones, respectively. This significantly narrows the gap between weakly-supervised and fully-supervised methods. The dissertation also provides comprehensive analyses of the proposed mechanisms and their performance in addressing various challenges.
Keywords:
Deep Neural Networks ; Spatio-Temporal Embeddings ; Video Instance Segmentation ; Clustering ; Computer Vision