Spatio-temporal VLAD encoding of visual events using temporal ordering of the mid-level deep semantics

Soltanian, M ; Amini, S ; Ghaemmaghami, S ; Sharif University of Technology

Soltanian, M ; Sharif University of Technology | 2020

Type of Document: Article
DOI: 10.1109/TMM.2019.2959426
Publisher: Institute of Electrical and Electronics Engineers Inc , 2020
Abstract:
Classification of video events based on frame-level descriptors is a common approach to video recognition, and proper encoding of these descriptors is vital to the whole event classification procedure. Although several efficient video descriptor encoding methods exist, they often ignore the temporal ordering of the descriptors. In this paper, we show that taking into account the temporal inter-frame dependencies and tracking the chronological order of video sub-events further improves the accuracy of event recognition. First, frame-level descriptors are extracted using convolutional neural networks (CNNs) pre-trained on ImageNet and fine-tuned on a portion of the training video frames. Then, a spatio-temporal encoding is applied to the derived descriptors. The proposed spatio-temporal encoding, the main contribution of this work, is inspired by the well-known vector of locally aggregated descriptors (VLAD) encoding in the spatial domain and by total variation de-noising (TVD) in the temporal domain. The unified spatio-temporal encoding is then shown to take the form of a convex optimization problem, which is solved efficiently with the alternating direction method of multipliers (ADMM) algorithm. The experimental results show the superiority of the proposed encoding method in terms of recognition accuracy over both frame-level video encoding approaches and spatio-temporal video representations. Compared to state-of-the-art approaches, our encoding method improves the mean average precision (mAP) on the Columbia consumer video (CCV), unstructured social activity attribute (USAA), YouTube-8M, and Kinetics datasets and is very competitive on the FCVID dataset. © 1999-2012 IEEE
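The abstract's motivation, that standard encodings discard temporal order, can be illustrated with a small sketch. The code below is plain VLAD plus a simple total-variation measure, not the spatio-temporal encoding or the ADMM solver proposed in the paper. The function names, the toy descriptors, and the two cluster centers are all hypothetical, chosen only for illustration.

```python
import math

def nearest_center(x, centers):
    """Index of the center closest to descriptor x (squared Euclidean)."""
    return min(range(len(centers)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centers[k])))

def vlad_encode(descriptors, centers):
    """Plain VLAD: per-center sum of residuals, flattened and L2-normalized."""
    dim = len(centers[0])
    acc = [[0.0] * dim for _ in centers]
    for x in descriptors:
        k = nearest_center(x, centers)
        for j in range(dim):
            acc[k][j] += x[j] - centers[k][j]
    flat = [v for row in acc for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat

def total_variation(signal):
    """Sum of absolute differences between consecutive samples."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:]))

# Toy data (hypothetical): four 2-D frame descriptors and two centers.
centers = [[0.0, 0.0], [10.0, 10.0]]
frames = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [10.0, 11.0]]
shuffled = [frames[2], frames[0], frames[3], frames[1]]

# Per-frame cluster assignments, in temporal order.
ordered_ids = [nearest_center(x, centers) for x in frames]
shuffled_ids = [nearest_center(x, centers) for x in shuffled]

# Plain VLAD is identical for both orderings; the total variation of the
# assignment sequence is not, so it carries the ordering information.
print(vlad_encode(frames, centers) == vlad_encode(shuffled, centers))
print(total_variation(ordered_ids), total_variation(shuffled_ids))
```

In this toy case the two frame orderings yield the same VLAD vector, while the assignment sequence of the shuffled video switches clusters more often and therefore has a larger total variation. This is the kind of temporal signal that an order-aware encoding can exploit.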
Keywords:
Columbia consumer video (CCV) ; Convolutional neural network ; FCVID ; Kinetics ; Projected gradient descent ; Support vector machine ; Unstructured social activity attribute (USAA) ; YouTube-8M ; Convex optimization ; Convolutional neural networks ; Encoding (symbols) ; Semantics ; Signal encoding ; Alternating direction method of multipliers ; Chronological order ; Convex optimization problems ; Event classification ; Inter-frame dependency ; Recognition accuracy ; State-of-the-art approach ; Vector of locally aggregated descriptors ; Video signal processing
Source: IEEE Transactions on Multimedia ; Volume 22, Issue 7 , 2020 , Pages 1769-1784
URL: https://ieeexplore.ieee.org/document/8931623