Video Analysis based on Visual Events

Soltanian, Mohammad | 2019

  1. Type of Document: Ph.D. Dissertation
  2. Language: Farsi
  3. Document No: 52364 (05)
  4. University: Sharif University of Technology
  5. Department: Electrical Engineering
  6. Advisor(s): Ghaemmaghami, Shahrokh
  7. Abstract:
  8. Recognition of complex visual events has attracted much interest in recent years. Compared to somewhat similar tasks, such as action recognition, event recognition is considerably more complex, primarily because of the large intra-class variation of events, variable video durations, the lack of pre-imposed video structure, and severe preprocessing noise. To deal with these complexities and improve on state-of-the-art approaches to video understanding, this thesis focuses on video event recognition based on frame-level CNN descriptors. Using transfer learning, image-trained descriptors are applied to the video domain, making event recognition feasible in scenarios with limited computational resources. After fine-tuning Convolutional Neural Network (CNN) concept-score extractors, the outputs of the different fully connected layers are employed as frame descriptors. The resulting descriptors are hierarchically post-processed, encoded, and combined using novel and efficient pooling and normalization methods. As the first major contribution of this work to video event recognition, we present a post-processing scheme in which the hierarchy and the relative semantic distances of concepts are taken into account to alleviate the uncertainty of the concept scores at the output of the CNN. As the second main contribution, we propose a concept-wise power-law normalization (CPN) method that outperforms the widely used power-law normalization (PN). The next major contribution of this thesis concerns the incorporation of temporal information into video descriptor coding. In encoding video descriptors, the structural incorporation of temporal visual cues into the encoding process is often ignored, which reduces recognition accuracy. We therefore propose a spatio-temporal video encoding method that improves the trade-off between computational complexity and accuracy in a video event recognition task.
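The abstract does not spell out the CPN formulation; as a rough illustration only (the function names, the toy scores, and the idea of one exponent per concept dimension are all assumptions, not the thesis's method), standard power-law normalization and a concept-wise variant might be sketched as:

```python
import numpy as np

def power_law_normalize(x, alpha=0.5):
    """Standard power-law normalization (PN): signed power, then L2 norm."""
    y = np.sign(x) * np.abs(x) ** alpha
    n = np.linalg.norm(y)
    return y / n if n > 0 else y

def conceptwise_power_law_normalize(x, alphas):
    """Hypothetical sketch of a concept-wise PN (CPN): a separate
    exponent per concept dimension instead of one global alpha."""
    y = np.sign(x) * np.abs(x) ** np.asarray(alphas)
    n = np.linalg.norm(y)
    return y / n if n > 0 else y

scores = np.array([0.9, -0.1, 0.3, 0.05])  # toy CNN concept scores
pn = power_law_normalize(scores)
cpn = conceptwise_power_law_normalize(scores, [0.3, 0.5, 0.5, 0.7])
```

Both variants dampen dominant scores and then L2-normalize; the per-dimension exponents are where a concept-wise scheme gains its extra flexibility.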
The temporal dimension of video signals is utilized to construct a spatio-temporal vector of locally aggregated descriptors (VLAD) encoding scheme. The proposed encoding is formulated as a convex optimization problem that is solved analytically, yielding a closed-form solution. The next main outcome of this thesis is a video encoding method that takes soft-assignment VLAD (SAVLAD) as its basis and exploits the sparsity of the descriptors in the difference domain to achieve high-performance spatio-temporal video encoding. This encoding takes the form of a generalized fused LASSO problem, which is solved with the alternating direction method of multipliers (ADMM). Compared to state-of-the-art video event recognition schemes based on frame-level descriptors, the proposed methods achieve better semantic video modeling. The proposed processing schemes improve event recognition accuracy, or are highly competitive, in terms of mean Average Precision (mAP) on the Columbia Consumer Video (CCV), Unstructured Social Activity Attribute (USAA), UCF101, and ActivityNet datasets.
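For context, the baseline on which the spatio-temporal and soft-assignment variants build is plain hard-assignment VLAD. A minimal sketch (the function name and toy data are assumptions; in practice the centers come from k-means on training descriptors):

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """Vanilla VLAD: hard-assign each local descriptor to its nearest
    center, accumulate residuals per center, then apply power-law
    and L2 normalization to the flattened encoding."""
    K, D = centers.shape
    # squared distances (N, K) and nearest-center assignment
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    V = np.zeros((K, D))
    for k in range(K):
        sel = descriptors[assign == k]
        if len(sel):
            V[k] = (sel - centers[k]).sum(axis=0)  # residual sum
    v = V.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))  # power-law normalization
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 4))   # toy frame-level descriptors
centers = rng.standard_normal((8, 4))   # toy codebook (k-means in practice)
v = vlad_encode(frames, centers)        # encoding of length K * D = 32
```

Soft-assignment VLAD replaces the hard `argmin` with assignment weights, and the spatio-temporal extensions in the thesis further constrain the encoding along the time axis.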
  9. Keywords:
  10. Maximal Pooling ; WordNet ; Convolutional Neural Network ; Columbia Consumer Video (CCV) Dataset ; Unstructured Social Activity Attribute (USAA) Dataset ; UCF101 Dataset ; ActivityNet Dataset ; Support Vector Machine (SVM) ; Vector of Locally Aggregated Descriptors (VLAD) Encoding ; Convex Optimization ; Spatio-Temporal Encoding
