Hierarchical concept score postprocessing and concept-wise normalization in cnn-based video event recognition

Soltanian, M ; Sharif University of Technology | 2019

  1. Type of Document: Article
  2. DOI: 10.1109/TMM.2018.2844101
  3. Publisher: Institute of Electrical and Electronics Engineers Inc, 2019
  4. Abstract:
  5. This paper focuses on video event recognition based on frame-level convolutional neural network (CNN) descriptors. Using transfer learning, image-trained descriptors are applied to the video domain to make event recognition feasible in scenarios with limited computational resources. After fine-tuning existing CNN concept score extractors pretrained on ImageNet, the outputs of the different fully connected layers are employed as frame descriptors. The resulting descriptors are hierarchically postprocessed and combined with novel and efficient pooling and normalization methods. As the major contributions of this paper to video event recognition, we present a postprocessing scheme in which the hierarchy and the relative shortest distance of concepts in the WordNet concept tree are taken into account to alleviate the uncertainty of the concept scores at the output of the CNN. In addition, we propose a concept-wise power law normalization method that outperforms the widely used power law normalization. Integrating these approaches yields high-performance average (max) pooling-based video event recognition. Compared to average (max) pooling combined with state-of-the-art normalization methods and fine-tuned support vector machine classification, the proposed processing scheme improves event recognition accuracy in terms of mean average precision on the Columbia consumer video and unstructured social activity attribute datasets, while achieving comparable results on the UCF101 and ActivityNet datasets. © 1999-2012 IEEE
  6. Keywords:
  7. ActivityNet dataset ; Columbia consumer video dataset ; Max pooling ; Mean average precision ; Support vector machine ; Unstructured social activity attribute dataset ; WordNet tree ; Classification (of information) ; Convolution ; Feature extraction ; Flow visualization ; Forestry ; Image retrieval ; Job analysis ; Law enforcement ; Neural networks ; Ontology ; Personnel training ; Semantics ; Support vector machines ; Average pooling ; Consumer videos ; Convolutional neural network ; Event detection ; Max-pooling ; Social activities ; Task analysis ; UCF101 dataset ; Wordnet ; Video signal processing
  8. Source: IEEE Transactions on Multimedia ; Volume 21, Issue 1, 2019, Pages 157-172 ; 15209210 (ISSN)
  9. URL: https://ieeexplore.ieee.org/document/8382309
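
The pooling and power law normalization steps named in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not state the concept-wise formula, so the per-dimension exponent used here is an assumption, and the names `pool_frames` and `power_law_normalize` are hypothetical.

```python
import math
from typing import List, Sequence, Union


def pool_frames(frames: Sequence[Sequence[float]], mode: str = "average") -> List[float]:
    """Pool per-frame descriptors into a single video-level descriptor."""
    if not frames:
        raise ValueError("need at least one frame descriptor")
    columns = list(zip(*frames))  # one tuple of values per descriptor dimension
    if mode == "average":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


def power_law_normalize(
    vec: Sequence[float], alphas: Union[float, Sequence[float]]
) -> List[float]:
    """Apply sign(x) * |x|**alpha per dimension, then L2-normalize.

    A single float gives the usual power law normalization. A sequence with
    one exponent per dimension is an ASSUMED stand-in for a concept-wise
    variant; the paper's actual formulation may differ.
    """
    if isinstance(alphas, (int, float)):
        alphas = [float(alphas)] * len(vec)
    if len(alphas) != len(vec):
        raise ValueError("need one exponent per descriptor dimension")
    powered = [math.copysign(abs(x) ** a, x) for x, a in zip(vec, alphas)]
    norm = math.sqrt(sum(p * p for p in powered))
    # Leave an all-zero descriptor unchanged instead of dividing by zero.
    return [p / norm for p in powered] if norm > 0 else powered


# Two frames with three concept scores each (made-up numbers).
frames = [[0.0, 4.0, 1.0], [0.5, 0.0, 1.0]]
video_avg = pool_frames(frames, "average")
video_max = pool_frames(frames, "max")
standard = power_law_normalize(video_avg, 0.5)
concept_wise = power_law_normalize(video_avg, [1.0, 0.5, 0.25])
```

Accepting either a scalar or a per-dimension list of exponents keeps the standard and the assumed concept-wise variants on the same code path, so the two can be compared on identical pooled descriptors.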