Hierarchical concept score post-processing and concept-wise normalization in CNN based video event recognition

Soltanian, M ; Sharif University of Technology | 2018

  1. Type of Document: Article
  2. DOI: 10.1109/TMM.2018.2844101
  3. Publisher: Institute of Electrical and Electronics Engineers Inc., 2018
  4. Abstract:
  5. This paper focuses on video event recognition based on frame-level CNN descriptors. Using transfer learning, the image-trained descriptors are applied to the video domain to make event recognition feasible in scenarios with limited computational resources. After fine-tuning existing Convolutional Neural Network (CNN) concept score extractors, pre-trained on ImageNet, the output descriptors of the different fully connected layers are employed as frame descriptors. The resulting descriptors are hierarchically post-processed and combined with novel and efficient pooling and normalization methods. As the major contributions of this work to video event recognition, we present a post-processing scheme in which the hierarchy and the relative shortest distance of concepts in the WordNet concept tree are taken into account to alleviate the uncertainty of the resulting concept scores at the output of the CNN. In addition, we propose a concept-wise power law normalization (CPN) method that outperforms the widely used power law normalization (PN). Integrating these approaches yields a high-performance average (max) pooling based video event recognition scheme. Compared to average (max) pooling combined with state-of-the-art normalization methods and fine-tuned support vector machine (SVM) classification, the proposed processing scheme improves event recognition accuracy in terms of mean Average Precision (mAP) on the Columbia Consumer Video (CCV) and Unstructured Social Activity Attribute (USAA) datasets, while achieving comparable results on the UCF101 and ActivityNet datasets.
  6. Keywords:
  7. Columbia consumer video dataset ; Max pooling ; Support vector machine ; Training ; Unstructured social activity attribute dataset ; Visualization ; Wordnet tree ; Classification (of information) ; Convolution ; Feature extraction ; Flow visualization ; Forestry ; Image retrieval ; Job analysis ; Law enforcement ; Neural networks ; Ontology ; Personnel training ; Semantics ; Support vector machines ; Activitynet dataset ; Average pooling ; Consumer videos ; Convolutional neural network ; Event detection ; Max-pooling ; Mean average precision ; Social activities ; Task analysis ; UCF101 dataset ; Wordnet ; Video signal processing
  8. Source: IEEE Transactions on Multimedia ; Volume: 21, Issue: 1, Jan 2019, 157-172 ; 15209210 (ISSN)
  9. URL: https://ieeexplore.ieee.org/document/8382309
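
The abstract refers to average (max) pooling of frame descriptors followed by power law normalization. As a rough illustration only, the sketch below pools per-frame concept scores into one video-level descriptor and applies the standard signed power law, sign(x)·|x|^alpha, followed by L2 normalization. The abstract does not define CPN, so letting `alphas` hold one exponent per concept is purely an assumption made here for illustration and should not be read as the paper's method; the function names `pool_frames`, `power_normalize`, and `l2_normalize` are hypothetical.

```python
import math
from typing import List, Sequence, Union


def pool_frames(frames: Sequence[Sequence[float]], mode: str = "average") -> List[float]:
    """Pool per-frame descriptors (one sequence per frame) into one video descriptor."""
    if not frames:
        raise ValueError("need at least one frame descriptor")
    columns = list(zip(*frames))  # one tuple of scores per concept/dimension
    if mode == "average":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode!r}")


def power_normalize(vec: Sequence[float],
                    alphas: Union[float, Sequence[float]]) -> List[float]:
    """Signed power law: sign(x) * |x| ** alpha.

    A single float applies the same exponent to every dimension (plain PN).
    A sequence gives one exponent per dimension; this per-concept form is an
    illustrative assumption, not the CPN definition from the paper.
    """
    if isinstance(alphas, (int, float)):
        alphas = [float(alphas)] * len(vec)
    if len(alphas) != len(vec):
        raise ValueError("need one exponent per dimension")
    return [math.copysign(abs(x) ** a, x) for x, a in zip(vec, alphas)]


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale the vector to unit Euclidean length (unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


if __name__ == "__main__":
    # Toy concept scores: 3 frames, 2 concepts.
    frames = [[0.0, 4.0], [1.0, 0.0], [2.0, 2.0]]
    video = pool_frames(frames, mode="average")
    print(l2_normalize(power_normalize(video, 0.5)))
```

The resulting vector would then typically be passed to a classifier such as the SVM mentioned in the abstract; that step is omitted here.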