Describing Surveillance Videos Including Combined Activities using Various Sentences

Paryabi, Faezeh | 2024

  1. Type of Document: M.Sc. Thesis
  2. Language: Farsi
  3. Document No: 56900 (05)
  4. University: Sharif University of Technology
  5. Department: Electrical Engineering
  6. Advisor(s): Behroozi, Hamid; Mohammadzadeh, Narjesolhoda
  7. Abstract:
  8. Surveillance systems play an important role in the modern world. Nowadays, CCTV cameras are installed in many places to monitor various events, and they produce video data in very large volumes. A main challenge in this field is analyzing the content of these videos and summarizing and storing them in compressed formats, such as text, to save storage space. With the advancement of computing tools and the success of deep learning algorithms on many problems, such as object detection, human action recognition, and machine translation, many efforts have been made to describe video content. Most of these methods describe open-domain videos, and only a limited number focus on specific areas such as surveillance videos. One main reason is the scarcity of datasets in this field, owing to people's privacy and the cost of the labeling process. Models trained on open-domain videos provide very general descriptions of the scene, while the analysis of surveillance videos requires attention to detail. In this research, a method is proposed that, despite the lack of access to surveillance video-text datasets, provides detailed descriptions of the pedestrians present in surveillance videos. In this method, people are first identified by an object detection network, and an ID is assigned to each of them by a tracking algorithm; then, pre-trained models for action recognition and person attribute recognition extract their behavioral and appearance characteristics. Finally, a sentence generation algorithm produces a descriptive sentence for each individual in the output. A qualitative and quantitative comparison has been made between the proposed method and methods that use language models trained on open-domain video-text datasets to generate sentences.
In the quantitative comparison, the performance of the different methods in recognizing a set of actions and attributes of individuals is evaluated. With the presented method, the F1 scores for recognizing sets of actions and attributes reach 69.05 and 84.92, respectively. The results indicate that, thanks to its separate modules for action and attribute recognition, the proposed method provides more detail than the other methods.
  9. Keywords:
  10. Video Captioning ; Object Detection ; Tracking ; Action Recognition ; Ip-Based Cameras ; Surveillance System ; Attribute Recognition
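The pipeline described in the abstract (detect people, track them with IDs, recognize actions and attributes, then generate one descriptive sentence per person) could be sketched as below. This is a minimal illustration, not the thesis's actual implementation: the class and function names and the sentence template are hypothetical, and the detection, tracking, and recognition modules are assumed to have already produced per-person labels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pedestrian:
    """One tracked person, as assembled by upstream modules (assumed given)."""
    pid: int              # ID assigned by the tracking algorithm
    actions: List[str]    # labels from the action-recognition module
    attributes: List[str] # labels from the attribute-recognition module


def generate_sentence(p: Pedestrian) -> str:
    """Template-based sentence generation for one person (illustrative template)."""
    attrs = ", ".join(p.attributes) if p.attributes else "unknown appearance"
    acts = " and ".join(p.actions) if p.actions else "standing"
    return f"Person {p.pid} ({attrs}) is {acts}."


def describe_video(people: List[Pedestrian]) -> List[str]:
    """One descriptive sentence per tracked pedestrian in the video."""
    return [generate_sentence(p) for p in people]


if __name__ == "__main__":
    people = [
        Pedestrian(1, ["walking"], ["wearing a red jacket", "carrying a backpack"]),
        Pedestrian(2, ["running"], ["wearing jeans"]),
    ]
    for sentence in describe_video(people):
        print(sentence)
```

A rule-based generator like this, as opposed to a language model trained on open-domain video-text pairs, is what lets the method surface the specific actions and attributes its dedicated modules recognize.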