Vision Based Human Action Recognition using Deep Learning

Mokari, Mozhgan; Haj Sadeghi, Khosrow

Mokari, Mozhgan | 2025

Type of Document: Ph.D. Dissertation
Language: Farsi
Document No: 58396 (05)
University: Sharif University of Technology
Department: Electrical Engineering
Advisor(s): Haj Sadeghi, Khosrow
Abstract:
Temporal action localization (TAL) in untrimmed videos remains a significant challenge in computer vision: accurately identifying the temporal boundaries of actions and classifying them is difficult, and no optimal solution has been proposed thus far. This thesis introduces two methods to address this challenge and improve TAL performance. The first is an end-to-end neural network that uses error estimation to achieve precise action localization. It improves temporal localization and action classification by optimizing the network structures simultaneously. To improve the accuracy of temporal intervals, a regression-based module is incorporated into the unified network to estimate time boundary errors and refine the intervals. Evaluations on the THUMOS 14 and ActivityNet-v1.3 datasets demonstrate the effectiveness of the method, which remains simple and requires neither additional data nor complex architectures. The improvement is particularly notable for intervals with high overlap, which demand precise temporal estimates. The method performs especially well in localizing challenging activities within the complex and diverse ActivityNet-v1.3 dataset. For instance, for the activity "drinking coffee", the mean Average Precision (mAP) achieved is five times higher than the best reported result. The second method introduces SeqAttNet, an optimized and efficient structure that leverages attention-based mechanisms, 3D input aggregation, temporal attention networks, and a two-dimensional sequential architecture. SeqAttNet achieves an 87% improvement in efficiency while maintaining competitive performance with a network that is 70 times smaller. Furthermore, the number of training epochs required to reach optimal results is reduced by 50%. The method significantly enhances efficiency, demonstrating twice the effectiveness of leading models such as TriDet. These advancements represent a step forward in the development of TAL methods, offering solutions that balance high accuracy and computational efficiency and are therefore suitable for practical applications in resource-constrained environments.
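The idea of refining an interval with estimated boundary errors can be illustrated with a minimal sketch. This is not the thesis's network. The function names, the sign convention for the error (predicted boundary minus true boundary), and the sample numbers are assumptions chosen only to show why a small boundary correction matters when high overlap is required.

```python
def temporal_iou(a, b):
    """Temporal IoU between two (start, end) intervals, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def refine_interval(proposal, est_error):
    """Correct each boundary by its estimated error.

    Assumed convention (hypothetical): est_error = predicted - true,
    so the refined boundary is predicted - est_error.
    """
    start = proposal[0] - est_error[0]
    end = proposal[1] - est_error[1]
    return (start, end)


# Illustrative values only, not results from the thesis.
ground_truth = (10.0, 20.0)
proposal = (11.0, 22.0)
estimated_error = (1.0, 2.0)  # what a regression module might output

refined = refine_interval(proposal, estimated_error)
iou_before = temporal_iou(proposal, ground_truth)
iou_after = temporal_iou(refined, ground_truth)
print(iou_before, iou_after)
```

In this toy case the unrefined proposal would be rejected at a strict overlap threshold such as 0.9, while the refined one would pass, which is the regime the abstract highlights.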
Keywords:
Temporal Action Localization ; Activity ; Attention Networks ; Human Action Recognition ; Deep Learning ; Efficient Architectures ; ActivityNet Dataset
