Evaluation of Performance and Power Improvement Methods for Inference in Deep Neural Network-based Speech-to-Text Conversion on Mobile Devices
Katebi, Hossein | 2022
- Type of Document: M.Sc. Thesis
- Language: Farsi
- Document No: 55561 (19)
- University: Sharif University of Technology
- Department: Computer Engineering
- Advisor(s): Goudarzi, Maziar
- Abstract:
- Automatic Speech Recognition (ASR) systems are a core component of the personal assistants on mobile phones. Because of their time-dependent nature, however, ASR systems are computation- and memory-intensive. Mobile devices, on the other hand, use low-power designs to extend battery life and improve user experience, which makes them a poor fit for heavy workloads such as ASR. For instance, running inference on a 60-second audio file with DeepSpeech, a well-known open-source speech recognition system, takes 49 seconds on a desktop PC, while a mobile phone with an ARM64 architecture takes 92 seconds on the same input.

  In recent years, however, there have been significant breakthroughs in optimizing Deep Neural Networks (DNNs) for low-power devices. These methods look promising, and some report substantial improvements. They fall into three major categories: first, methods applied during the training phase; second, methods that require fine-tuning after being applied to the model but do not need to be involved in training; and third, methods that require no fine-tuning and can be applied directly to a trained model. In this research, we focus on the third category: applying these methods to an existing model requires little extra time and few resources, and some of them are already implemented in the most popular DNN frameworks, which eases their use on mobile devices.

  We evaluated the effect of applying several well-known methods of this kind to DeepSpeech. The selected methods are:
  - Quantizing weights to 8-bit integers.
  - Quantizing weights and activations to 8-bit integers.
  - Quantizing weights to 16-bit floating-point numbers.

  To this end, we applied each selected method to the DeepSpeech model and profiled its performance, power, and time consumption.
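  The post-training weight quantization scheme listed above can be sketched as follows. This is a minimal NumPy illustration of the general affine (scale/zero-point) int8 scheme, not the thesis's implementation; real frameworks such as TensorFlow Lite choose the scale and zero point per tensor or per channel, and the helper names here are hypothetical.

  ```python
  import numpy as np

  def quantize_int8(w):
      """Affine post-training quantization of a float tensor to int8.

      Hypothetical helper for illustration only: maps [w_min, w_max]
      onto the int8 range [-128, 127] with a single per-tensor scale.
      """
      w_min, w_max = float(w.min()), float(w.max())
      scale = (w_max - w_min) / 255.0 or 1.0  # avoid scale 0 for constant tensors
      zero_point = int(round(-w_min / scale)) - 128  # w_min maps to -128
      q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
      return q, scale, zero_point

  def dequantize_int8(q, scale, zero_point):
      # Reconstruct approximate float weights from the int8 codes.
      return (q.astype(np.float32) - zero_point) * scale

  rng = np.random.default_rng(0)
  w = rng.normal(size=(4, 4)).astype(np.float32)
  q, s, z = quantize_int8(w)
  w_hat = dequantize_int8(q, s, z)
  max_err = float(np.abs(w - w_hat).max())  # bounded by roughly one scale step
  ```

  Storing int8 codes instead of float32 weights cuts weight memory 4x, which is where most of the inference speed-up on memory-bound mobile CPUs comes from.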
  We observed that 8-bit integer quantization costs only a negligible loss in accuracy but improves the model's performance by up to 75%, which allows the model to run in real time on mobile devices. Next, we explored the time breakdown of model inference and found that, even with 8-bit quantization applied, memory access is still responsible for over 50% of the model's execution time. Finally, we compared the profiled methods' execution from several aspects.

  We then proposed four methods and memory layouts for efficiently running models with sub-byte (4-, 2-, and 1-bit) weights on mobile devices. The method that uses 2-bit weights gains a 40% end-to-end inference speed-up over the best existing method (8-bit integer weights).
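  The abstract does not specify the proposed sub-byte memory layouts, but the basic idea of sub-byte weight storage can be sketched as packing four 2-bit codes per byte and unpacking them with shifts and masks at inference time. The helper names below are hypothetical, and the LSB-first layout is one assumed choice among several.

  ```python
  import numpy as np

  def pack_2bit(codes):
      """Pack an array of 2-bit codes (values 0..3) four per byte, LSB first.

      Minimal sketch of sub-byte weight storage; assumes the code count
      is a multiple of 4 (pad in practice).
      """
      codes = np.asarray(codes, dtype=np.uint8)
      assert codes.size % 4 == 0, "pad to a multiple of 4 in practice"
      c = codes.reshape(-1, 4)
      return (c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)).astype(np.uint8)

  def unpack_2bit(packed, n):
      """Recover the first n 2-bit codes from the packed byte array."""
      p = np.asarray(packed, dtype=np.uint8)
      out = np.empty(p.size * 4, dtype=np.uint8)
      for i in range(4):
          out[i::4] = (p >> (2 * i)) & 0b11  # extract code i of each byte
      return out[:n]

  codes = np.array([0, 1, 2, 3, 3, 2, 1, 0], dtype=np.uint8)
  packed = pack_2bit(codes)          # eight 2-bit weights fit in 2 bytes
  restored = unpack_2bit(packed, 8)
  ```

  Compared with int8 storage, 2-bit packing reduces weight traffic by another 4x, which is consistent with the observation that memory access dominates inference time even after 8-bit quantization.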
- Keywords:
- Deep Learning ; Evaluation ; Speech Recognition ; Mobile Device ; Deep Neural Networks ; Power Reduction