Fault Tolerant Distributed Computing in Deep Neural Networks
Zare Ahangarkolaei, Mohammad | 2020
- Type of Document: M.Sc. Thesis
- Language: Farsi
- Document No: 53153 (05)
- University: Sharif University of Technology
- Department: Electrical Engineering
- Advisor(s): Aref, Mohammad Reza; Maddah Ali, Mohammad Ali
- Abstract:
- Nowadays, with the development of machine learning and deep learning on one hand, and the dramatic increase in the amount of available data and the complexity of models on the other, it is practically impossible to run learning algorithms on a single node. It is therefore inevitable to distribute learning algorithms across several machines. In a distributed system, the main operation is divided into smaller tasks performed by different nodes, and the final result is then computed by exchanging messages among them. In this thesis, a method is introduced for computing the gradient over a massive data set in a distributed system. In this problem, there are a number of worker nodes and a master node, and the master node intends to distribute the task of computing the gradient among the worker nodes. Distributing computations across different nodes also presents new challenges, among them straggler nodes and the high communication load between nodes. The data and computations should be distributed such that if some of the worker nodes are stragglers, the master node can still compute the final result from the responses of the non-straggler nodes. Moreover, since the data set is very large, the amount of information exchanged is a key concern. In this thesis, we focus on the case where the gradients can be represented by low-rank matrices. For this scenario, we propose a scheme that not only tolerates a certain number of stragglers but also exploits the rank deficiency of the gradient matrices to reduce communication (a sketch of the general gradient-coding idea appears after the keyword list). The proposed scheme outperforms the state of the art in terms of communication load and is optimal in some regimes.
- Keywords:
- Deep Learning; Distributed Computing; Network Coding; Large Scale Machine Learning; Fault Tolerance; Deep Neural Networks
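
To make the master/worker setup in the abstract concrete, here is a minimal sketch of the generic gradient-coding idea it builds on (in the style of the classic scheme of Tandon et al., 2017), not the thesis's specific low-rank scheme. It assumes 3 workers, each storing 2 of 3 data partitions, on a toy least-squares problem; all variable names and sizes are illustrative.

```python
# Minimal gradient-coding sketch: 3 workers, any 2 suffice, so one
# straggler is tolerated. This illustrates the generic idea only.
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: gradient of ||Xw - y||^2 / 2 is X^T (Xw - y).
X = rng.normal(size=(6, 4))
y = rng.normal(size=6)
w = rng.normal(size=4)

# Split the data into 3 partitions; g[i] is the partial gradient on partition i.
parts = np.split(np.arange(6), 3)
g = [X[p].T @ (X[p] @ w - y[p]) for p in parts]

# Encoding matrix B: worker k sends the combination sum_i B[k, i] * g[i].
# Rows are chosen so that ANY 2 of the 3 coded messages span [1, 1, 1].
B = np.array([[0.5, 1.0,  0.0],   # worker 1 sends g1/2 + g2
              [0.0, 1.0, -1.0],   # worker 2 sends g2 - g3
              [0.5, 0.0,  1.0]])  # worker 3 sends g1/2 + g3
coded = [sum(B[k, i] * g[i] for i in range(3)) for k in range(3)]

# Suppose worker 3 straggles: the master hears only from workers 1 and 2.
survivors = [0, 1]
# Decoding coefficients a solve a @ B[survivors] = [1, 1, 1].
a = np.linalg.lstsq(B[survivors].T, np.ones(3), rcond=None)[0]
full_grad = sum(a[j] * coded[k] for j, k in enumerate(survivors))

assert np.allclose(full_grad, X.T @ (X @ w - y))  # matches the true gradient
print("recovered gradient:", full_grad)
```

In this sketch each worker's message is a full-size gradient vector regardless of which workers respond; the thesis goes further by exploiting the low rank of the gradient matrices to shrink those messages and reduce the overall communication load.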