Fault Tolerant Distributed Computing in Deep Neural Networks
Zare Ahangarkolaei, Mohammad | 2020
- Type of Document: M.Sc. Thesis
- Language: Farsi
- Document No: 53153 (05)
- University: Sharif University of Technology
- Department: Electrical Engineering
- Advisor(s): Aref, Mohammad Reza; Maddah Ali, Mohammad Ali
- Abstract:
- Nowadays, with the development of machine learning and deep learning on one hand, and the dramatic increase in the amount of available data and the complexity of models on the other, it is practically impossible to run learning algorithms on a single node. It is therefore inevitable to distribute learning algorithms across several machines. In a distributed system, the main operation is divided into smaller tasks performed by different nodes, and the final result is then computed by exchanging messages among them. In this thesis, a method is introduced for computing the gradient over a massive data set in a distributed system. In this problem, there are a number of worker nodes and a master node, and the master node intends to distribute the task of computing the gradient among the worker nodes. Distributing computations across different nodes also presents new challenges, among them straggler nodes and the high communication load between nodes. The data and computations should be distributed such that if some of the worker nodes are stragglers, the master node can still compute the final result from the responses of the non-straggler nodes. Moreover, since the data set is very large, the amount of information exchanged is a key concern. In this thesis, we focus on the case where the gradients can be represented by low-rank matrices. For this scenario, we propose a scheme that not only tolerates a certain number of stragglers but also exploits the rank deficiency of the gradient matrices to reduce communication (a sketch of the general gradient-coding idea appears after the keyword list). The proposed scheme outperforms the state of the art in terms of communication load and is optimal in some regimes.
- Keywords:
- Deep Learning; Distributed Computing; Network Coding; Large Scale Machine Learning; Fault Tolerance; Deep Neural Networks
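
To make the master/worker setup in the abstract concrete, here is a minimal sketch of the generic gradient-coding idea it builds on (in the style of the classic scheme of Tandon et al., 2017), not the thesis's specific low-rank scheme. It assumes 3 workers, each storing 2 of 3 data partitions, on a toy least-squares problem; all variable names and sizes are illustrative.

```python
# Minimal gradient-coding sketch: 3 workers, any 2 suffice, so one
# straggler is tolerated. This illustrates the generic idea only.
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: gradient of ||Xw - y||^2 / 2 is X^T (Xw - y).
X = rng.normal(size=(6, 4))
y = rng.normal(size=6)
w = rng.normal(size=4)

# Split the data into 3 partitions; g[i] is the partial gradient on partition i.
parts = np.split(np.arange(6), 3)
g = [X[p].T @ (X[p] @ w - y[p]) for p in parts]

# Encoding matrix B: worker k sends the combination sum_i B[k, i] * g[i].
# Rows are chosen so that ANY 2 of the 3 coded messages span [1, 1, 1].
B = np.array([[0.5, 1.0,  0.0],   # worker 1 sends g1/2 + g2
              [0.0, 1.0, -1.0],   # worker 2 sends g2 - g3
              [0.5, 0.0,  1.0]])  # worker 3 sends g1/2 + g3
coded = [sum(B[k, i] * g[i] for i in range(3)) for k in range(3)]

# Suppose worker 3 straggles: the master hears only from workers 1 and 2.
survivors = [0, 1]
# Decoding coefficients a solve a @ B[survivors] = [1, 1, 1].
a = np.linalg.lstsq(B[survivors].T, np.ones(3), rcond=None)[0]
full_grad = sum(a[j] * coded[k] for j, k in enumerate(survivors))

assert np.allclose(full_grad, X.T @ (X @ w - y))  # matches the true gradient
print("recovered gradient:", full_grad)
```

In this sketch each worker's message is a full-size gradient vector regardless of which workers respond; the thesis goes further by exploiting the low rank of the gradient matrices to shrink those messages and reduce the overall communication load.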