Investigating Performance Bottlenecks for Efficient Implementation of MapReduce in Hadoop

Arabzadeh, Morteza | 2014

  1. Type of Document: M.Sc. Thesis
  2. Language: English
  3. Document No: 46136 (52)
  4. University: Sharif University of Technology, International Campus, Kish Island
  5. Department: Science and Engineering
  6. Advisor(s): Goudarzi, Maziar
  7. Abstract:
  8. The ever-increasing growth in the volume of information is an unprecedented phenomenon. Analyzing and storing such an enormous volume of information calls for innovative ideas capable of processing and managing it. One successful Apache project in this regard is Hadoop. Hadoop is a popular open-source implementation of the MapReduce processing scheme for the analysis of large datasets. The heart of Hadoop is MapReduce, a parallel programming model for data processing on clusters. To handle storage resources across the cluster, Hadoop employs a distributed user-level filesystem. The Hadoop Distributed File System (HDFS) is written in Java and is designed for portability across heterogeneous hardware and software platforms. In this thesis, we investigated the time spent in various parts of the system. To understand application behavior and pinpoint the bottlenecks for more efficient implementations, we simulated and analyzed several benchmarks according to their bottlenecks. We tuned Hadoop and carried out more detailed analyses using Ganglia, a well-known cluster monitoring tool. The parameters under consideration include the number of "Map" and "Reduce" tasks for applications that underutilize the CPU. Based on the results, the optimum values for these two parameters were equal to the number of nodes in the cluster. We also adopted a method that reduces the size of stored data by compressing the output files of the map phase, enabled through the compression parameter. We observed improvements in execution time and storage.
  9. Keywords:
  10. Cluster ; Parameter Tune ; Hadoop ; Map Reduce Processing ; Hadoop Distributed File System ; Ganglia Monitoring Tools
