
Performance Estimation of Processing Nodes in Big Data Computation

Azadjoo, Farhad | 2016

  1. Type of Document: M.Sc. Thesis
  2. Language: Farsi
  3. Document No: 48928 (19)
  4. University: Sharif University of Technology
  5. Department: Computer Engineering
  6. Advisor(s): Goudarzi, Maziar
  7. Abstract:
  8. Today, we are witnessing major changes in computer science. The rapid, exponential growth of data volumes, the increasing variety of data structures and formats, and the rising rate of data generation are among the most important shifts in the digital world. Because of these changes, traditional computing models and processing methods can no longer solve the new, large and sophisticated problems that arise; big data concepts and big data processing methods have emerged to address them. Many parallel and distributed processing platforms have been built for executing applications on huge volumes of data. Apache Spark and Hadoop are two powerful platforms, each with its own advantages and disadvantages. With the growth of data volume, velocity, variety and veracity, every engineer, company or organization with access to a source of valuable data needs big data processing facilities. On the other hand, not everyone can build a powerful processing cluster from high-end servers or rent large numbers of servers from data centers. Commodity computers such as desktop PCs are cheap compared to servers, so their use for building small and large clusters is expected to increase. When PCs are used as computing nodes, the importance of time and cost makes cluster performance and utilization a central concern. This is a difficult problem because the processing behavior of servers and PCs differs: servers have powerful CPUs with many cores, large memory capacity and better scalability, whereas PCs have limited memory and less powerful processing resources. In this project we study the processing behavior of a wide range of applications running on Apache Spark on top of desktop PCs, focusing on performance and utilization. We capture the behavior of the CPU, memory, disk and network when Spark workloads run on a single node and on a cluster of computing nodes, analyze the results, and identify bottlenecks. Finally, we suggest solutions for removing these bottlenecks. In addition, we propose a gray-box method for estimating the execution time of different workloads with high accuracy.
  9. Keywords:
  10. Performance ; Big Data ; Apache Spark ; Performance Estimation ; Performance Bottlenecks
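
The abstract above mentions a gray-box method for estimating workload execution time. As a purely illustrative aid (not the thesis's actual estimator), the following PySpark sketch shows one common gray-box approach: run the same workload on a few small input fractions, record the wall-clock time of each run, and fit a simple linear model to extrapolate the full-data execution time. The input path, the word-count workload and the sampled fractions are hypothetical placeholders.

    import time
    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("graybox-estimate").getOrCreate()

    def run_job(fraction):
        # Hypothetical workload: word count over a sampled fraction of the input file.
        start = time.time()
        lines = spark.read.text("hdfs:///data/corpus.txt").sample(fraction=fraction, seed=42)
        (lines.rdd
              .flatMap(lambda row: row.value.split())
              .map(lambda w: (w, 1))
              .reduceByKey(lambda a, b: a + b)
              .count())
        return time.time() - start

    # Measure a few small samples of the data (the "gray-box" observations).
    fractions = [0.01, 0.02, 0.05, 0.10]
    times = [run_job(f) for f in fractions]

    # Fit time = a * fraction + b and extrapolate to the full input (fraction = 1.0).
    a, b = np.polyfit(fractions, times, deg=1)
    print("estimated full-run time: %.1f s" % (a * 1.0 + b))

    spark.stop()

A model of this kind treats the platform largely as a black box while using a small amount of measured information about the workload itself, which is the general idea behind gray-box estimation; the thesis's own method and accuracy figures are described in the full text.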
