
Performance Estimation of Processing Nodes in Big Data Computation

Azadjoo, Farhad | 2016

  1. Type of Document: M.Sc. Thesis
  2. Language: Farsi
  3. Document No: 48928 (19)
  4. University: Sharif University of Technology
  5. Department: Computer Engineering
  6. Advisor(s): Goudarzi, Maziar
  7. Abstract:
  8. Today, we are witnessing major changes in computer science. The rapid, exponential growth of data volumes, the increasing variety of data structures and formats, and the rising rate of data generation are among the most important shifts in the digital world. Because of these changes, traditional computing models and processing methods can no longer solve the new, large and sophisticated problems that arise; big data concepts and big data processing methods have emerged to address them. Many parallel and distributed processing platforms have been built for executing applications on huge volumes of data. Apache Spark and Hadoop are two powerful platforms, each with its own advantages and disadvantages. With the growth of data volume, velocity, variety and veracity, every engineer, company or organization with access to a source of valuable data needs big data processing facilities. On the other hand, not everyone can build a powerful processing cluster from high-end servers or rent large numbers of servers from data centers. Commodity computers such as desktop PCs are cheap compared to servers, so their use for building small and large clusters is expected to increase. When PCs are used as computing nodes, the importance of time and cost makes cluster performance and utilization a central concern. This is a difficult problem because the processing behavior of servers and PCs differs: servers have powerful CPUs with many cores, large memory capacity and better scalability, whereas PCs have limited memory and less powerful processing resources. In this project we study the processing behavior of a wide range of applications running on Apache Spark on top of desktop PCs, focusing on performance and utilization. We capture the behavior of the CPU, memory, disk and network when Spark workloads run on a single node and on a cluster of computing nodes, analyze the results, and identify bottlenecks. Finally, we suggest solutions for removing these bottlenecks. In addition, we propose a gray-box method for estimating the execution time of different workloads with high accuracy.
  9. Keywords:
  10. Performance ; Big Data ; Apache Spark ; Performance Estimation ; Performance Bottlenecks
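
The abstract above mentions a gray-box method for estimating workload execution time. As a purely illustrative aid (not the thesis's actual estimator), the following PySpark sketch shows one common gray-box approach: run the same workload on a few small input fractions, record the wall-clock time of each run, and fit a simple linear model to extrapolate the full-data execution time. The input path, the word-count workload and the sampled fractions are hypothetical placeholders.

    import time
    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("graybox-estimate").getOrCreate()

    def run_job(fraction):
        # Hypothetical workload: word count over a sampled fraction of the input file.
        start = time.time()
        lines = spark.read.text("hdfs:///data/corpus.txt").sample(fraction=fraction, seed=42)
        (lines.rdd
              .flatMap(lambda row: row.value.split())
              .map(lambda w: (w, 1))
              .reduceByKey(lambda a, b: a + b)
              .count())
        return time.time() - start

    # Measure a few small samples of the data (the "gray-box" observations).
    fractions = [0.01, 0.02, 0.05, 0.10]
    times = [run_job(f) for f in fractions]

    # Fit time = a * fraction + b and extrapolate to the full input (fraction = 1.0).
    a, b = np.polyfit(fractions, times, deg=1)
    print("estimated full-run time: %.1f s" % (a * 1.0 + b))

    spark.stop()

A model of this kind treats the platform largely as a black box while using a small amount of measured information about the workload itself, which is the general idea behind gray-box estimation; the thesis's own method and accuracy figures are described in the full text.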
