Performance evaluation of a BigData analytics platform over Kubernetes

Dr. Simon Csaba
Department of Telecommunications and Media Informatics

Kubernetes is evolving rapidly and is being tried out and adopted in many fields. Kubernetes and container technologies have become very popular because containers make it convenient to encapsulate, deploy, and isolate applications, are lightweight to operate, and share resources efficiently and flexibly, without the need to install an operating system and all the necessary software in a virtual machine. In this thesis I review and demonstrate the available solutions for deploying and running Apache Spark applications over Kubernetes in a containerized environment. I present the different aspects and fundamentals of the solutions provided by different vendors, such as Amazon, Google, and Microsoft Azure. My task included designing a solution for running Spark applications over a Kubernetes cluster integrated with HDFS. To this end, I deployed Apache Spark over both a Hadoop cluster and a Kubernetes cluster and integrated it with HDFS. To run Spark on Kubernetes, I took advantage of the native support introduced in Spark 2.3: from this version on, Spark can run on Kubernetes by using the Kubernetes scheduler. The thesis also gives an overview of virtualized environments, container management frameworks, and Big Data analytics over Hadoop. I implemented and tested running Spark jobs in different deployments: Spark on YARN, Spark standalone, and Spark over Kubernetes. I conclude that, although running Spark over Kubernetes is still in its experimental phase, running Spark jobs over a Kubernetes cluster is fast and lightweight, and resources are shared efficiently within the cluster. Executors are placed either by the scheduler or according to a custom configuration, e.g. a node selector, which makes it easier to tune the deployment and customize it to the requirements.

