Apache Hive is an open source data warehouse platform originally built on top of Hadoop by Facebook. Hive makes the work of data scientists easier by introducing a language similar to SQL, called HiveQL. Hive takes a query written in HiveQL, does the parsing and analyzing and if possible, optimizes the query using several optimization strategies. Finally, it creates a Hadoop job and executes it on the platform. Currently, not only MapReduce can be used as an execution engine; Hive can even work on Apache Spark or Apache Tez.
Hive faces memory problems under certain circumstances. HiveServer2 is one of the main components of Hive and it often crashes due to Out Of Memory (OOM) errors. In my thesis, I aim to build a basic model to understand why and when the memory usage of HiveServer2 rises and find memory-related problems or wastes.
To be able to build a model, I created a basic tool for getting memory information and generating heap dumps automatically during the life of a query. With the help of my tool and other memory analysis tools, I generated and analyzed many heap dumps. I was able to identify two issues that can cause a lot of pressure on heap memory. One of the issues is introduced by HDFS (the distributed file system of Hadoop), therefore touching Hadoop was also necessary. For both memory issues, I suggested a possible solution and created an implementation that can help to eliminate memory overheads. These patches required multiple tests: local, simple performance testing and scalable benchmarking on a data center cluster. The issues identified are still ongoing and currently under investigation since the HDFS patch might introduce unexpected CPU overhead which is hard to detect, so deciding whether this tradeoff is negligible requires a thorough testing.