Development of distributed algorithms for Apache Hive

Supervisor:
Prekopcsák Zoltán
Department of Telecommunications and Media Informatics

Over the last decade, the capacity of electronic storage devices has grown substantially while their price has fallen steadily and significantly. As a result, businesses store more and more data in their everyday operations. To extract concrete, useful information, the hidden relationships within the raw data must be uncovered. Traditional data processing and analysis techniques are not, or not fully, capable of this. Even when a traditional process can produce the desired results, it may take so long that the results are no longer usable.

However, identifying the statistical and analytical algorithms whose existing implementations are unsuitable or inefficient for Big Data is not straightforward. In my thesis I present and implement several business intelligence (OLAP) and dimensionality reduction functions.

In this work I developed distributed versions of these dimensionality reduction algorithms and of an OLAP function, so that they can be used efficiently when the number of dimensions rules out traditional methods. To implement these techniques, I used the well-known Hadoop platform extended with a data warehouse infrastructure, Apache Hive. I used Hive's support for user-defined functions (UDFs) written in Java to produce the correlation matrix required for principal component analysis, and to compute the entropy and conditional entropy required for information gain, a widely used feature ranking index. In addition, I implemented the pivot OLAP operation on the Hive platform.
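To make the relationship between entropy, conditional entropy and information gain concrete, the following is a minimal single-machine sketch in plain Java. It is not the thesis implementation and does not use the Hive UDF API; the class name InfoGainSketch, its method names and the toy columns are assumptions chosen for illustration. It computes IG(Y; X) = H(Y) - H(Y | X) from value counts, which is the kind of aggregate a distributed version would first have to collect across the cluster.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: hypothetical names, in-memory data, no Hive API.
public class InfoGainSketch {

    /** Shannon entropy in bits of a distribution given as raw counts. */
    public static double entropy(Collection<Integer> counts) {
        long total = 0;
        for (int c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) {
                continue; // 0 * log(0) is taken as 0
            }
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    /** Information gain H(label) - H(label | attribute) for two aligned columns. */
    public static double informationGain(String[] attribute, String[] label) {
        if (attribute.length != label.length) {
            throw new IllegalArgumentException("columns must have equal length");
        }
        // Counts of each label value, overall and within each attribute value.
        Map<String, Integer> labelCounts = new HashMap<>();
        Map<String, Map<String, Integer>> byAttribute = new HashMap<>();
        for (int i = 0; i < label.length; i++) {
            labelCounts.merge(label[i], 1, Integer::sum);
            byAttribute.computeIfAbsent(attribute[i], k -> new HashMap<>())
                       .merge(label[i], 1, Integer::sum);
        }
        // Conditional entropy: group entropies weighted by group size.
        double conditional = 0.0;
        for (Map<String, Integer> group : byAttribute.values()) {
            int size = 0;
            for (int c : group.values()) {
                size += c;
            }
            conditional += ((double) size / label.length) * entropy(group.values());
        }
        return entropy(labelCounts.values()) - conditional;
    }

    public static void main(String[] args) {
        // Toy data (made up): the attribute determines the label completely.
        String[] outlook = {"sunny", "sunny", "rain", "rain"};
        String[] play = {"no", "no", "yes", "yes"};
        System.out.println(informationGain(outlook, play));
    }
}
```

Because the computation depends only on counts per (attribute value, label value) pair, those counts can in principle be gathered with grouped aggregation before the entropies are evaluated; how the thesis actually distributes this work is not specified here.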

As the measurement results presented in the second part of the paper show, the traditional methods and some of Hive's built-in functions are practically unusable when the input data has more than a few hundred dimensions. My solutions, however, execute these tasks efficiently on data sets with thousands of attributes.
