High level data processing languages on Hadoop

Kazi Sándor Antal
Department of Telecommunications and Media Informatics

Today, the amount of data is growing by approximately 40 percent per year. The expression "big data", which has spread worldwide in the last few years, refers to complex, unstructured data. Analyzing such data can be of great benefit to corporations. Today's companies cannot be successful without exploiting the possibilities offered by "big data". In order to remain competitive, they have to process and explore huge amounts of data in unusually short times. They have to employ data scientists, because an everyday analyst is no longer enough to meet the new expectations. Data scientists use data models and algorithms to prepare strategic decisions, and they provide input for operational decisions such as pricing or calculating the quantity of future products.

These decisions are based on "big data" analysis, which cannot be handled by a simple database infrastructure designed for structured data, because of the sheer volume of the data.

Given the above, it is not surprising that there is important research focused on extracting data from unstructured datasets.

The Google File System and MapReduce were developed for this purpose. They provided a good basis for a variety of open-source software to appear. One of the most widely used frameworks based on this innovation is Hadoop, which currently consists of, and is supported by, a wide range of software and libraries. In this thesis, I compare two Hadoop-based tools that have an SQL-like interface. Apache Hive was created to make the Hadoop environment more convenient to use. The newest available version of Hive is currently 0.14. The other program in my comparison is Apache Drill, which is used in version 0.6.0. It is based on the latest Hadoop technologies, thanks to its continuous improvement and development.

