Analytics in Big Data Environment

Gáspár Csaba
Department of Telecommunications and Media Informatics

Text mining works on unstructured text data. The main goal of this thesis is to examine artificial (programming) languages with data mining and text mining methods. Both fields are recent, rapidly evolving areas of information technology, so many new methods and models are available.

A further aim of this thesis was to carry out the analysis in a big data environment. This means the technologies have to work on distributed systems, be scalable and fault tolerant, and the algorithms have to be well parallelized.

I used open source technologies exclusively. The data was stored in HDFS (Hadoop Distributed File System). Data analysis was carried out with Apache Spark, one of the youngest members of the Hadoop ecosystem. With these tools I created some baseline algorithms for classification. I then implemented more complex data mining methods and compared them with the baseline algorithms.

The first chapter is a short introduction. In the second, I summarize the currently available models, methods and algorithms of data mining and text mining. The third and fourth chapters present the main technologies used, Spark and Hadoop. The fifth chapter contains the implemented project: I describe the details of the algorithms used and analyze the final solutions. The machine learning was done with the Spark MLlib library. In chapter six I summarize the results of my work and the possibilities for further development.

