Text mining works on unstructured text data. The main goal of this thesis is to
examine artificial (programming) languages with data mining and text mining methods.
Both fields are recent, rapidly evolving areas of information technology, and as a
result many new methods and models are available.
A further aim of this thesis was to carry out the analysis in a big data environment.
This means that the technologies must work on distributed systems: they should be
scalable and fault tolerant, and the algorithms should parallelize well.
I used open source technologies exclusively. The data was stored in HDFS (Hadoop
Distributed File System). Data analysis was carried out with Apache Spark, one of
the youngest members of the Hadoop ecosystem. With these tools I first created some
baseline algorithms for classification. I then implemented more complex data mining
methods and compared them with the baseline algorithms.
The first chapter is a short introduction. In the second chapter I summarize the
currently available models, methods and algorithms of data mining and text mining.
The third and fourth chapters present the main technologies used, Spark and Hadoop.
The fifth chapter describes the implemented project: I explain the details of the
algorithms used and analyze the final solutions. The machine learning was done with
the Spark MLlib library. In chapter six I summarize the results of my work and the
possibilities for further development.