Over the past ten years, the world of data storage systems has undergone explosive growth: the amount of stored data has increased more than anyone could have imagined before. Supporting this extreme growth required the development of new distributed computer systems, which led to significant infrastructural and architectural differences between data storage and processing systems on the one hand and purely computational distributed systems on the other. Of course, they still have to fulfill their original goals: they must be scalable, reliable, and able to operate in as efficient a distributed way as possible.
Data mining aims to extract new, previously unknown knowledge from large available datasets. Nowadays, analysts have many user-friendly graphical tools for analyzing data, but these lack support for large datasets; in particular, they cannot handle very large (TB- or PB-scale) datasets at all.
The objective of my work is to introduce the reader to the world of popular distributed data storage and computational frameworks, presenting their advantages as well as their disadvantages. After describing RapidMiner, one of the most widely used free, open-source data mining tools, I will examine the integration possibilities between RapidMiner and an appropriate distributed system. I selected the Hadoop distributed system for my extension, and I will give a detailed description of the main decision steps of the integration process. In the last chapter, I will present performance results for the implemented extension, which clearly show that it is well suited to processing large-scale data and that the integrated distributed data mining algorithm can process huge amounts of data.
My final goal is to create a piece of software that provides very large-scale data processing support for RapidMiner. During development, I paid particular attention to applying advanced programming design patterns, so that the implementation of the software meets the expectations of today's high-level programming languages.