Enormous amounts of data are generated every day: consider, for example, daily Internet traffic, the weather measurements collected in a single day, or changes in bank share prices. As storage capacity grows, ever larger and more detailed data sets are retained. These data can hold valuable information for the companies involved, for example for analyzing shopping tendencies or building weather models. Processing such data requires automatic and efficient solutions.
Beyond a certain volume, storing and analyzing data in relational databases is no longer effective. This is why alternative solutions, which store and examine data in a different way, have started to spread. Today, the widely adopted approach for analyzing such data sets is the use of distributed systems.
The open-source Hadoop framework makes it possible to process very large data sets efficiently. The platform's file system is HDFS, which stores the data efficiently, while the MapReduce technology processes the data in a fast, distributed way.
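To illustrate the MapReduce model described above, the following minimal Python sketch simulates its three phases (map, shuffle, reduce) on a classic word-count task. The function names and the in-memory grouping are illustrative assumptions; in a real Hadoop job the framework itself distributes the map tasks, shuffles the intermediate key-value pairs across the cluster, and runs the reducers.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: combine all values collected for one key.
    return (word, sum(counts))

def run_wordcount(lines):
    # Shuffle phase (simulated in memory): group intermediate values
    # by key, as Hadoop does between the map and reduce stages.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())

print(run_wordcount(["big data big systems", "data"]))
# → {'big': 2, 'data': 2, 'systems': 1}
```

Because the mapper sees only one line at a time and the reducer only one key at a time, each phase can be parallelized independently, which is what lets Hadoop scale the same program from one machine to a large cluster.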
My paper introduces the Hadoop platform and the MapReduce programming model. In my work, I process a data set that has previously been analyzed with a variety of technologies, but not with MapReduce; by processing it in a Hadoop environment, the same data set is analyzed with a new technology. The effectiveness of selected data-processing algorithms is examined while the size of the input data is varied. Their scaling behavior is analyzed using appropriate metrics, and the performance of the corresponding algorithms is compared. In addition, a possible optimization, compressing the input data, is tested and its results are examined.