Over the last couple of years, more and more systems capable of storing large quantities of data have become popular. However, data on its own has little value for applications and businesses. It is the information, the trends and special phenomena behind the data, that is truly interesting. Data mining focuses on the discovery, identification and characterization of this information from the raw data.
Nevertheless, this road is often long and full of obstacles. We need to solve issues such as storing and structuring the data, reducing the input data set (sampling and noise filtering), transforming it, and trying out multiple algorithms to find the most suitable one; finally, we need to understand and evaluate the results. With the significant increase in the quantity of data in recent years, these issues have only intensified.
Moreover, open source projects have become more and more popular. They try to solve recurring problems by bringing together interested professionals in a given domain to create platforms that can solve those problems in a general fashion on a daily basis. For distributed file storage and processing, such systems are Apache Hadoop and HBase. In the field of machine learning (data mining), such a system is Apache Mahout, which itself builds on the previously mentioned technologies.