Databases are the primary means of storing data. It is therefore unsurprising that there is a need to acquire relevant metadata, such as relations between data fragments, and to extract new and useful information from these databases. Various machine learning techniques and statistical methods can be used for this purpose. As a result, data mining is becoming increasingly popular.
The motivation for my thesis is a data mining competition hosted on Kaggle.com. Participants are tasked with using the aforementioned techniques to process raw data provided by Bosch. The goal is to design a solution capable of deciding whether or not an item is defective. The project should follow the general workflow of a data mining process: exploring and preparing the data, implementing machine learning models, and evaluating the results of these models. The raw data on the products was captured during manufacturing in an industrial environment. The end goal is to filter out faulty items as early as possible in order to reduce the production cost of machine parts.
In my thesis I describe the process of solving this data mining project. The underlying logic of the solution, the understanding and manipulation of the data, the modeling, and the evaluation of the results are all explained in detail. Furthermore, I provide information on the Python programming language and the external libraries I used to implement my design.