The aim of data mining is to retrieve hidden information from the massive amounts of data that are accumulating at an accelerating pace. This information can be very valuable in business. Logistic regression is a common method for predicting the probability that an event occurs. It bases the prediction on other variables whose values are known, called attributes.
Analyzing and preprocessing the input data can enhance the predictive ability of the model. Discretization, or binning, is a frequent preprocessing task. It divides the range of a continuous variable into discrete categories. Besides making the data easier to understand, it can increase the accuracy of the model.
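As a minimal illustration of what binning means, the sketch below splits the range of a continuous variable into equal-width intervals and maps each value to the index of its interval. Equal-width binning is only one of the simplest schemes and is not the method proposed in this thesis. The function names and the sample `ages` data are hypothetical.

```python
from bisect import bisect_right

def equal_width_cut_points(values, n_bins):
    """Return the n_bins - 1 interior cut points that split the range evenly."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [low + width * i for i in range(1, n_bins)]

def assign_bins(values, cut_points):
    """Map each value to the index of the bin it falls into (0-based)."""
    return [bisect_right(cut_points, v) for v in values]

# Hypothetical sample of a continuous attribute.
ages = [18, 22, 25, 31, 38, 44, 52, 67]
cuts = equal_width_cut_points(ages, 3)
bins = assign_bins(ages, cuts)
```

Here the two cut points partition the range from 18 to 67 into three intervals, and `bins` holds the category that replaces each original value before the model is fitted.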
Many algorithms are known for solving the binning problem. The topic of this thesis is discretization with a genetic algorithm. Genetic algorithms are remarkably robust and are most useful when the search space is huge and the problem is complex. In such cases, of which discretization is one, a nearly optimal solution is acceptable.
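To make the idea concrete, the following is a minimal sketch of how a genetic algorithm can search for cut points. It is not the algorithm developed in this thesis. It makes several assumptions: an individual is a sorted list of cut points, fitness is the size-weighted class entropy of the resulting bins (lower is better), and the operators are tournament selection, uniform crossover, Gaussian mutation, and elitism. All names, parameter values, and the synthetic data are hypothetical.

```python
import math
import random

def weighted_entropy(xs, ys, cuts):
    """Size-weighted binary class entropy of the bins defined by cuts."""
    bins = {}
    for x, y in zip(xs, ys):
        idx = sum(x >= c for c in cuts)          # index of the bin containing x
        n, pos = bins.get(idx, (0, 0))
        bins[idx] = (n + 1, pos + y)
    total = 0.0
    for n, pos in bins.values():
        p = pos / n
        if 0 < p < 1:                            # pure bins contribute zero entropy
            h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
            total += n / len(xs) * h
    return total

def mutate(child, low, high, rng, rate=0.2):
    """Perturb each cut point with probability rate, clamped to the data range."""
    sigma = (high - low) * 0.05
    return [min(high, max(low, g + rng.gauss(0, sigma))) if rng.random() < rate else g
            for g in child]

def evolve_cuts(xs, ys, n_cuts=2, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    low, high = min(xs), max(xs)

    def fitness(cuts):
        return weighted_entropy(xs, ys, cuts)

    pop = [sorted(rng.uniform(low, high) for _ in range(n_cuts))
           for _ in range(pop_size)]
    history = []                                 # best fitness per generation
    for _ in range(generations):
        pop.sort(key=fitness)
        history.append(fitness(pop[0]))
        next_pop = [pop[0]]                      # elitism: keep the best individual
        while len(next_pop) < pop_size:
            a = min(rng.sample(pop, 3), key=fitness)   # tournament selection
            b = min(rng.sample(pop, 3), key=fitness)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            next_pop.append(sorted(mutate(child, low, high, rng)))
        pop = next_pop
    best = min(pop, key=fitness)
    history.append(fitness(best))
    return best, history

# Synthetic data: the event occurs only for attribute values in [30, 70).
xs = list(range(100))
ys = [1 if 30 <= x < 70 else 0 for x in xs]
best, history = evolve_cuts(xs, ys)
```

Because the best individual is always carried over, the best fitness in `history` never gets worse from one generation to the next. The search returns a good solution rather than a guaranteed optimum, which matches the goal stated above.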
In this thesis, after summarizing the related fields, I propose a discretization algorithm based on genetic principles. I describe the details of the algorithm and the reasons behind my design decisions.
I implemented the algorithm in a SAS environment and performed numerous tests with the program. I describe the testing methods, the aspects examined, and the observed results. According to the tests, the algorithm and its implementation fulfill the requirements: they produce a nearly optimal discretization in acceptable time, and this discretization measurably improves the accuracy of the logistic regression model. I also propose new ways to improve the method.
The implementation can be used to bin the variables of any dataset. Based on the tests, I make recommendations for setting its parameters.