How Fresh Data Affects Machine Learning Algorithms

Gáspár Csaba
Department of Telecommunications and Media Informatics

The goal of this thesis is to predict future energy consumption accurately, based on an energy market data set. The most straightforward way to solve this task is through data mining. The first step is to prepare the data: add new variables and filter the records, so that the resulting data set is a valid input for a machine learning algorithm. I use a regression model called "Gradient Boosting Regressor". Using only data from the past, the model is trained to predict the energy usage of the next day. It is becoming increasingly common for companies to be able to revise their earlier predictions within the day. This means that a model which adjusts the predictions using fresh data (the data generated since the last estimation) could give a more precise estimate of energy consumption. I build such a model in this thesis and show that the mean absolute percentage error drops from the initial 2.5% to around 1%. Besides the machine learning techniques, I cover another method, called "similar days", which aims to solve the same task without machine learning. In this case the mean absolute percentage error is slightly higher, around 3%. Finally, I compare the two methods and examine how fresh data affected their results. The comparison shows that machine learning is worth using, because it provides a more precise estimate. For the "similar days" method, including fresh data in the modelling process does not necessarily lower the percentage error.
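Since the results above are reported as mean absolute percentage error (MAPE), a minimal sketch of the metric may help in reading them. The function name `mape` and the sample values are assumptions made for illustration; they are not taken from the thesis or its data set.

```python
def mape(actual, predicted):
    """Return the mean absolute percentage error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        # The ratio |a - p| / |a| is undefined when the actual value is zero.
        raise ValueError("actual values must be non-zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Made-up consumption values (illustrative only, not thesis data).
actual = [100.0, 200.0, 400.0]
day_ahead = [104.0, 196.0, 390.0]  # hypothetical forecast made the day before
intraday = [101.0, 198.0, 396.0]   # hypothetical forecast revised with fresh data

print(f"day-ahead MAPE: {mape(actual, day_ahead):.2f}%")
print(f"intraday MAPE:  {mape(actual, intraday):.2f}%")
```

In this toy example the revised forecast has the lower error, which mirrors the direction of the improvement described above without reproducing the thesis figures.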

