Processing the available information and making use of the resulting knowledge have become increasingly important for companies. Because technological development has made data storage cheap, even data that was previously regarded as irrelevant is now stored and used. In addition, data can now be retained for several years instead of only a few months, as before. However, such a vast amount of data cannot be analyzed effectively with traditional queries on relational databases. These have therefore been succeeded by applications written on data transformation frameworks for distributed systems and by distributed databases.
These applications must run automatically, so subsystems responsible for launching them are needed. Since the applications draw from a shared resource pool, it must be examined how many of them can run simultaneously and how the resources can be used most effectively.
Over the course of my work, I have implemented scheduler logic for distributed systems built on a Big Data technology stack; it is responsible for optimal resource usage based on the information available at scheduling time.
To test this, I created a simulation process that produces statistical data of the kind that would have been collected during previous runs of the applications. I tested the scheduling logic on this data and used it to tune the parameters of the machine learning algorithms. Finally, using the tuned machine learning algorithms, I examined the effectiveness of the system and compared it to both a primitive scheduling logic and perfect predictions. Based on the tests, my proposed solution uses on average 51% fewer resources than the primitive method.