Nowadays, the volume of data generated by the world's IT infrastructures is growing rapidly. To handle such enormous amounts of data in a reasonable time, efficient processing techniques must be devised and applied. For that purpose, cloud resources and parallel approaches are widely used.
Stream processing is becoming the obvious choice for handling continuously generated data in a distributed manner whenever (near) real-time results are required or the incoming data can easily outgrow the available storage capacity. There is also a common need to enhance such processing with advanced data analysis languages like R or Python.
This thesis surveys some of the available stream processing frameworks, the basic challenges of stream processing, and common streaming algorithms, and presents the implementation of a cloud-hosted stream processing application that uses data analysis languages in several processing steps. Finally, it reviews possible solutions for integrating a common stream processing framework into a data analysis environment in order to access the tools such an environment provides for thorough analysis.
The goal of this thesis was to explore the capabilities of a common stream processing framework by implementing a distributed application that works with a large amount of real-life network data.