Distributed real-time stream processing with Samza

Prekopcsák Zoltán
Department of Telecommunications and Media Informatics

The growing amount of data produced worldwide can no longer be processed with well-established single-node systems, because the performance of individual servers cannot keep pace with the rate of data growth. Extracting the business value hidden deep inside datasets requires new methods and technologies that coordinate work effectively across many computers. Complex distributed frameworks have been built to solve this hard problem.

Some proprietary tools offer fast and easy development and deployment, but only the largest multinational companies can afford them. Smaller enterprises also need data processing solutions, so they must choose from the open source frameworks. Using and deploying these systems, however, is much harder.

In this young field of computer science, processing data in real time is a particularly big challenge. This thesis gives an overview of the available open source, distributed, real-time systems and provides a prototype built on one of them, Apache Samza. An easily understandable and flexible query format is an important requirement for my implementation.

To satisfy this requirement, the Esper event processing system was integrated with Samza. During the design phase, two principal elements of the data processing were identified and placed in two separate applications. Measuring the performance of the prototype is an essential part of this thesis, and a separate module was developed for this purpose.

Using my implementation, I examined several performance-related parameters of Samza on a multi-node cluster. The measured throughput and latency values are adequate: the solution can reach a processing rate of 90,000 events per second per node. During the tests, optimal settings were determined for many of Samza's configuration parameters.

The thesis concludes that Apache Samza is competitive with the other available systems. Once its basic concepts are understood and some experience has been gained, developing with it is simple and fast. It is an appropriate tool not only for the largest multinational corporations but also for small and medium-sized enterprises.

