In this thesis I devise data integration (ETL) tools for distributed computing environments. I describe a prototype distributed data integration tool and show how data preprocessing and cleansing can be solved efficiently for real-life web analytics and anomaly detection tasks. My software tool can be flexibly configured via XML, which hides the low-level streaming layer from the developer and eases the programming of streaming analytics tasks.
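To illustrate the idea of such a configuration, the sketch below shows what a declarative ETL process description might look like. The element and attribute names here are hypothetical illustrations of the approach, not the tool's actual schema.

```xml
<!-- Hypothetical sketch of an XML-configured ETL process;
     element names are illustrative, not the tool's actual schema. -->
<process name="weblog-cleansing">
  <!-- Input: the streaming layer is hidden behind a declarative source. -->
  <source type="stream" topic="weblog"/>
  <!-- Per-record cleansing and filtering steps, applied in order. -->
  <transform>
    <trim fields="host,uri"/>
    <filter field="status" match="\d{3}"/>
  </transform>
  <!-- Output target for the cleansed records. -->
  <target type="database" table="clean_weblog"/>
</process>
```

A configuration in this style lets the developer specify cleansing logic without writing code against the streaming layer's API.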
In recent years, interest has turned towards distributed analytics of the kind addressed in this thesis. However, no detailed software-level description of streaming analytics systems has been published. In my work I describe the design and implementation of a distributed data integration tool built on top of existing software, and examine its performance through experiments over a real-life insurance web log data set.
In my experiments I build my own Longneck data transform tool on top of Twitter Storm as the data streaming layer. I feed the system with 11 months of data consisting of over 100 million records. I observe near-linear scaling in the number of processing nodes. The system is capable of processing nearly 100,000 records per second, hence it is capable of real-time analytics even for the highest-traffic Web portals.
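The per-record cleansing the pipeline performs can be sketched in plain Java. This is a minimal, self-contained illustration of the kind of transform step chained together in such a tool; the class and field names are my own assumptions, not the tool's actual API, and no Storm dependency is shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-record cleansing step of the kind a streaming
// ETL pipeline chains together; names are illustrative only.
public class LogRecordCleanser {

    // Cleanses one raw web-log record (a field-name -> value map):
    // trims whitespace, lower-cases the host field, drops empty fields.
    public static Map<String, String> cleanse(Map<String, String> raw) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            String v = e.getValue() == null ? "" : e.getValue().trim();
            if (v.isEmpty()) continue;               // drop empty fields
            if (e.getKey().equals("host")) {
                v = v.toLowerCase();                 // normalize host names
            }
            out.put(e.getKey(), v);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("host", "  Example.COM ");
        raw.put("referrer", "");
        raw.put("status", "200");
        System.out.println(cleanse(raw));
        // prints {host=example.com, status=200}
    }
}
```

In the deployed system a step like this would run inside a Storm bolt, with the streaming layer handling partitioning of records across processing nodes.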