The collection of personal data is becoming more widespread in our everyday lives due to continuous advances in technology. As society digitalizes, more and more of our personal data becomes available on the internet: everything from simple software telemetry to patient records. With the growing influence of data collectors, the right to personal privacy and data protection is also garnering more attention. Nothing highlights this better than the GDPR (General Data Protection Regulation), a new regulation by the European Union aimed at providing a clear framework for data protection and privacy. One possible solution to the privacy problem is anonymizing the collected data. While this works well when the complete dataset is available upfront, applying it to a continuous stream of incoming data is more difficult. Naive implementations lead to both poor performance and suboptimal quality of the anonymized data.
In this thesis, I will first present some of the most widespread anonymization methods, then present an incremental version of the Mondrian algorithm that is capable of processing continuously incoming data. I will also examine the possibilities of parallelizing the incremental algorithm and will present a way to persist the results of the anonymization. The incremental method comes close to the traditional solution in the quality of the anonymized data and performs significantly better on streaming data.
I believe that the presented algorithm can become an important part of any system that must collect data continuously while respecting its users’ personal privacy.