Data anonymization in big data systems

Supervisor:
Dr. Dudás Ákos
Department of Automation and Applied Informatics

Nearly every piece of information has value today, so large technology companies try to collect as much data as they can. This is not a problem as long as the data stays safely inside the company's closed systems. However, as soon as a company wants to share the data with a third party, it must make sure that no private information about its users is leaked. This is where anonymization comes in.

In this thesis, we survey various anonymization techniques and implement a concrete anonymization algorithm called Mondrian. The complete dataset is usually not available at any single point in time, either because there is too much data to store or because the data is gathered continuously and part of it will only arrive in the future. We therefore examine two different techniques for anonymizing continuously flowing data.
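To give a feel for the kind of algorithm involved, the following is a minimal sketch of Mondrian-style partitioning for k-anonymity. It is not the implementation described in the thesis. It assumes purely numeric quasi-identifiers and a strict median split, and the names `mondrian`, `generalize`, and the sample records are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

Record = Tuple[float, ...]          # one row of numeric quasi-identifiers (assumption)
Box = Tuple[Tuple[float, float], ...]


def mondrian(records: List[Record], k: int) -> List[List[Record]]:
    """Recursively split records into groups of at least k members each."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # A group with fewer than 2k records cannot be split into two valid halves.
    if len(records) < 2 * k:
        return [records]

    dims = range(len(records[0]))
    # Try the attribute with the widest value range first.
    by_span = sorted(
        dims,
        key=lambda d: max(r[d] for r in records) - min(r[d] for r in records),
        reverse=True,
    )
    for d in by_span:
        ordered = sorted(records, key=lambda r: r[d])
        median = ordered[len(ordered) // 2][d]
        left = [r for r in ordered if r[d] < median]
        right = [r for r in ordered if r[d] >= median]
        # Accept the cut only if both sides still satisfy k-anonymity.
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    # No attribute allows a valid cut, so this group is final.
    return [records]


def generalize(group: List[Record]) -> List[Box]:
    """Replace every value in a group by the (min, max) range of that group."""
    dims = range(len(group[0]))
    box = tuple((min(r[d] for r in group), max(r[d] for r in group)) for d in dims)
    return [box for _ in group]


if __name__ == "__main__":
    # Hypothetical (age, zip-like code) records.
    sample = [(25, 1000), (27, 1010), (30, 1020), (45, 2000), (47, 2010), (52, 2020)]
    for group in mondrian(sample, k=3):
        print(generalize(group))
```

Each resulting group contains at least k records that become indistinguishable after generalization. A batch algorithm like this needs all records up front, which is why continuously flowing data calls for separate techniques.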

To assess how much information is lost under the various anonymization methods, we evaluate the anonymized datasets using several metrics, some of which are defined in this document.

The final product of this project is a software tool that can anonymize a single batch of data as well as continuously flowing data, and that can be run with different settings.
