Recently, the amount of data generated daily has started to grow rapidly, and the field known as Big Data has emerged to store and manage these large data sets efficiently. On the Internet, practically everything is measurable and can be logged. From this observable data, we can easily answer some interesting questions: who visited a website, from where and when, or what kind of online service they used. Someone's location can also be inferred from their communication.
At the same time, large companies are paying more attention to protecting customer data, since it is sensitive with respect to individual rights. Still, it may be necessary to transfer this data to third-party companies, and this is where data anonymization comes into the picture. During this process, the sensitive customer data is encrypted or removed.
In my thesis, I would like to show existing methods for managing and anonymizing large data sets efficiently and safely. I will examine both the theoretical and the technical aspects of this problem. Then, using this knowledge, I will develop a method for the anonymization task. Throughout, I have to keep in mind that the results of analyses carried out on the anonymized data must remain useful in the source system. The anonymization must therefore be reversible, so that the source system can process its own data using these results.
My goal is to develop an implementation-independent method that can be realized with different technologies. To demonstrate this method, I will build my own anonymization tool on a selected Big Data platform.
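To make the reversibility requirement concrete, the following is a minimal sketch of one possible approach, token-based pseudonymization with a lookup table kept in the source system. It is only an illustration under stated assumptions and not necessarily the method developed in this thesis: the class `ReversiblePseudonymizer`, the function `anonymize_record`, and the field names in the usage example are all hypothetical.

```python
import secrets


class ReversiblePseudonymizer:
    """Hypothetical helper: swaps sensitive values for random tokens and
    keeps the mapping so that the source system can reverse the swap."""

    def __init__(self):
        self._forward = {}   # original value -> token
        self._backward = {}  # token -> original value

    def anonymize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to the
        # same token; this keeps joins and group-bys meaningful.
        if value not in self._forward:
            token = secrets.token_hex(8)
            while token in self._backward:  # extremely unlikely collision
                token = secrets.token_hex(8)
            self._forward[value] = token
            self._backward[token] = value
        return self._forward[value]

    def restore(self, token: str) -> str:
        # Only possible where the mapping table is available,
        # i.e. in the source system (an assumption of this sketch).
        return self._backward[token]


def anonymize_record(record, sensitive_fields, pseudonymizer):
    """Return a copy of the record with the sensitive fields tokenized."""
    return {
        key: pseudonymizer.anonymize(value) if key in sensitive_fields else value
        for key, value in record.items()
    }


# Usage with made-up field names and values.
pseudonymizer = ReversiblePseudonymizer()
record = {"customer_id": "C-1001", "city": "Budapest", "visits": 3}
shared = anonymize_record(record, {"customer_id"}, pseudonymizer)
original_id = pseudonymizer.restore(shared["customer_id"])
```

In this sketch the third party only ever sees the tokens, while the mapping table stays in the source system, so analysis results keyed by token can be translated back to the original customers there. Encrypting the sensitive fields with a secret key would be an alternative that avoids storing a table, at the cost of key management.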