Many websites on the internet contain malware. Some are malicious by design, while others are themselves victims of a compromise, but both kinds can infect their visitors. Effective analysis of these malicious websites can help researchers discover previously unknown, partially or entirely new forms of malware.
The ITWEF system of the CrySyS laboratory at BME is capable of identifying malware infections from lists of potentially malicious URLs. It carries out dynamic analysis by recording traces of infection while visiting a suspicious site. The solution is based on URL feeds originating from third-party sources. Unfortunately, the huge quantity, poor quality (false positives, duplicates), and short lifespan of the URLs in the feeds make the analysis difficult.
My task is to filter, analyze and preprocess the lists of potentially malicious website URLs in order to improve the chance of finding malware. I implement a feed processing solution that utilizes various preprocessing methods to filter and prioritize the URLs.
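As an illustration of what filtering and prioritizing a feed can involve, the following is a minimal sketch and not the actual ITWEF implementation. The `FeedEntry` type, the `normalize` and `preprocess` functions, the seven-day age limit, and the newest-first ordering are all assumptions chosen for the example. It addresses two of the problems named above, duplicates and short-lived URLs, by normalizing URLs, dropping stale or non-HTTP entries, and ordering the rest by recency.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from urllib.parse import urlsplit, urlunsplit


@dataclass(frozen=True)
class FeedEntry:
    """Hypothetical feed record: a URL and the time the feed reported it."""
    url: str
    first_seen: datetime


def normalize(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, and default the path
    to "/" so that trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),  # simplification: lowercases the whole netloc
        parts.path or "/",
        parts.query,
        "",                    # fragments are never sent to the server
    ))


def preprocess(entries, now, max_age=timedelta(days=7)):
    """Return deduplicated URLs, most recently reported first.

    Entries that are not HTTP(S) or older than max_age (an assumed
    threshold) are discarded, since malicious URLs tend to be short-lived.
    """
    newest = {}  # normalized URL -> most recent report time
    for entry in entries:
        scheme = urlsplit(entry.url.strip()).scheme.lower()
        if scheme not in ("http", "https"):
            continue
        if now - entry.first_seen > max_age:
            continue
        key = normalize(entry.url)
        if key not in newest or entry.first_seen > newest[key]:
            newest[key] = entry.first_seen
    return sorted(newest, key=lambda u: newest[u], reverse=True)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    feed = [
        FeedEntry("HTTP://Example.com/a#top", now - timedelta(days=1)),
        FeedEntry("http://example.com/a", now - timedelta(hours=1)),
        FeedEntry("http://old.example/", now - timedelta(days=30)),
    ]
    for url in preprocess(feed, now):
        print(url)
```

Ordering by recency is only one possible priority; a real system might also weigh the reliability of the source feed or the number of feeds reporting the same URL.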