Malware similarity discovery has been a major research area in Information Technology since the early 2000s. Back then, malware were not as common as they are today. The spread of Information Technology provided cyber criminals with low defense and high value targets in healthcare, governmental institutions and industry. This lead to the evolution of defensive software. Since then, there is an ongoing battle between malware authors and antivirus vendors. Vendors implemented signature based detection and malware authors responded with millions of variants. This brought the need for malware similarity discovery.
The similarity of samples is very diverse. There are some that only differ in a few bytes and there are others that appear to have nothing in common. Still the same origin is present in all similar samples. Thus the discovery of every similarity is a difficult task but not impossible.
Manual inspection is not applicable at this scale so automatic methods have been developed. Some applications required fast performance and so used static analysis. Others needed low false positive ratio so dynamic techniques were implemented. As the latter generally takes longer time to finish my focus is limited to static approaches.
Solutions based on machine learning provided useful features that are able to represent a maliciousness of a file or family membership. They also proved that static analysis is able to deliver high accuracy. Analytical solutions use features based on the file structure. Some work on binary streams, some require the assembly code of the programs. But even though these methods are automatic their main focus is on accuracy, not on throughput. Thus they are too slow to be applied in a large malware database.
Locality Sensitive Hashing (LSH) focuses on capturing the similarity of binary sequences. Most of the methods were developed to work on diverse file types and were not thoroughly tested on executables. One of them is SSDEEP, the current state-of-the-art LSH method. My evaluation on SSDEEP showed it’s weaknesses and borders of it’s applicability. TLSH and LZJD are more recent methods and provide more precise results with higher confidence. My research shows that TLSH is much preferable to SSDEEP. LZJD performs nearly as well as TLSH but it’s slow digest calculation and hash comparison time as well as it’s digest size make it inapplicable at large scale.
As my task is to make our database of over 300 000 000 samples searchable by similarity, I chose SSDEEP and TLSH for the final application. The results show that my solution is able to select similar samples from the database with high confidence in just a matter of minutes.