The steadily increasing number of malware variants is becoming a significant problem, clogging the input queues of automated analysis tools and polluting malware repositories. Automatic packers and polymorphic engines make it easy to generate variants: they can produce many distinct versions of a single executable using compression and encryption. Malware analysis tools and repositories rely on executable digests (hashes) to index malware programs and discard duplicates. Unfortunately, these digests differ for each malware variant. As a result, a great deal of time and resources is wasted analyzing, running, and storing numerous instances of almost identical programs. To address this problem, we require a more robust similarity measure that can quickly identify and filter these variants, avoiding repeated (costly) analyses that provide no additional insight to a malware analyst.
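To make the fragility of digests concrete, the following minimal sketch compares the SHA-256 digests of two byte strings that differ in a single bit. The byte strings are synthetic stand-ins rather than real executables, and the choice of SHA-256 is an assumption made for illustration; the text above does not name a specific digest.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Synthetic stand-in for an executable (assumption: not a real sample).
original = b"MZ" + bytes(range(256)) * 4

# A "variant" that differs from the original in exactly one bit.
mutable = bytearray(original)
mutable[100] ^= 0x01
variant = bytes(mutable)

# Count differing byte positions and compare the two digests.
differing = sum(a != b for a, b in zip(original, variant))
same_digest = digest(original) == digest(variant)

print("differing bytes:", differing)
print("digests equal:", same_digest)
```

Because a cryptographic digest is designed so that any change to the input yields an unrelated output, exact-match indexing cannot recognize the two inputs as near-duplicates.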
In this paper, we present a robust filter that quickly determines whether a malware program is similar to a previously seen sample. Compared to previous work, our similarity measure is efficient because it does not require the costly task of preliminary unpacking; instead, it operates directly on packed code. Our approach exploits the fact that current packers use compression and weak encryption schemes that do not break all connections between the original programs and their transformed versions (that is, some indicators of similarity between two original programs can still be extracted from their packed versions). In addition, we introduce a packer detection technique that distinguishes between different levels of protection, such as unpacked, compressed, encrypted, and multi-layer encrypted code. This allows us to configure (optimize) the sensitivity parameter of the similarity computation. We performed experiments on a large malware repository containing 795 thousand samples. Our results show that the similarity measure is highly effective in filtering out malware variants obtained by simple re-packing or re-encryption, and can reduce the number of samples that need to be analyzed by a factor of three to five.
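The abstract does not specify which features or thresholds the filter uses, so the following is only a minimal sketch of the general idea, not the method of this paper. It assumes that byte entropy serves as a crude indicator of the protection level and that the Jaccard index over byte bigrams serves as the similarity measure. The function names, the entropy cut-offs, and the per-level sensitivity values in `THRESHOLDS` are all hypothetical.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    probs = (count / total for count in Counter(data).values())
    return sum(p * math.log2(1 / p) for p in probs)


def protection_level(data: bytes, compressed_at: float = 6.5,
                     encrypted_at: float = 7.5) -> str:
    """Map entropy to a coarse protection level (cut-offs are hypothetical)."""
    entropy = byte_entropy(data)
    if entropy >= encrypted_at:
        return "encrypted"
    if entropy >= compressed_at:
        return "compressed"
    return "unpacked"


def ngram_set(data: bytes, n: int = 2) -> set:
    """Set of all byte n-grams that occur in the data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def similarity(a: bytes, b: bytes, n: int = 2) -> float:
    """Jaccard index of the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = ngram_set(a, n), ngram_set(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical sensitivity per protection level: stronger protection hides
# more structure, so a lower similarity is accepted as a match.
THRESHOLDS = {"unpacked": 0.9, "compressed": 0.7, "encrypted": 0.5}


def is_variant(candidate: bytes, known: bytes) -> bool:
    """Decide whether candidate should be filtered as a variant of known."""
    level = protection_level(candidate)
    return similarity(candidate, known) >= THRESHOLDS[level]


if __name__ == "__main__":
    # Synthetic data only (assumption: stands in for packed code).
    sample = bytes(range(256)) * 4
    modified = sample[:512] + b"\x90" * 16 + sample[512:]
    print(protection_level(sample), round(similarity(sample, modified), 3))
    print(is_variant(modified, sample))
```

Entropy alone cannot reliably separate compressed from encrypted data, and raw bigram sets do not survive re-encryption under a different key, so a practical filter would need more robust features than this sketch uses. The sketch only illustrates how a detected protection level can select the sensitivity of the similarity test.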