Antivirus programs traditionally scan a computer’s hard drive and analyze each file at a time. They consider multiple features drawn from the file’s content, behavior and relationships with other files on that computer (for example which program downloaded the suspicious file). And, because it is difficult to recognize the feature combinations that distinguish malware from benign programs, we often turn to machine learning to create models of maliciousness.
To identify suspicious programs, we could instead look for coordinated actions across several hosts. The figure below shows a real-world example. On the first day, three programs, from three different hosts, connect to a remote server and download more files to their hosts. Seven days later, they download files from a different server and after nine more days they do this again from a third server. The behavior of each downloader, considered separately, is not a cause for suspicion, as the act of downloading files from the Internet is not inherently malicious. However, their coordinated actions expose the fact that the downloaders are controlled remotely.
This lockstep behavior does not represent a proof that the downloaders engage in malware distribution. However, detecting coordinated actions is complementary to the existing machine learning techniques, which produce models, defined by multiple features, that can be difficult to interpret. In contrast, recognizing a lockstep pattern in a stream of Internet-wide download events yields an intuitive explanation of the underlying activity. Moreover, after we whitelist the known-benign downloaders (like software updaters), the remaining locksteps usually point to suspicious activities. In the previous example, the payloads downloaded include malware samples, potentially unwanted programs (PUPs, such as adware) and more downloaders. In our experiments with download events from 1.9 million hosts, the false positive rate was below 5%.
We built a system called Beewolf for detecting lockstep behaviors. (Beewolves are a species of wasp that hunts bees, which display group behaviors.) Beewolf uses graph mining techniques to detect silent delivery campaigns, involving remotely-controlled downloaders. However, these techniques work whenever we can link fairly stable nodes (the downloaders, in our case) to nodes that change often (the malware-hosting domains) and each link represents a timestamped event (the malware and PUP downloads). Technically, lockstep detection involves searching this graph for near-bicliques with edges created in a brief time window. Beewolf performs the computationally intensive operations incrementally, as it receives new nodes and edges, and then detects locksteps with an efficient linear algorithm. With optimizations, Beewolf can process 1 year’s worth of download events in less than 20 s. Exposing coordinated actions in this way is helpful in a range of security problems; for example, prior research has used lockstep detection to tackle fraudulent behavior in online social networks. Beewolf detects locksteps deterministically (it does not rely on machine learning), in an unsupervised fashion (it does not need seed nodes, known to be malicious), and was designed to work on streaming data. You can find more details in our paper [NDSS 2017], and in the longer technical report.
Lockstep detection also exposes new relationships in the underground economy. Many PUP downloaders have valid digital signatures, to avoid raising suspicions, and two recent measurement studies have shown that these downloaders do not usually deliver malware. However, we cannot yet conclude that the malware-delivery and PUP-delivery ecosystems are completely separate. We identify representative publishers (rep-pubs) for each lockstep that Beewolf detects, and we analyze the relationships among these rep-pubs. There are two types of relationships. Direct download relationships between publisher pairs occur when a program signed by one publisher downloads a program signed by the second publisher; the prior studies analyzed this type of relationship. Observing indirect relationships, among publishers caught in lockstep together, can overcome evasive strategies such as obtaining code-signing certificates under multiple names or using unsigned downloaders for malicious payloads. This allows us to find a larger overlap between the malware and PUP delivery ecosystems than reported in the previous studies. We release the inter-publisher relationship detected with Beewolf at http://www.beewolf.org.
Paper: [NDSS 2017] [Technical Report]
Dataset: http://www.beewolf.org
References
[NDSS 2017] B. J. Kwon, V. Srinivas, A. Deshpande, and T. Dumitraș, “Catching Worms, Trojan Horses and PUPs: Unsupervised Detection of Silent Delivery Campaigns,” in Network and Distributed System Security Symposium (NDSS), San Diego, CA, 2017.
PDF