More software vulnerabilities are discovered each year, and hundreds of public disclosures may occur on the same day. For example, the CVE database, which assigns unique identifiers to vulnerabilities in popular software, has adopted a new format that no longer caps the number of CVE IDs at 9,999 per year. On 14 October 2014, a wide range of vendors (including Microsoft, Adobe, and Oracle) disclosed 254 vulnerabilities. The security professionals who must respond to these disclosures face a key question: which vulnerabilities are likely to be exploited in the wild? (Hint: while many vulnerabilities have proof-of-concept exploits developed for the disclosure process, real-world attacks exploit only a few of them.)
These professionals could look to sources outside their organizations for answers. On social media sites like Twitter, a community of security vendors, system administrators, and hackers discusses vulnerabilities and exploits. This is a rich source of information: the participants in vulnerability disclosures discuss technical details about exploits (and may leak information before the planned public disclosure), and the victims of attacks share their experiences. Twitter analytics have also been used successfully to anticipate flu trends, movie revenues, stock prices, and earthquakes. Unlike in those applications, however, relatively few users (approximately 32,000) discuss vulnerability exploits on Twitter, and we lack a comprehensive ground truth (a broad list of vulnerabilities that are exploited in the real world) for making such predictions.
In 2014, we collected 1.1 billion security-related tweets, including 287,717 tweets with explicit references to CVE IDs, and we designed and implemented an exploit detector that addresses some of these challenges; the full details appear in our paper [USENIX Security 2015]. To implement the detector, we trained a machine learning classifier with 67 features extracted from Twitter data (e.g., specific words, numbers of retweets and replies, and information about the users posting these messages), from vulnerability databases (e.g., vulnerability category and specific references), and from vulnerability scoring systems (e.g., the CVSS score and its exploitability sub-score). Our system produces fewer false positives than approaches relying on vulnerability databases alone, improving detection precision by an order of magnitude, and it can detect exploits a median of 2 days ahead of existing data sets.
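As a rough sketch of this kind of pipeline (not our actual implementation, which uses 67 features and a different learning algorithm), one can train a simple linear classifier over a handful of hypothetical features; the feature names and the synthetic data below are illustrative only:

```python
# Toy sketch: a logistic-regression exploit detector over hypothetical
# per-vulnerability features. All names and data here are invented for
# illustration; the real detector uses 67 features from Twitter,
# vulnerability databases, and scoring systems.
import math
import random

random.seed(0)

FEATURES = ["num_tweets", "num_retweets", "cvss_score"]  # hypothetical subset

def make_example(exploited):
    # Synthetic data: exploited vulnerabilities tend to score higher on
    # discussion-volume features in this toy model.
    base = 5.0 if exploited else 1.0
    return ([random.gauss(base, 1.0) for _ in FEATURES], 1 if exploited else 0)

data = [make_example(i % 4 == 0) for i in range(200)]  # ~25% positives

def sigmoid(z):
    if z < -30.0:
        return 0.0
    if z > 30.0:
        return 1.0
    return 1.0 / (1.0 + math.exp(-z))

# Logistic regression trained with plain stochastic gradient descent.
w = [0.0] * len(FEATURES)
b = 0.0
lr = 0.05
for _ in range(300):
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        g = p - y  # gradient of the log-loss w.r.t. the logit
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

def predict(x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5

accuracy = sum(predict(x) == (y == 1) for x, y in data) / len(data)
```

On this cleanly separated synthetic data the toy model classifies nearly all examples correctly; the real task is far noisier, which is why precision (rather than accuracy) is the metric that matters.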
But there is one more complication: Twitter is a free and open service, and (almost) all the data sources we used are publicly available. We wondered whether a smart adversary could poison our detector by using multiple Twitter accounts to inject false information. The adversary could infer the most useful features for our classifier even if we tried to keep them secret (because the data is public) and could poison our training data set (because Twitter accounts are free, and accounts with specific characteristics can be bought in bulk on underground markets). We defined three adversaries: a blabbering adversary, which injects random noise; a word copycat adversary, which uses fraudulent accounts to mimic the word features of exploited vulnerabilities; and a full copycat adversary, which uses fraudulent accounts with many followers and can prompt retweets. Our evaluation suggests that the first two adversaries inflict limited damage, while defending against the third would require maintaining a whitelist of reputable and informative users to prevent poisoning attacks (this is feasible, as the information most useful to our classifier comes from only 4,335 Twitter users). More importantly, these results illustrate the worst-case damage an adversary can inflict on a machine learning system that uses undisclosed features, ground-truth sources, or hyperparameters (as is common in industry), and the extent to which the integrity of such systems depends on the secrecy of their features.
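One plausible way to model a word-copycat-style poisoning attack (again a toy sketch under invented assumptions, not our evaluation setup) is to inject training points whose features mimic the exploited class but carry a negative ground-truth label, and then observe how detection degrades:

```python
# Toy poisoning sketch: fraudulent accounts tweet exploit-sounding words
# about NON-exploited vulnerabilities, so the poisoned training points have
# positive-looking features but negative labels. Data and classifier are
# illustrative stand-ins, not the paper's SVM or data set.
import random

random.seed(1)

def sample(exploited):
    # Two synthetic word-count features; exploited CVEs score higher on both.
    base = 4.0 if exploited else 1.0
    return ([random.gauss(base, 1.0), random.gauss(base, 1.0)], exploited)

train = [sample(i % 4 == 0) for i in range(400)]
test = [sample(i % 4 == 0) for i in range(400)]

# The copycat's injected tweets: exploit-like features, "not exploited" label.
poison = [([random.gauss(4.0, 1.0), random.gauss(4.0, 1.0)], False)
          for _ in range(300)]

def nearest_centroid(data):
    # Deliberately simple classifier: assign the nearer class centroid.
    def centroid(label):
        pts = [x for x, y in data if y == label]
        return [sum(c) / len(pts) for c in zip(*pts)]
    cpos, cneg = centroid(True), centroid(False)
    return lambda x: (sum((a - b) ** 2 for a, b in zip(x, cpos))
                      < sum((a - b) ** 2 for a, b in zip(x, cneg)))

def recall(predict, data):
    pos = [x for x, y in data if y]
    return sum(map(predict, pos)) / len(pos)

recall_clean = recall(nearest_centroid(train), test)
recall_poisoned = recall(nearest_centroid(train + poison), test)
```

In this toy model the poison drags the "not exploited" centroid toward the exploited region, so the poisoned detector starts missing real exploits (its recall drops), which mirrors the intuition behind the copycat adversaries, though the actual damage we measured depends on the real feature set and classifier.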
Vulnerability scoring systems (e.g., CVSS, Microsoft's exploitability index, or Adobe's priority ratings) err on the side of caution by marking many vulnerabilities as likely to be exploited, which causes false positives. For example, with CVSS and its sub-scores, detection precision does not exceed 9%.
In 2014, Soska and Christin published a paper on predicting whether a given, not yet compromised, website will become malicious in the future. Unlike for general software vulnerabilities, exploits of known Web vulnerabilities can be identified systematically through repeated Internet scans, which allowed these researchers to generate ground truth automatically.
Bozorgi et al. published a paper on using features of the vulnerabilities disclosed between 1991 and 2007, extracted from vulnerability databases, to predict exploits. This paper focuses on predicting proof-of-concept exploits rather than exploits used in real attacks. What's more, we reexamined this experiment in our paper (using Twitter instead of some data sources that are no longer available) and could not reproduce the reported performance. This is because of important changes in the threat landscape since 2007: today, proof-of-concept exploits are less centralized (we have found links to exploits published on blogs or mailing lists rather than in vulnerability databases), and a lower fraction of disclosed vulnerabilities is exploited.
[USENIX Security 2015] C. Sabottke, O. Suciu, and T. Dumitraş, “Vulnerability disclosure in the age of social media: Exploiting Twitter for predicting real-world exploits,” in USENIX Security Symposium (USENIX Security), Washington, DC, 2015.