ENEE759D :: Project Ideas

The aim of the project is to solve a specific security problem through data analysis techniques. The emphasis is on adequately supporting the conclusion through empirical results and reasoning about the implications, taking into account the factors that could threaten the validity of the conclusion. (Hint: a graphical representation is useful, but it is not a rigorous proof.) It is more important to answer one open question in depth than to analyze several questions and present superficial answers. The teams working on group projects shall follow good software engineering practices, and the code and data sets shall be adequately documented to support further development.

Growth of vulnerabilities and vulnerability exploits over time

Project objectives

It is widely believed that building bug-free software is a practical impossibility, and several studies have shown that the number of bugs tends to grow with the size of the code. This trend was also observed for security vulnerabilities, which are the bugs that can be exploited by a cyber attacker who wants to take over the system. However, fewer than 50% of vulnerabilities are actually exploited in real-world attacks, and it is not clear if the same trend holds for vulnerability exploits.

The goal of this project is to analyze the available data on vulnerabilities and vulnerability exploits to determine their growth rates and the factors that influence these rates. Software vulnerabilities are tracked systematically in several vulnerability databases, but information about exploits usually needs to be pieced together from several data sources, including penetration-testing software, descriptions of anti-virus and intrusion-detection signatures from popular anti-virus vendors, and lists of vulnerabilities targeted by black-market exploit kits.

Research questions

Has the number of vulnerabilities in popular software (e.g., Microsoft Windows, Adobe Acrobat, Java) increased or decreased over the past 10 years?
Has the number of vulnerabilities exploited increased or decreased?
Do the numbers of vulnerabilities and exploits grow with the size of the code?
What was the impact of technologies that render exploits less likely to succeed, such as address-space layout randomization (ASLR) or data-execution prevention (DEP)?
What other factors influence vulnerability and exploit trends?
How do the vulnerability scores (e.g., severity, impact vector) evolve over time?
How quickly are patches released?
How long do vulnerabilities live?
Which vulnerabilities pose the highest risk to users and how does this evolve over time?

A good project will not try to answer all these questions, but will analyze some of them in depth to rigorously support the final conclusion of the project.
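
As a starting point for the first two questions, the sketch below counts disclosures per calendar year and fits an ordinary least-squares slope to the yearly counts. The records, the identifiers, and the function names yearly_counts and linear_slope are all hypothetical and serve only as illustration; real records would have to be extracted from one of the data sets listed below.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (vulnerability identifier, disclosure date).
# These are placeholders, not entries from any real database.
records = [
    ("VULN-A", date(2009, 3, 1)),
    ("VULN-B", date(2009, 7, 15)),
    ("VULN-C", date(2010, 1, 20)),
    ("VULN-D", date(2010, 5, 2)),
    ("VULN-E", date(2010, 11, 30)),
    ("VULN-F", date(2011, 2, 14)),
    ("VULN-G", date(2011, 6, 6)),
    ("VULN-H", date(2011, 8, 19)),
    ("VULN-I", date(2011, 12, 25)),
]


def yearly_counts(recs):
    """Count disclosures per calendar year, sorted by year."""
    counts = Counter(disclosed.year for _, disclosed in recs)
    return dict(sorted(counts.items()))


def linear_slope(counts):
    """Ordinary least-squares slope of count versus year."""
    years = list(counts)
    if len(years) < 2:
        raise ValueError("need at least two distinct years to fit a trend")
    values = [counts[y] for y in years]
    mean_x = sum(years) / len(years)
    mean_y = sum(values) / len(values)
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    return sxy / sxx


counts = yearly_counts(records)
slope = linear_slope(counts)  # change in disclosures per year
print(counts, slope)
```

A positive slope on a handful of points is not a conclusion. A rigorous analysis would also quantify the uncertainty of the fit and consider threats to validity, such as changes in how vulnerabilities are reported over time.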

Potential data sets

National Vulnerability Database (NVD): Structured information about vulnerability characteristics and their disclosure timeline
http://nvd.nist.gov/download.cfm
List of Metasploit modules: Proof-of-concept exploits, used in penetration testing
http://www.rapid7.com/db/modules
Exploit DB: Public archive of exploits
http://www.exploit-db.com/
Open Source Vulnerability Database (OSVDB): Vulnerability database, aggregating some information from NVD and ExploitDB
http://www.osvdb.org/
Symantec’s anti-virus signatures: Descriptions of host-based attacks (e.g. viruses, worms) observed in the real world, some of them exploiting known vulnerabilities
http://www.symantec.com/security_response/landing/azlisting.jsp?
Symantec’s intrusion-detection signatures: Descriptions of network-based attacks (e.g., remote buffer overflow exploits, denial of service) observed in the real world, some of them exploiting known vulnerabilities
http://www.symantec.com/security_response/attacksignatures/

The students are encouraged to identify other data sets that could be useful for this project.
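
Because exploit information has to be pieced together from several sources, one plausible first step is to join the sources on a shared vulnerability identifier. The sketch below assumes that each source has already been reduced to a set of identifiers; the identifiers shown are placeholders, and the function name exploited_fraction is hypothetical.

```python
# Placeholder identifiers, not real vulnerability entries.
vulnerabilities = {
    "CVE-0000-0001",
    "CVE-0000-0002",
    "CVE-0000-0003",
    "CVE-0000-0004",
    "CVE-0000-0005",
}

# Hypothetical exploit evidence, one set of identifiers per source.
pentest_modules = {"CVE-0000-0001", "CVE-0000-0002"}
attack_signatures = {"CVE-0000-0002", "CVE-0000-0003", "CVE-0000-0009"}


def exploited_fraction(vulns, *exploit_sources):
    """Return (fraction, exploited set) for vulns seen in any exploit source."""
    if not vulns:
        raise ValueError("the vulnerability set is empty")
    # Identifiers missing from the vulnerability set are ignored.
    exploited = set().union(*exploit_sources) & vulns
    return len(exploited) / len(vulns), exploited


fraction, exploited = exploited_fraction(
    vulnerabilities, pentest_modules, attack_signatures
)
print(f"{fraction:.0%} of the listed vulnerabilities have a known exploit")
```

Note that a proof-of-concept module and a signature for an attack observed in the real world are different kinds of evidence, so a careful analysis would report them separately rather than only as a union.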

Malware families

Symantec estimates that 403 million new malware samples were created in 2011 (that’s more than 1 million new samples per day!). This explosion is driven by automated systems to pack and obfuscate malware code, in order to evade detection by anti-virus products. Cyber attackers start from a small number of malware families available and modify them with the goal of delivering a unique binary to each host infected.

The goal of this project is to group malware samples into malware families using publicly available meta-data about malware (but not by analyzing the malware code directly). It is challenging to place malware samples into families because of the fast-paced evolution of malware and because of the volume of new malware samples created each day. It is not possible to reverse engineer each unknown binary found on the Internet. The membership and evolution of malware families must therefore be reconstructed from research projects on malware clustering, sites that accept uploads of unknown binaries for scanning and the signatures developed by anti-virus vendors (note that such signatures are often generic, and they don’t always correspond to well-defined malware families).

Research questions

How can we group malware samples into malware families automatically, by using publicly available meta-data on malware (e.g. scanning results, AV signature descriptions)?
How do malware families evolve over time?
How are zero-day exploits included in exploit kits and used in large-scale attacks after the vulnerability’s disclosure?

A good project will not try to answer all these questions, but will analyze some of them in depth to rigorously support the final conclusion of the project.
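
One possible baseline for the first question is sketched below: it tokenizes the anti-virus labels reported for each sample, discards generic tokens, and merges samples whose token sets have a high Jaccard similarity (single-linkage grouping with a union-find structure). The scan results, the family names, the GENERIC token list, and the 0.5 threshold are all assumptions made for illustration; this is not a method prescribed by the project.

```python
import re
from itertools import combinations

# Hypothetical scan results: sample id -> labels from different AV products.
scans = {
    "sample1": ["Trojan.Foobot.A", "W32/Foobot.gen"],
    "sample2": ["Trojan.Foobot.B", "Foobot!variant"],
    "sample3": ["Worm.Barworm", "W32/Barworm.x"],
    "sample4": ["Barworm.gen!C"],
}

# Assumed list of tokens that carry no family information.
GENERIC = {"trojan", "worm", "gen", "variant", "w32", "win32", "generic"}


def tokens(labels):
    """Lowercase alphanumeric tokens, minus short and generic ones."""
    found = set()
    for label in labels:
        for tok in re.split(r"[^a-z0-9]+", label.lower()):
            if len(tok) > 2 and tok not in GENERIC:
                found.add(tok)
    return found


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(scan_results, threshold=0.5):
    """Group samples whose label tokens are similar (single linkage)."""
    feats = {s: tokens(labels) for s, labels in scan_results.items()}
    parent = {s: s for s in scan_results}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(scan_results, 2):
        if jaccard(feats[a], feats[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in scan_results:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


families = cluster(scans)
print(families)
```

The result is sensitive to the generic-token list and to the threshold, so any grouping produced this way would need to be validated against a labeled reference set before drawing conclusions.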

Potential data sets

The Malicia Project: Approximately 12,000 malicious files, carefully clustered into families
http://malicia-project.com/dataset.html
VirusTotal: Commercial file-scanning service that provides a REST API for retrieving information about the files scanned (e.g. the signatures triggered by each file on multiple anti-virus products)
https://www.virustotal.com/en/documentation/public-api/
Anubis: Research file-scanning service that performs a dynamic analysis (i.e. it executes the binary in a sandbox and reports its behavior)
http://anubis.iseclab.org/?action=home
Symantec’s anti-virus signatures: Descriptions of host-based attacks (e.g. viruses, worms) observed in the real world, some of them exploiting known vulnerabilities
http://www.symantec.com/security_response/landing/azlisting.jsp?
Symantec’s intrusion-detection signatures: Descriptions of network-based attacks (e.g., remote buffer overflow exploits, denial of service) observed in the real world, some of them exploiting known vulnerabilities
http://www.symantec.com/security_response/attacksignatures/

The students are encouraged to identify other data sets that could be useful for this project.