The aim of the project is to solve a specific security problem through data analysis techniques. The emphasis is on adequately supporting the conclusion through empirical results and reasoning about the implications, taking into account the factors that could threaten the validity of the conclusion. (Hint: a graphical representation is useful, but it is not a rigorous proof.) It is more important to answer one open question in depth than to analyze several questions and present superficial answers. Teams working on group projects shall follow good software engineering practices, and the code and data sets shall be adequately documented to support further development.

Growth of vulnerabilities and vulnerability exploits over time

Project objectives

It is widely believed that building bug-free software is a practical impossibility, and several studies have shown that the number of bugs tends to grow with the size of the code. This trend has also been observed for security vulnerabilities, which are bugs that can be exploited by a cyber attacker who wants to take over the system. However, fewer than 50% of vulnerabilities are actually exploited in real-world attacks, and it is not clear whether the same growth trend holds for vulnerability exploits.

The goal of this project is to analyze the available data on vulnerabilities and vulnerability exploits to determine their growth rates and the factors that influence these rates. Software vulnerabilities are tracked systematically in several vulnerability databases, but information about exploits usually needs to be pieced together from several data sources, including penetration-testing software, descriptions of anti-virus and intrusion-detection signatures from popular anti-virus vendors, and lists of vulnerabilities targeted by black-market exploit kits.
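As a starting point for comparing growth rates, the following minimal sketch fits an exponential trend to yearly counts by least squares on the logarithm of the counts. The exponential model, the function name `exponential_growth_rate`, and the counts are assumptions made for illustration only; the numbers are synthetic placeholders, not measurements from any vulnerability database.

```python
import math

def exponential_growth_rate(years, counts):
    """Fit log(count) = a + r * year by least squares and return r.

    Assumes every count is strictly positive and that growth is roughly
    exponential; both assumptions should be checked against real data.
    """
    logs = [math.log(c) for c in counts]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    return sxy / sxx

# Synthetic placeholder series (NOT real data): counts that grow by a fixed
# factor each year, so the fitted rate can be checked by hand.
years = [2000, 2001, 2002, 2003, 2004]
vulnerabilities = [100 * 1.5 ** i for i in range(5)]
exploits = [20 * 1.2 ** i for i in range(5)]

rate_vulnerabilities = exponential_growth_rate(years, vulnerabilities)
rate_exploits = exponential_growth_rate(years, exploits)

# exp(rate) is the estimated year-over-year multiplication factor.
print(math.exp(rate_vulnerabilities), math.exp(rate_exploits))
```

A difference between the two fitted rates is only a first observation. Supporting a conclusion would still require checking how well the model fits, quantifying the uncertainty of each estimate, and considering threats to validity such as incomplete exploit data.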

Research questions

A good project will not try to answer all these questions, but will analyze some of them in depth to rigorously support the final conclusion of the project.

Potential data sets

The students are encouraged to identify other data sets that could be useful for this project.

Malware families

Symantec estimates that 403 million new malware samples were created in 2011 (that’s more than 1 million new samples per day!). This explosion is driven by automated systems that pack and obfuscate malware code in order to evade detection by anti-virus products. Cyber attackers start from a small number of available malware families and modify them with the goal of delivering a unique binary to each infected host.

The goal of this project is to group malware samples into malware families using publicly available metadata about malware (but not by analyzing the malware code directly). It is challenging to place malware samples into families because of the fast-paced evolution of malware and because of the volume of new malware samples created each day. It is not feasible to reverse engineer each unknown binary found on the Internet. The membership and evolution of malware families must therefore be reconstructed from research projects on malware clustering, sites that accept uploads of unknown binaries for scanning, and the signatures developed by anti-virus vendors (note that such signatures are often generic, and they don’t always correspond to well-defined malware families).
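One possible way to approach the grouping step is sketched below: two samples are linked when the label tokens attached to them overlap strongly (Jaccard similarity), and the connected components of the resulting graph are treated as candidate families. The similarity measure, the threshold of 0.5, the function names, and the sample identifiers and tokens are all assumptions chosen for illustration; they do not come from any real data set or vendor naming scheme.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_samples(labels, threshold=0.5):
    """Group samples whose label-token sets are similar.

    labels: dict mapping a sample id to a set of label tokens.
    Returns a list of sets of sample ids (connected components).
    """
    parent = {s: s for s in labels}

    def find(s):
        # Union-find lookup with path halving.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Pairwise comparison is quadratic in the number of samples.
    for a, b in combinations(labels, 2):
        if jaccard(labels[a], labels[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in labels:
        groups.setdefault(find(s), set()).add(s)
    return sorted(groups.values(), key=lambda g: sorted(g))

# Hypothetical samples and label tokens, purely for illustration.
samples = {
    "sample_a": {"alpha", "trojan", "spy"},
    "sample_b": {"alpha", "trojan"},
    "sample_c": {"beta", "worm"},
    "sample_d": {"beta", "worm", "net"},
    "sample_e": {"generic"},
}
families = group_samples(samples)
print(families)
```

In this toy input the sample carrying only a generic label stays in a group of its own, which reflects the caveat above about generic signatures. The pairwise comparison would not scale to the daily volume of new samples, so a real analysis would need blocking, sampling, or an approximate similarity search.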

Research questions

A good project will not try to answer all these questions, but will analyze some of them in depth to rigorously support the final conclusion of the project.

Potential data sets

The students are encouraged to identify other data sets that could be useful for this project.