The FeatureSmith Dataset

This site provides supplemental information for the paper FeatureSmith: Automatically Engineering Features for Malware Detection by Mining the Security Literature, by Ziyun Zhu and Tudor Dumitraș. The paper describes the design of a system that can generate, without human intervention, features for training machine learning classifiers to detect Android malware. FeatureSmith achieves this by synthesizing the security knowledge described in natural language documents, such as papers published in security conferences and journals.

The key data structure in FeatureSmith is the semantic network, which encodes the knowledge about malware behaviors reflected in our corpus of documents. The semantic network has three types of nodes:

Known malware families (e.g. Gappusin)
Abstract malicious behaviors (e.g. "steal sensitive data")
Concrete features that can be extracted from Android apps through static analysis (e.g. the SEND_SMS permission).

The links among these nodes reflect concepts that are semantically related and allow us to rank the features according to how useful they are for detecting malicious behaviors.

To stimulate further research into semantics-aware security, we release our semantic network and the automatically generated features.

We are making our data available to the security community. If you use the FeatureSmith data set in your research, please cite our paper instead of linking to this page:

Z. Zhu and T. Dumitraș, "FeatureSmith: Automatically Engineering Features for Malware Detection by Mining the Security Literature," in ACM Conference on Computer and Communications Security (CCS), Vienna, AT, 2016.

Data Sources and Collection Method

Our primary data source consists of scientific papers. We use these papers to extract Android malware behaviors and to construct the semantic network. From the electronic proceedings distributed to conference participants, we collect the papers from the IEEE Symposium on Security and Privacy (S&P'08–S&P'15), the Computer Security Foundations Symposium (CSF'00–CSF'14), and USENIX Security (Sec'11). We complement this corpus by searching Google Scholar with the keywords “Android malware” and downloading the PDF files whenever the query results provide a download link. This process may yield duplicates if a returned paper already exists in our corpus. We therefore record the hash of every paper in our corpus and discard a PDF document if its file hash already exists in the data set. In total, our corpus includes 1,068 documents. The table below summarizes our document corpus.

Document source	count
IEEE S&P	465
IEEE CSF	437
USENIX Sec	35
Google Scholar	241
Total	1,068
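
The hash-based de-duplication described above can be sketched as follows. This is a minimal illustration and not the authors' code: the text does not say which hash function was used, so SHA-256 is an assumption, and the function names file_sha256 and deduplicate are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes (hash choice is an assumption)."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large PDF files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deduplicate(paths):
    """Keep the first file seen for each distinct content hash."""
    seen = set()
    unique = []
    for path in paths:
        file_hash = file_sha256(Path(path))
        if file_hash in seen:
            continue  # a file with identical bytes is already in the corpus
        seen.add(file_hash)
        unique.append(path)
    return unique
```

A file hash only detects byte-identical copies. Two different PDF renderings of the same paper would both be kept under this scheme.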

We extract the text from the papers, which are in PDF format, for later processing. We convert the PDF files to text with the Python package pdfminer, and we develop several heuristics to address problems with the extracted text.

We collect the malware family names from both the Drebin dataset and from a list of malware families caught by the Mobile-Sandbox analysis platform.

Malware source	count
Drebin	180
Mobile-Sandbox	210
Total	280

In total, we collect 280 malware names. We utilize these names when mining the papers on Android malware to identify sentences that discuss malicious behaviors. In addition to the concrete family names, we also utilize the term “malware” and its variants for this purpose.
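
As a rough sketch of how sentences that mention malware could be selected, the snippet below matches family names and generic terms with a regular expression. It assumes the text has already been split into sentences. The list GENERIC_TERMS and both function names are hypothetical, since the exact variants of “malware” used by FeatureSmith are not listed here.

```python
import re

# Hypothetical variants of the generic term; the real list is not given in the text.
GENERIC_TERMS = ["malware", "malwares"]


def build_malware_pattern(family_names, generic_terms=GENERIC_TERMS):
    """Compile a case-insensitive pattern matching any family name or generic term."""
    alternatives = [re.escape(term) for term in list(family_names) + list(generic_terms)]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def sentences_mentioning_malware(sentences, pattern):
    """Return the sentences that contain at least one match, in their original order."""
    return [sentence for sentence in sentences if pattern.search(sentence)]
```

For example, with the family name Gappusin in the list, a sentence containing "Gappusin" or "malware" is kept, while a sentence with neither is dropped.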

We select permissions, intents, and API calls as potential features for malware detection. We collect all the permissions, intents, and API calls from the Android developer documentation (API level 23). We then ignore the class name of each feature, because we have found that most papers do not mention class names. Finally, based on the names, we remove some of the features that could cause ambiguity, such as "length". The features are summarized below.

Feature source	count
permissions	132
intents	189
API methods	11,373
Total	11,694
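
The name normalization described above might look like the following sketch. It assumes that features are given as dot-separated qualified names and that ambiguous names are filtered with a fixed set. Only "length" comes from the text; the set name AMBIGUOUS_NAMES and the function names are hypothetical.

```python
# Names considered too ambiguous to keep; only "length" is taken from the text.
AMBIGUOUS_NAMES = {"length"}


def short_name(qualified_name):
    """Drop the package and class prefix, keeping only the last dotted component."""
    return qualified_name.rsplit(".", 1)[-1]


def candidate_features(qualified_names, ambiguous=AMBIGUOUS_NAMES):
    """Return the sorted, de-duplicated short names that are not ambiguous."""
    features = set()
    for qualified_name in qualified_names:
        name = short_name(qualified_name)
        if name.lower() in ambiguous:
            continue  # e.g. "length" matches too many unrelated mentions
        features.add(name)
    return sorted(features)
```

Because the class name is dropped, methods with the same name in different classes collapse into a single feature, which is consistent with the way papers usually refer to them.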

Note that not all the malware families and features are mentioned in our document corpus, so the actual number of malware families and features in the semantic network is lower than the counts shown above.

Semantic Network

FeatureSmith represents the knowledge about malware behavior using a 3-layer semantic network. The data is stored in a SQLite database, which consists of 3 tables: mal2behav, feat2behav, and behaviorScore, described below.

mal2behav

This table describes the links between known malware families and malware behaviors.

Column	Description
malware	The name of the malware family
behavior	Description of the behavior, which consists of a subject, a verb, and an object. We use "-->" as the delimiter for the three components. For example, "-->delete-->user_data" means that the verb is "delete" and the object is "user data". In this case, the subject is NULL.
colocation	How many times the malware and the behavior appear together within a certain window. In our experiment, we set the window to be 3 sentences.
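
The behavior format can be parsed with a few lines of Python. This is a minimal sketch based only on the example above: it assumes that every behavior string has exactly three components, that an empty component stands for NULL, and that underscores stand for spaces. The names Behavior and parse_behavior are hypothetical.

```python
from typing import NamedTuple, Optional

DELIMITER = "-->"


class Behavior(NamedTuple):
    subject: Optional[str]
    verb: Optional[str]
    obj: Optional[str]


def parse_behavior(text):
    """Split a behavior string into (subject, verb, object); empty parts become None."""
    parts = text.split(DELIMITER)
    if len(parts) != 3:
        raise ValueError(f"expected 3 components, got {len(parts)}: {text!r}")
    # Underscores encode spaces in multi-word components (assumed from the example).
    cleaned = [part.replace("_", " ") if part else None for part in parts]
    return Behavior(*cleaned)
```

With this sketch, parse_behavior("-->delete-->user_data") yields a subject of None, the verb "delete", and the object "user data".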

feat2behav

This table describes the links between malware behaviors and app features.

Column	Description
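feature	The name of the feature
behavior	The same as in the mal2behav table
colocation	How many times the feature and the behavior appear together within a certain window. In our experiment, we set the window to be 3 sentences.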
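
One possible reading of the colocation count is sketched below. The text does not define the window precisely, so this sketch assumes that two mentions co-occur when their sentence indices differ by less than the window size, and that mentions are found by case-insensitive substring search. The function name colocation_count is hypothetical, and the actual FeatureSmith counting may differ.

```python
def colocation_count(sentences, term_a, term_b, window=3):
    """Count pairs of mentions of term_a and term_b that fall within `window` sentences."""
    lowered = [sentence.lower() for sentence in sentences]
    a_positions = [i for i, s in enumerate(lowered) if term_a.lower() in s]
    b_positions = [i for i, s in enumerate(lowered) if term_b.lower() in s]
    # A pair counts when both sentences fit inside a span of `window` consecutive sentences.
    return sum(1 for i in a_positions for j in b_positions if abs(i - j) < window)
```

Under this reading, a feature mentioned two sentences before a behavior counts once, while a mention three or more sentences away does not count.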

behaviorScore

This table describes the initial weight for the behaviors, which we use when learning which behaviors are most closely related to Android malware.

Column	Description
behavior	The same as mal2behav table
score	The initial weight (score) of the behavior

You can download the database here.
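
To illustrate how the three tables fit together, the sketch below builds a small in-memory database with the same table and column names and runs one query with Python's sqlite3 module. The column types, the placeholder rows, and the function names are assumptions made for illustration; they are not taken from the released database, and the query is not the ranking method from the paper.

```python
import sqlite3


def open_demo_network():
    """Create an in-memory database that mirrors the documented tables (types assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE mal2behav (malware TEXT, behavior TEXT, colocation INTEGER);
        CREATE TABLE feat2behav (feature TEXT, behavior TEXT, colocation INTEGER);
        CREATE TABLE behaviorScore (behavior TEXT, score REAL);
        """
    )
    # Placeholder rows, made up for this demo only.
    behavior = "-->delete-->user_data"
    conn.execute("INSERT INTO mal2behav VALUES (?, ?, ?)", ("example_family", behavior, 3))
    conn.executemany(
        "INSERT INTO feat2behav VALUES (?, ?, ?)",
        [("example_feature_1", behavior, 2), ("example_feature_2", behavior, 5)],
    )
    conn.execute("INSERT INTO behaviorScore VALUES (?, ?)", (behavior, 0.5))
    return conn


def features_for_behavior(conn, behavior):
    """List (feature, colocation, behavior score) for one behavior, strongest link first."""
    query = (
        "SELECT f.feature, f.colocation, s.score "
        "FROM feat2behav AS f "
        "JOIN behaviorScore AS s ON s.behavior = f.behavior "
        "WHERE f.behavior = ? "
        "ORDER BY f.colocation DESC, f.feature"
    )
    return conn.execute(query, (behavior,)).fetchall()
```

To run the same query against the downloaded file, replace ":memory:" with the path to that file and skip the CREATE TABLE and INSERT statements.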

Acknowledgments

This research was partially supported by the National Science Foundation (grant 5-244780) and by the Maryland Procurement Office (contract H98230-14-C-0127). This website represents the position of the authors and not that of the aforementioned agencies.

The following institutions have been given access to the data:

University of Illinois at Urbana-Champaign, USA
University of Melbourne, Australia
Shanghai Jiao Tong University, China
University of Virginia, USA
The Hong Kong Polytechnic University, China
Chinese Academy of Sciences, China
Bangladesh University of Engineering & Technology, Bangladesh
University of Maryland, Baltimore County, USA
Daffodil International University, Bangladesh
FAST National University of Computer and Emerging Sciences, Peshawar, Pakistan
University of Technology Sydney, Australia
Arizona State University, USA
University of Luxembourg, Luxembourg
University of British Columbia, Canada
Universiti Teknikal Malaysia Melaka, Malaysia