This site provides supplemental information for the paper FeatureSmith: Automatically Engineering Features for Malware Detection by Mining the Security Literature, by Ziyun Zhu and Tudor Dumitraș. This paper describes the design of a system that can generate, without human intervention, features for training machine learning classifiers to detect Android malware. FeatureSmith achieves this by synthesizing the security knowledge described in natural language documents, such as papers published in security conferences and journals.
The key data structure in FeatureSmith is the semantic network, which encodes the knowledge about malware behaviors reflected in our corpus of documents. The semantic network has three types of nodes: malware families, malware behaviors, and features (e.g. the SEND_SMS permission). The links among these nodes reflect concepts that are semantically related and allow us to rank the features according to how useful they are for detecting malicious behaviors.
To stimulate further research into semantics-aware security, we release our semantic network and the automatically generated features.
We are making our data available to the security community. If you use the FeatureSmith data set in your research, don't link to this page; instead, please cite our paper:
Z. Zhu and T. Dumitraș, "FeatureSmith: Automatically Engineering Features for Malware Detection by Mining the Security Literature," in ACM Conference on Computer and Communications Security (CCS), Vienna, AT, 2016.
Our primary data source consists of scientific papers. We use these papers to extract Android malware behaviors and to construct the semantic network. From the electronic proceedings distributed to conference participants, we collect the papers from the IEEE Symposium on Security and Privacy (S&P'08–S&P'15), the Computer Security Foundations Symposium (CSF'00–CSF'14), and USENIX Security (Sec'11). We complement this corpus by searching Google Scholar with the keywords “Android malware” and downloading the PDF files for which the query results provide a download link. This process may produce duplicates if a returned paper already exists in our corpus. Therefore, we record the hash of every paper in our corpus and discard a PDF document if its file hash already exists in the data set. In total, our corpus includes 1,068 documents. The table below summarizes our document corpus.
Document source | count |
---|---|
IEEE S&P | 465 |
IEEE CSF | 437 |
USENIX Sec | 35 |
Google Scholar | 241 |
Total | 1,068 |
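The hash-based duplicate removal described above can be sketched as follows. This is a minimal illustration, not the code used for the paper: the page does not say which hash function was used, so SHA-256 is an assumption, and the function names `file_digest` and `deduplicate` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes (hash choice is an assumption)."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def deduplicate(paths):
    """Keep the first file seen for each distinct content hash."""
    seen = set()
    unique = []
    for path in paths:
        path = Path(path)
        digest = file_digest(path)
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique
```

Note that a file hash only detects byte-identical copies; two different PDF renderings of the same paper would both be kept.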
We extract the text from the papers, which are in PDF format, for later processing. We convert the PDF files to text with the Python package pdfminer, and we develop several heuristics to address problems that arise during extraction.
We collect the malware family names from both the Drebin dataset and from a list of malware families caught by the Mobile-Sandbox analysis platform.
Malware source | count |
---|---|
Drebin | 180 |
Mobile-Sandbox | 210 |
Total | 280 |
In total, we collect 280 distinct malware family names; this is less than the sum of the two sources because some families appear in both. We use these names when mining the papers on Android malware to identify sentences that discuss malicious behaviors. In addition to the concrete family names, we also use the term “malware” and its variants for this purpose.
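As a rough illustration of how sentences that discuss malware could be identified from the name list, the sketch below matches family names and a few generic terms with a regular expression. The helper names `build_malware_pattern` and `malware_sentences` are hypothetical, the set of "malware" variants is an assumption (the page does not list them), and the two family names in the usage example are only sample inputs.

```python
import re


def build_malware_pattern(family_names):
    """Compile a case-insensitive pattern for any family name or a generic malware term."""
    names = [re.escape(name) for name in family_names]
    # Assumed variants of the term "malware"; the actual list is not given on this page.
    generic = [r"malwares?", r"malicious"]
    return re.compile(r"\b(?:" + "|".join(names + generic) + r")\b", re.IGNORECASE)


def malware_sentences(sentences, pattern):
    """Return the sentences that mention a family name or a generic malware term."""
    return [s for s in sentences if pattern.search(s)]
```

For example, with `pattern = build_malware_pattern(["Geinimi", "DroidKungFu"])`, calling `malware_sentences(sentences, pattern)` keeps only the sentences that mention one of those names, "malware", or "malicious".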
We select permissions, intents, and API calls as potential features for malware detection. We collect all the permissions, intents, and API calls from the Android developer documentation (API level 23). We ignore the class name of each feature, because we have found that class names are not mentioned in most papers. Based on the remaining names, we then remove features that are likely to be ambiguous, such as "length". The features are summarized below.
Feature source | count |
---|---|
permissions | 132 |
intents | 189 |
API methods | 11,373 |
Total | 11,694 |
Note that not all of these malware families and features are mentioned in our document corpus, so the number of malware families and features that actually appear in the semantic network is lower than the counts shown above.
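The feature-name preparation described above (dropping class names, then removing ambiguous names) might look like the following sketch. It assumes the input names are dot-qualified, as in the Android documentation; the helper names are hypothetical, and the ambiguity list contains only "length" because that is the only example given here.

```python
# Only "length" is named on this page; a real list would be longer.
AMBIGUOUS_NAMES = {"length"}


def short_name(qualified):
    """Drop the package and class prefix: 'a.b.C.method' -> 'method'."""
    return qualified.rsplit(".", 1)[-1]


def normalize_features(qualified_names, ambiguous=AMBIGUOUS_NAMES):
    """Return the sorted set of short names, minus the ambiguous ones."""
    shorts = {short_name(q) for q in qualified_names}
    return sorted(name for name in shorts if name not in ambiguous)
```

Because class names are dropped, methods with the same name in different classes collapse into a single feature in this sketch.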
FeatureSmith represents the knowledge about malware behavior using a 3-layer semantic network. The data is in a SQLite database, which consists of 3 tables: mal2behav, feat2behav, and behaviorScore, described below.
The mal2behav table describes the links between known malware families and malware behaviors.
Column | Description |
---|---|
malware | The name of the malware family |
behavior | Description of the behavior, which consists of a subject, a verb, and an object. We use "-->" as the delimiter between the three components. For example, "-->delete-->user_data" means that the verb is "delete" and the object is "user data". In this case, the subject is NULL. |
colocation | How many times the malware family and the behavior appear together within a certain window. In our experiment, we set the window to be 3 sentences. |
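The behavior format can be split into its three components with a few lines of code. The sketch below is only an illustration of the "-->" convention described in the table; the `Behavior` type and `parse_behavior` function are hypothetical names, and an empty component is mapped to `None` to represent a NULL subject, verb, or object.

```python
from typing import NamedTuple, Optional

DELIMITER = "-->"


class Behavior(NamedTuple):
    subject: Optional[str]
    verb: Optional[str]
    obj: Optional[str]


def parse_behavior(text: str) -> Behavior:
    """Split 'subject-->verb-->object' into its parts; empty parts become None."""
    parts = text.split(DELIMITER)
    if len(parts) != 3:
        raise ValueError(f"expected 3 components, got {len(parts)}: {text!r}")
    subject, verb, obj = (part if part else None for part in parts)
    return Behavior(subject, verb, obj)
```

For the example from the table, `parse_behavior("-->delete-->user_data")` returns a `Behavior` with no subject, the verb `delete`, and the object `user_data`.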
The feat2behav table describes the links between malware behaviors and app features.
Column | Description |
---|---|
feature | The name of the feature |
behavior | Same format as in the mal2behav table |
colocation | How many times the feature and the behavior appear together within a window of 3 sentences |
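To make the colocation column more concrete, here is one possible way to count co-occurrences within a sentence window. This is a sketch under stated assumptions, not the counting procedure from the paper: it treats a pair of mentions as colocated when both fall inside a span of `window` consecutive sentences, and it detects mentions by plain case-insensitive substring matching, which is a simplification. The function name `colocation_count` is hypothetical.

```python
def colocation_count(sentences, term_a, term_b, window=3):
    """Count sentence pairs (i, j), with term_a in sentence i and term_b in
    sentence j, that fit inside a span of `window` consecutive sentences."""
    a_lower, b_lower = term_a.lower(), term_b.lower()
    a_idx = [i for i, s in enumerate(sentences) if a_lower in s.lower()]
    b_idx = [j for j, s in enumerate(sentences) if b_lower in s.lower()]
    # Two sentences fit in a span of `window` sentences when their indices
    # differ by less than `window`.
    return sum(1 for i in a_idx for j in b_idx if abs(i - j) < window)
```

Other definitions are equally plausible, for example counting at most one co-occurrence per window, and they would yield different numbers on the same text.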
The behaviorScore table describes the initial weight for the behaviors, which we use when learning which behaviors are most closely related to Android malware.
Column | Description |
---|---|
behavior | Same format as in the mal2behav table |
score | The initial weight of the behavior |
You can download the database here.
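As a starting point for exploring the database, the sketch below joins the three tables to list the features linked to a malware family through a shared behavior. The table and column names follow the descriptions above, but the column types in `SCHEMA`, the helper names, and all rows inserted by `demo_connection` are assumptions made up for illustration; they are not taken from the released data.

```python
import sqlite3

# Column types are assumptions; only the table and column names come from this page.
SCHEMA = """
CREATE TABLE mal2behav (malware TEXT, behavior TEXT, colocation INTEGER);
CREATE TABLE feat2behav (feature TEXT, behavior TEXT, colocation INTEGER);
CREATE TABLE behaviorScore (behavior TEXT, score REAL);
"""


def features_for_malware(conn, malware):
    """Return (feature, behavior, score) rows linked to a family via a shared behavior."""
    query = """
        SELECT f.feature, f.behavior, s.score
        FROM mal2behav AS m
        JOIN feat2behav AS f ON f.behavior = m.behavior
        LEFT JOIN behaviorScore AS s ON s.behavior = m.behavior
        WHERE m.malware = ?
        ORDER BY f.feature
    """
    return conn.execute(query, (malware,)).fetchall()


def demo_connection():
    """Build an in-memory database with the three tables and made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO mal2behav VALUES ('Geinimi', '-->send-->sms', 4)")
    conn.execute("INSERT INTO feat2behav VALUES ('SEND_SMS', '-->send-->sms', 7)")
    conn.execute("INSERT INTO feat2behav VALUES ('getDeviceId', '-->steal-->device_id', 2)")
    conn.execute("INSERT INTO behaviorScore VALUES ('-->send-->sms', 0.5)")
    return conn
```

To run the same query on the released data, replace `demo_connection()` with `sqlite3.connect(path)`, where `path` points to the downloaded database file.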
This research was partially supported by the National Science Foundation (grant 5-244780) and by the Maryland Procurement Office (contract H98230-14-C-0127). This website represents the position of the authors and not that of the aforementioned agencies.