This site provides supplemental information for the paper FeatureSmith: Automatically Engineering Features for Malware Detection by Mining the Security Literature, by Ziyun Zhu and Tudor Dumitraș. The paper describes the design of a system that can generate, without human intervention, features for training machine learning classifiers to detect Android malware. FeatureSmith achieves this by synthesizing the security knowledge described in natural language documents, such as papers published in security conferences and journals.

The key data structure in FeatureSmith is the semantic network, which encodes the knowledge about malware behaviors reflected in our corpus of documents. The semantic network has three types of nodes: malware families, malware behaviors, and app features.

The links among these nodes reflect concepts that are semantically related and allow us to rank the features according to how useful they are for detecting malicious behaviors. For example, here are the top-5 automatically engineered features (you can click the feature to find related behaviors; the format is explained below):

  1. sendTextMessage
  2. SEND_SMS
  3. BOOT_COMPLETED
  4. RECEIVE_SMS
  5. onStart

To stimulate further research into semantics-aware security, we release our semantic network and the automatically generated features.

Sharing and Attribution Terms

We are making our data available to the security community. If you use the FeatureSmith data set in your research, don't link to this page; instead, please cite our paper:

Z. Zhu and T. Dumitraș, "FeatureSmith: Automatically Engineering Features for Malware Detection by Mining the Security Literature," in ACM Conference on Computer and Communications Security (CCS), Vienna, AT, 2016.

Data Sources and Collection Method

Our primary data source consists of scientific papers, which we use to extract Android malware behaviors and to construct the semantic network. From the electronic proceedings distributed to conference participants, we collect the papers from the IEEE Symposium on Security and Privacy (S&P'08–S&P'15), the Computer Security Foundations Symposium (CSF'00–CSF'14), and USENIX Security (Sec'11). We complement this corpus by searching Google Scholar with the keywords “Android malware” and downloading the PDF files whenever the query results provide a download link. This process may produce duplicates if a returned paper already exists in our corpus. We therefore record the hash of every paper in our corpus and remove a PDF document if its file hash already exists in the data set. In total, our corpus includes 1,068 documents. The table below summarizes our document corpus.

Document source    Count
IEEE S&P             465
IEEE CSF             437
USENIX Sec            35
Google Scholar       241
Total              1,068
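
The hash-based duplicate removal described above can be sketched as follows. This is a minimal illustration, not the code used for the paper: the choice of SHA-256 and the function names `file_digest` and `deduplicate` are assumptions, since the text only states that a file hash is recorded.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes (hash choice is an assumption)."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in 1 MiB chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def deduplicate(paths):
    """Keep the first file seen for each distinct content hash."""
    seen = set()
    unique = []
    for path in paths:
        path = Path(path)
        digest = file_digest(path)
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique
```

Note that a file hash only detects byte-identical copies; two different PDF renderings of the same paper would both be kept.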

We extract the text from the papers, which are in PDF format, for later processing. We convert the PDF files to text with the Python package pdfminer, and we develop several heuristics to address problems in the extracted text.

We collect the malware family names from both the Drebin dataset and from a list of malware families caught by the Mobile-Sandbox analysis platform.

Malware source    Count
Drebin              180
Mobile-Sandbox      210
Total               280

In total, we collect 280 malware names. We utilize these names when mining the papers on Android malware to identify sentences that discuss malicious behaviors. In addition to the concrete family names, we also utilize the term “malware” and its variants for this purpose.
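
As a rough illustration of this filtering step, the sketch below keeps only the sentences that mention a known family name or the word "malware". The function names are hypothetical, the example sentences are made up, and treating "malwares" as the only variant of "malware" is an assumption; the text does not specify which variants are matched.

```python
import re


def build_matcher(family_names):
    """Compile a case-insensitive pattern for family names and 'malware'."""
    # Longest names first, so the alternation prefers the longest match.
    names = sorted(set(family_names), key=len, reverse=True)
    alternatives = [re.escape(name) for name in names]
    alternatives.append(r"malwares?")  # assumed variants: "malware", "malwares"
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def malware_sentences(sentences, matcher):
    """Return the sentences that mention malware or a known family."""
    return [s for s in sentences if matcher.search(s)]


if __name__ == "__main__":
    matcher = build_matcher(["DroidKungFu", "Zitmo"])
    examples = [
        "DroidKungFu obtains root privileges.",
        "The app displays a map of nearby restaurants.",
        "Android malware often sends SMS messages.",
    ]
    for sentence in malware_sentences(examples, matcher):
        print(sentence)
```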

We select permissions, intents, and API calls as potential features for malware detection. We collect all the permissions, intents, and API calls from the Android developer documentation (API level 23). We then drop the class name from each feature, because we have found that most papers do not mention class names. Finally, based on the names, we remove some features that could cause ambiguity, such as "length". The features are summarized below.

Feature source    Count
permissions         132
intents             189
API methods      11,373
Total            11,694

Note that not all of these malware families and features are mentioned in our document corpus, so the number that actually appear in our data is lower than the counts shown above.
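
The name normalization described above (dropping the class name, then removing ambiguous names) might look like the sketch below. The helper names and the example inputs are illustrative assumptions; only "length" is named in the text as an ambiguous feature.

```python
def short_name(qualified_name):
    """Drop the package and class prefix, keeping only the last component."""
    return qualified_name.rsplit(".", 1)[-1]


def candidate_features(qualified_names, ambiguous):
    """Return the set of short feature names, minus the ambiguous ones."""
    features = {short_name(name) for name in qualified_names}
    return features - set(ambiguous)


if __name__ == "__main__":
    names = [
        "android.telephony.SmsManager.sendTextMessage",
        "android.permission.SEND_SMS",
        "android.intent.action.BOOT_COMPLETED",
        "java.lang.String.length",
    ]
    print(sorted(candidate_features(names, {"length"})))
```

Because class names are dropped, methods with the same name in different classes collapse into a single feature.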

Semantic Network

FeatureSmith represents the knowledge about malware behavior using a 3-layer semantic network. The data is stored in a SQLite database, which consists of three tables: mal2behav, feat2behav, and behaviorScore, described below.

mal2behav

This table describes the links between known malware families and malware behaviors.

Column      Description
malware     The name of the malware family.
behavior    Description of the behavior, which consists of a subject, a verb, and an object, with "-->" as the delimiter between the three components. For example, "-->delete-->user_data" means that the verb is "delete" and the object is "user data"; in this case, the subject is NULL.
colocation  How many times the malware family and the behavior appear together within a certain window. In our experiments, we set the window to 3 sentences.
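
To make the behavior format concrete, here is a small parsing sketch. The `Behavior` type and `parse_behavior` function are hypothetical helpers, not part of the released data; an empty component is mapped to `None` to mirror the NULL subject in the example above.

```python
from typing import NamedTuple, Optional


class Behavior(NamedTuple):
    subject: Optional[str]
    verb: Optional[str]
    obj: Optional[str]


def parse_behavior(text):
    """Split 'subject-->verb-->object' into its three components."""
    parts = text.split("-->")
    if len(parts) != 3:
        raise ValueError(f"expected three components, got {len(parts)}: {text!r}")
    # An empty component (such as a missing subject) becomes None.
    subject, verb, obj = (part or None for part in parts)
    return Behavior(subject, verb, obj)


if __name__ == "__main__":
    print(parse_behavior("-->delete-->user_data"))
```

The components are returned as stored, so multi-word terms keep their underscores ("user_data").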

feat2behav

This table describes the links between malware behaviors and app features.

Column      Description
feature     The name of the feature.
behavior    Same as in the mal2behav table.
colocation  How many times the feature and the behavior appear together within the same window (3 sentences in our experiments).
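
One possible way to compute such co-occurrence counts is sketched below. The exact counting rule is an assumption: here a feature mention and a behavior mention are counted together when their sentence indices differ by less than the window size (3). The rule used to build the released database may differ, and the function name and input format are hypothetical.

```python
from collections import Counter


def colocation_counts(feature_mentions, behavior_mentions, window=3):
    """Count (feature, behavior) pairs mentioned within `window` sentences.

    Each mention is a (sentence_index, name) tuple. A pair of mentions is
    counted when the two sentence indices differ by less than `window`
    (an assumed definition of "within a 3-sentence window").
    """
    counts = Counter()
    for i, feature in feature_mentions:
        for j, behavior in behavior_mentions:
            if abs(i - j) < window:
                counts[(feature, behavior)] += 1
    return counts
```

For example, a feature mentioned in sentence 0 is counted with behaviors mentioned in sentences 0 through 2, but not with one mentioned in sentence 3.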

behaviorScore

This table describes the initial weight for the behaviors, which we use when learning which behaviors are most closely related to Android malware.

Column      Description
behavior    Same as in the mal2behav table.
score       The initial weight of the behavior.

You can download the database here.
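
As a starting point for exploring the data, the sketch below builds a tiny in-memory database with the three tables described above and looks up the behaviors linked to one feature. The column types and all demo rows are made up for illustration; only the table and column names come from the descriptions on this page. To query the real data, replace `open_demo_db()` with `sqlite3.connect(...)` pointing at the downloaded file.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory database with the documented tables (demo rows are invented)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE mal2behav (malware TEXT, behavior TEXT, colocation INTEGER);
        CREATE TABLE feat2behav (feature TEXT, behavior TEXT, colocation INTEGER);
        CREATE TABLE behaviorScore (behavior TEXT, score REAL);
        """
    )
    conn.executemany(
        "INSERT INTO feat2behav VALUES (?, ?, ?)",
        [
            ("sendTextMessage", "-->send-->sms_message", 4),
            ("sendTextMessage", "-->delete-->user_data", 1),
            ("SEND_SMS", "-->send-->sms_message", 2),
        ],
    )
    conn.executemany(
        "INSERT INTO behaviorScore VALUES (?, ?)",
        [("-->send-->sms_message", 0.8), ("-->delete-->user_data", 0.5)],
    )
    return conn


def behaviors_for_feature(conn, feature):
    """Return (behavior, colocation, score) rows for one feature, most frequent first."""
    query = """
        SELECT f.behavior, f.colocation, s.score
        FROM feat2behav AS f
        LEFT JOIN behaviorScore AS s ON s.behavior = f.behavior
        WHERE f.feature = ?
        ORDER BY f.colocation DESC
    """
    return conn.execute(query, (feature,)).fetchall()


if __name__ == "__main__":
    conn = open_demo_db()
    for row in behaviors_for_feature(conn, "sendTextMessage"):
        print(row)
```

The LEFT JOIN keeps behaviors that have no entry in behaviorScore; their score is returned as NULL (`None` in Python).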

Acknowledgments

This research was partially supported by the National Science Foundation (grant 5-244780) and by the Maryland Procurement Office (contract H98230-14-C-0127). This website represents the position of the authors and not that of the aforementioned agencies.

The following institutions were given access to the data:

  1. University of Illinois at Urbana-Champaign, USA
  2. University of Melbourne, Australia
  3. Shanghai Jiao Tong University, China
  4. University of Virginia, USA
  5. The Hong Kong Polytechnic University, China
  6. Chinese Academy of Sciences, China
  7. Bangladesh University of Engineering & Technology, Bangladesh
  8. University of Maryland, Baltimore County, USA
  9. Daffodil International University, Bangladesh
  10. FAST National University of Computer and Emerging Sciences, Peshawar, Pakistan
  11. University of Technology Sydney, Australia
  12. Arizona State University, USA
  13. University of Luxembourg, Luxembourg
  14. University of British Columbia, Canada
  15. Universiti Teknikal Malaysia Melaka, Malaysia