Due May 9

Posted: May 1, 2014 Last Update: May 1, 2014

Phenotype prediction competition

We are running a Kaggle competition for predicting phenotype from genotype. You will receive genotype (about 28K SNPs) and phenotype data for 800 samples, plus genotypes for 200 additional test samples. Your task is to produce phenotype predictions for the 200 test samples. You are free to use any method, new or existing, to do this. You must submit a writeup explaining the method you used to make predictions and how you estimated their accuracy. Please provide citations where appropriate. Bonus grade points will be awarded for the five most accurate sets of predictions. Additional bonus points will be given for the most creative approach.
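
Since the writeup asks how accuracy was estimated, one common option is k-fold cross-validation on the 800 labeled samples. The sketch below is only an illustration, not a required method: the function names, the `fit`/`predict` callables, and the majority-class baseline are hypothetical placeholders, and it assumes plain classification accuracy is the quantity of interest, which may differ from the metric the Kaggle page uses.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle range(n_samples) and deal it into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(features, labels, fit, predict, k=5, seed=0):
    """Average held-out accuracy over k folds.

    `fit(train_features, train_labels)` returns a model and
    `predict(model, test_features)` returns a list of labels;
    both are supplied by the caller (hypothetical interface).
    """
    scores = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        predictions = predict(model, [features[i] for i in held_out])
        correct = sum(p == labels[i] for p, i in zip(predictions, held_out))
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


# Placeholder baseline: always predict the most common training label.
def fit_majority(train_features, train_labels):
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, test_features):
    return [model] * len(test_features)
```

Any real classifier can be dropped in by passing its own `fit` and `predict` functions. Because the samples come from two populations, it may also be worth checking that each fold contains both strata.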


You can download all the data for the competition from the Kaggle page. You can also obtain the data here:


The zip file contains four files:

  1. genotypes.csv: SNP data, 800 samples x 28,500 SNPs. The first row contains SNP identifiers, the first column contains sample ids. Genotypes are specified as A/A, A/B, or B/B, corresponding to major/minor status of each allele. You are free to encode this in your classifier any way you want.

  2. phenotypes.csv: Phenotype data, 800 samples x 2 columns. The first row contains variable names, the first column contains sample ids. Variable cc indicates case-control status for each sample (this is simulated data). The stratum variable indicates population (CEPH: Caucasian; JPT+CHB: Japanese and Chinese). Your task is to predict disease status (cc).

  3. annotation.csv: SNP annotation table, 28,500 SNPs x 3 columns. The first row contains variable names, the first column contains SNP identifiers. The variables are chromosome and SNP position. The other two columns give the nucleotides for the major allele (A1) and the minor allele (A2).

  4. test_genotypes.csv: SNP data, 200 samples x 28,500 SNPs. These are the samples you will provide disease state predictions for.
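
As one illustration of the genotype layout described in the file list above, the sketch below reads a genotype table and applies an additive encoding (the number of B alleles: A/A = 0, A/B = 1, B/B = 2). This encoding is an assumption chosen for the example; any other encoding is equally allowed. The function names, the header label `id`, and the tiny inline table are hypothetical, and treating unrecognized calls as missing is a guess rather than something the data description states.

```python
import csv
import io

# Assumed additive coding: count of minor (B) alleles per genotype call.
GENOTYPE_CODES = {"A/A": 0, "A/B": 1, "B/B": 2}


def encode_genotype(call, missing=None):
    """Map a genotype string to 0/1/2; unknown calls become `missing`."""
    return GENOTYPE_CODES.get(call.strip(), missing)


def read_genotypes(handle):
    """Parse a genotype CSV: first row = SNP ids, first column = sample ids."""
    reader = csv.reader(handle)
    header = next(reader)
    snp_ids = header[1:]  # skip the corner cell above the sample ids
    sample_ids, matrix = [], []
    for row in reader:
        if not row:  # tolerate blank lines
            continue
        sample_ids.append(row[0])
        matrix.append([encode_genotype(call) for call in row[1:]])
    return snp_ids, sample_ids, matrix


# Tiny made-up table in the described layout (not real competition data).
example = io.StringIO("id,rs1,rs2,rs3\ns1,A/A,A/B,B/B\ns2,B/B,A/A,A/B\n")
snp_ids, sample_ids, matrix = read_genotypes(example)
```

For the real files, the same function can be called as `read_genotypes(open("genotypes.csv", newline=""))`, and the resulting rows can be matched to phenotypes.csv by sample id.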

What to submit

  1. Predictions: through the Kaggle page

On the handin site:

  2. Writeup: describe the method you used to make predictions. Call this file writeup.pdf.

  3. Code: code you used to get your predictions in a zip file: code.zip