Due March 13

Posted: Feb. 4 2014
Last Update: Mar. 13 2014

All coding to be done in R.

Part I: Intro to R (10% of HW1 grade)

For this part you will add to the simple analysis of Baltimore arrests we did in class. I have included the R markdown source file in the class github repository. You will submit this part of HW 1 as a pull request (so we can all see each other's work).

The steps

  1. Fork the course github repository.
  2. Think about what analysis you might want to do.
  3. Update to the latest version of the course repository.
  4. Update the baltimore.Rmd file accordingly. Here is more information on R markdown.
  5. "Knit" a new baltimore.html file. Here is how to do so in Rstudio. See note below.
  6. Repeat step 3.
  7. Push your updated version to github.
  8. Submit your assignment by making a pull request.

The rules

  1. There is a specific template I'm asking you to follow (e.g., use your username as the chunk name to get sane figure names). Please do so.
  2. You can work in groups (at most 4 per group); they do not have to be your semester/project group.
  3. You can't repeat an analysis already submitted by another student/group. If you see that somebody else did what you wanted to do when you updated (step 3 above), you need to come up with something new.
  4. Feel free to use more data from the OpenBaltimore data set if you want.

Notes

  1. For non-Rstudio users you can create the html file as follows (assuming you have installed knitr with install.packages):

library(knitr)
knit2html("baltimore.Rmd")

This should create the html file and the figure directory.

Part II: Probing genome architecture and organization (40% of HW1 grade)

In this part you will start getting used to working with genomic regions: gene models, promoters and CpG islands.

Gene models

In this section you will do some exploratory analysis on gene models in the human genome. To do this you must first obtain a gene annotation (here you will use the UCSC Genome Browser gene models) using Bioconductor:

# Gene models are provided by this library
# you may need to install it with
# source("http://bioconductor.org/biocLite.R")
# biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene")
# it may take a while
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

# reassign to shorter variable name
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# you can extract ranges of interest using
# accessor functions; see the `GenomicFeatures`
# package documentation for how they work

# here are some ranges of interest

# genes: GRanges object
allGenes <- genes(txdb)

# exons: GRanges object
allExons <- exons(txdb)

# transcripts: GRanges object
allTranscripts <- transcripts(txdb)

# exons grouped by gene: GRangesList
exonsByGene <- exonsBy(txdb, by="gene")

# transcripts grouped by gene: GRangesList
transcriptsByGene <- transcriptsBy(txdb, by="gene")

# exons grouped by transcripts: GRangesList
exonsByTx <- exonsBy(txdb, by="tx")

Please answer the following questions using text and a plot or table where appropriate. Include code showing how you calculated each answer. The easiest way to do this is to use R markdown.

Question 1 How many genes are included in this annotation?
Question 2 What is the average number of exons per gene?
Question 3 What is the number of transcripts per gene?
Question 4 What is the exon length distribution?
Question 5 What is the transcript length distribution?
Question 6 What is the gene length distribution?
Question 7 How many genes overlap on opposite strands?
Question 8 Does the gene length distribution differ across strands?
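
As a rough illustration of how these objects are used (a sketch to get you started, not the required write-up; on older Bioconductor versions elementNROWS() is called elementLengths()):

length(allGenes)                  # Question 1: number of genes in the annotation
mean(elementNROWS(exonsByGene))   # Question 2: average number of exons per gene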

Promoters and CpG Islands

Exercise 1 Next you will construct a set of ranges corresponding to promoter regions for all transcripts. To do so you must first obtain the transcription start site (TSS) and strand of each transcript, and then define promoter regions as 1000bp upstream and 200bp downstream from the TSS. Include code in your document that creates promoter regions.
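
One possible shortcut (a sketch; you can also build these ranges by hand from transcripts(txdb) and strand()): GenomicFeatures provides a strand-aware promoters() accessor with exactly these upstream/downstream semantics.

# promoter regions for all transcripts, 1000bp upstream and
# 200bp downstream of each TSS, taking strand into account
proms <- promoters(txdb, upstream=1000, downstream=200)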

Next, download regions corresponding to the CpG islands annotation in the UCSC browser.

# the AnnotationHub package is an excellent resource for genome annotation data
# install with
# biocLite("AnnotationHub")
library(AnnotationHub)
ah <- AnnotationHub()
cgi <- ah$goldenpath.hg19.database.cpgIslandExt_0.0.1.RData

Question 9 What percentage of UCSC CpG islands overlap promoter regions?
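
A hedged sketch of one way to compute this, assuming the cgi and proms objects from above:

# fraction of CpG islands overlapping at least one promoter, as a percentage
mean(overlapsAny(cgi, proms)) * 100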

Sequence characteristics

In this section you will compute the statistics required for creating an HMM to find CpG islands. You will do this only for chromosome 16. First you will need to get the DNA sequence for Human chromosome 16.

# This package has the human genome reference DNA sequence
# install with
# biocLite("BSgenome.Hsapiens.UCSC.hg19")
# (this will really take a while)
library(BSgenome.Hsapiens.UCSC.hg19)

# now get chromosome 16 as a `DNAString` object
chr16 <- Hsapiens[["chr16"]]

The IRanges and Biostrings packages provide a number of utilities that make the calculations you will perform next fairly efficient. These are some important ones:

  1. IRanges::breakInChunks(l,w) defines non-overlapping intervals of width \(w\) that cover a region of length \(l\).
  2. Views(s, r) defines views into DNAString \(s\) given by ranges \(r\); this defines, but does not copy, the substrings of \(s\) delimited by start and end indices.
  3. letterFrequency(v, letters) calculates the number of occurrences of each element of letters in the string views v.
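
As a brief, hedged sketch of how these fit together for the exercises below (the window width is taken from the exercises):

chunks <- breakInChunks(length(chr16), chunksize=256L)   # non-overlapping 256bp windows
v <- Views(chr16, chunks)                                # substring views, no sequence copying
baseCounts <- letterFrequency(v, c("A","C","G","T"))     # per-window A/C/G/T counts
cgCounts <- dinucleotideFrequency(v)[, "CG"]             # per-window CG dinucleotide counts
# note: CG pairs spanning a window boundary are not counted this way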

Exercise 2. Compute the number of A,C,G,T over non-overlapping windows of length L=256.
Exercise 3. Compute the number of CG dinucleotides over non-overlapping windows of length L=256.
Exercise 4. Compute the observed/expected (O/E) ratio over non-overlapping windows of length L=256.

Exercise 5. Plot the GC content (N_C+N_G) inside and outside promoter regions.
Exercise 6. Plot the CpG content (N_CG) inside and outside promoter regions.
Exercise 7. Plot the O/E ratio inside and outside the UCSC CpG islands.
Exercise 8. Finally, make a plot that shows that the distribution of N_CG depends on the GC content (N_C + N_G) in a given window. We saw a plot like this in class.
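
For instance, for Exercise 5, one hedged way to draw the comparison, assuming chunks, baseCounts and proms from the sketches above:

win <- GRanges("chr16", IRanges(start(chunks), end(chunks)))
inProm <- overlapsAny(win, proms)                 # does each window hit a promoter?
gcCount <- baseCounts[, "C"] + baseCounts[, "G"]
boxplot(gcCount ~ inProm, names=c("outside", "inside"),
        ylab="N_C + N_G per window")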

Again, make sure to include code for every exercise.

Part III: CpG island finding with HMMs (50% of HW1 grade)

In this section you will implement the Hidden Markov Model we discussed in class. We will be building the model piece by piece and you will compare the resulting CpG Island regions as you add components to the model. As a warmup, we will begin with two simple models: the Gardiner-Garden and Frommer (GGF) rule, and a simple conditional (mixture) model for CpG rate.

Exercise 1 Compute the statistics you will use for this part using the code you wrote in Part II, but now computing GC content and CpG content over non-overlapping windows of length L = 16.

Using the GGF rule

Define CpG islands using the GGF rule:

Find a set of consecutive segments \(s_1,\ldots,s_K\) where \(K \times L > 200\), such that for all \(s_j\), \(j\in [1,K]\):

\[ \begin{eqnarray} \frac{N_G(s_j)}{L} + \frac{N_C(s_j)}{L} & > & 0.5 \\ \frac{N_{CG}(s_j)/L}{(N_G(s_j)/L) \times (N_C(s_j)/L)} & > & 0.6 \end{eqnarray} \]
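
In code, the two conditions translate directly to per-window tests. A hypothetical sketch, where nC, nG and ncg are per-window count vectors from Exercise 1:

L <- 16
gcContent <- (nC + nG) / L
oeRatio <- (ncg / L) / ((nC / L) * (nG / L))   # observed/expected CpG ratio
passes <- gcContent > 0.5 & oeRatio > 0.6      # windows satisfying both GGF conditions

Candidate islands are then runs of consecutive passing windows spanning more than 200bp.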

Create a GRanges object from the islands you find using the GGF rule.

# assuming the start positions are in vector 'start' and
# end positions are in vector 'end', you can create an object 
# like so:
GGFislands <- GRanges(seqnames="chr16", 
                      ranges=IRanges(start=start, end=end))

Exercise 2 Compare and discuss the overlap between these islands and the UCSC islands you downloaded in Part II.
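
A possible starting point for this comparison (a sketch, assuming GGFislands and cgi from above):

ov <- findOverlaps(GGFislands, cgi)
length(unique(queryHits(ov))) / length(GGFislands)   # fraction of GGF islands hitting a UCSC island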

A simple conditional state (mixture) model

Here we will use a simple conditional state model which captures the bimodality of CpG rate, but doesn't capture dependence between the states of consecutive segments.

We first define state variables \(Y\) as we did in class as indicators of a segment having "high" vs. "baseline" CpG rate:

\[ Y(s) = \begin{cases} 1 & \textrm{if } s \textrm{ has "high" CpG rate} \\ 0 & \textrm{otherwise} \end{cases} \]

and define the conditional model of CpG rate:

\[ [N_{CG}(s) \mid Y(s) = i] \sim \mathrm{Binomial} \left( p(s) \times a_i, L \right), \]

where \(p(s)\) is half the GC-content of \(s\), an approximation of the individual C and G frequencies that will simplify things later on:

\[ p(s) = \frac{1}{2} \frac{N_C(s) + N_G(s)}{L} \]

This is a mixture model: the distribution of \(N_{CG}(s)\) is a mixture of the "high" CpG rate and "baseline" CpG rate distributions. We can also interpret it as a generative model as follows: for segment \(s\), decide if \(s\) is in the "high" CpG rate state with probability \(\pi_y\), and then generate \(N_{CG}(s)\) according to the models above. To fully specify this model, we need to estimate the parameters \(\pi_y\), \(a_0\) and \(a_1\) from the observed CpG counts \(N_{CG}(s)\) and GC content proportion \(p(s)\). We will choose the parameters that maximize the likelihood, under this mixture model, of the observed counts.

We will use a modification of the Expectation Maximization algorithm, an iterative method we will study in more detail later in the course. The overall scheme is the following:

1. Initialize parameters
2. Iterate until convergence
   a. Maximize likelihood w.r.t. Y
   b. Maximize likelihood w.r.t. parameters

Steps (a) and (b) require a likelihood function of data \(\{N_{CG}(s), Y(s)\}\) given parameters \(\{\pi_y,a_0,a_1\}\). We'll use this to maximize likelihood as a function of parameters, so we'll write it as

\[ L(\pi_y,a_0,a_1; N_{CG}(s), Y(s)) = \prod_s \left[ \pi_y f(a_1; N_{CG}(s)) \right]^{Y(s)} \times \left[(1-\pi_y) f(a_0; N_{CG}(s)) \right]^{(1-Y(s))} \]

where \(f\) is the binomial density function:

\[ f(a_i; N_{CG}(s)) = {{L} \choose {N_{CG}(s)}} (p(s)a_i)^{N_{CG}(s)} (1-p(s)a_i)^{L-N_{CG}(s)} \]

The product over segments \(s\) results from assuming segments are independent. Notice that since \(Y(s)\) is 0 or 1, each factor is the binomial density function for the corresponding model weighted by the state probability. For instance, if \(Y(s)=1\), the corresponding term on the right-hand side is

\[ \pi_y f(a_1; N_{CG}(s)) \]

Step (a) assigns \(Y(s)\) for each segment, i.e., specifies if the segment has "high" or "baseline" CpG rate. From the note above, we can maximize each term in the product independently by comparing the two models for the corresponding segment (which we can do since we have a specific value of the model parameters) and choose the model that best fits each segment. Set \(Y(s)=1\) if

\[ \pi_y f(a_1; N_{CG}(s)) > (1-\pi_y) f(a_0; N_{CG}(s)) \]

and \(Y(s)=0\) otherwise.
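
As a hedged sketch of this comparison (assuming current estimates pi_y, a1, a0, the count vector ncg, window length L, and the dampen() utility from the implementation note below):

# compare the two weighted log-densities for each segment
logp1 <- log(pi_y) + dbinom(ncg, size=L, prob=dampen(a1), log=TRUE)
logp0 <- log(1 - pi_y) + dbinom(ncg, size=L, prob=dampen(a0), log=TRUE)
y <- as.integer(logp1 > logp0)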

For (b) we need to choose parameters \(\pi_y\), \(a_0\) and \(a_1\). For \(\pi_y\) we use the proportion of segments where \(Y(s)=1\), i.e. \(\frac{1}{S} \sum_s Y(s)\) where \(S\) is the number of segments.
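
In code, this update is one line (a sketch, with y the current assignment vector):

pi_y <- mean(y)   # proportion of segments with Y(s) = 1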

To estimate \(a_1\), we maximize \(L\) with respect to \(a_1\). We can see from the likelihood function that only the terms for segments where \(Y(s)=1\) involve \(a_1\). Also, it will be easier to maximize the log of \(L\), so we need to optimize the following function:

\[ \sum_{s:\, Y(s) = 1} \log \pi_y + \log f(a_1; N_{CG}(s)) \]

The first term in the sum does not depend on \(a_1\), so the function to optimize is:

\[ \sum_{s: \, Y(s) =1} \left[ \log{{L} \choose {N_{CG}(s)}} + N_{CG}(s) \log{ \frac{p(s) a_1}{1-p(s) a_1} } + L \log{(1-p(s) a_1)} \right] \]

Again, the first term is a constant that does not depend on \(a_1\), and the second term follows from a little algebra. Later in the course we will derive closed-form solutions for this kind of problem, but in this project we will use numerical optimization, which R provides via the function optimize. In summary, to implement step (b) for \(a_1\) you would use optimize to minimize the function

\[ f(a_1) = -\sum_{s: \, Y(s)=1} \left[ N_{CG}(s) \log{ \frac{p(s) a_1}{1-p(s) a_1} } + L \log{ (1-p(s) a_1) } \right]. \]

One note about the implementation: this function is ill-defined when \(p(s) a_1 = 1\) or \(p(s) a_1 = 0\), so to avoid these issues you should define a utility function similar to this:

# clip p(s)*a away from 0 and 1; `p` is taken from the enclosing scope
dampen <- function(a, tol=1e-6) pmax(tol, pmin(1-tol, p*a))

where p is \(p(s)\). Estimate \(a_0\) the same way, but using segments where \(Y(s)=0\).
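
Putting step (b) together, a hedged sketch (the search interval and the names ncg, p, y are assumptions here):

# negative log-likelihood in a, dropping constant terms
negll <- function(a, ncg, p, L) {
  pa <- pmax(1e-6, pmin(1 - 1e-6, p * a))   # same clipping idea as dampen()
  -sum(ncg * log(pa / (1 - pa)) + L * log(1 - pa))
}

a1 <- optimize(negll, interval=c(1e-6, 2),
               ncg=ncg[y == 1], p=p[y == 1], L=L)$minimum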

To initialize parameters, use (the somewhat arbitrary) values: \(\pi_y=0.2\), \(a_1=0.6\), \(a_0=0.05\). Stop the iterative procedure when the assignments \(Y(s)\) stop changing, or you reach a maximum number of iterations, say 100.

Implement this method in function simpleMixtureCGI as follows:

# implements a simple mixture model for CpG island finding
# arguments
#   ncg: vector of CG counts in segments (N_CG(s))
#   p: CG proportion in segments (p(s))
#   L: segment (window) length
#   maxit: maximum number of iterations
#
# returns a list with following components
#   y: state assignments for each segment
#   pi: estimated state probability 
#   a1: O/E ratio for "high" CpG rate regions
#   a0: O/E ratio for "baseline" CpG rate regions
simpleMixtureCGI <- function(ncg, p, L, maxit = 100) {
   # initialize parameters
   # repeat until convergence or maximum number of iterations is reached
     # assign variables y
     # update parameter estimates

   # return final states y, and estimates pi, a1, a0        
}

Finally write a function that segments the genome based on state variables \(Y(s)\):

# implements genome segmentation
# arguments:
#   y: state assignments, vector of 1 and 0
#   L: segment length used to estimate states
#   minSegmentLength: minimum genomic length of segments returned
#
# returns:
#   GRanges object for contiguous segments where y=1 with length greater than minSegmentLength
segmentGenome <- function(y, L, minSegmentLength = 200) {
          
}
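
As a hint (not a full solution), base R's rle() makes finding runs of consecutive \(y=1\) windows straightforward:

# run-length encoding of the state vector
r <- rle(y)
runEnds <- cumsum(r$lengths)
runStarts <- runEnds - r$lengths + 1
# keep runs with r$values == 1 and r$lengths * L > minSegmentLength,
# then map window indices back to genomic coordinates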

Find CpG Islands using the above functions and compare them to the UCSC regions: how much do they overlap? Are the regions you find with this method shorter or longer? How does the \(a_1\) parameter estimated with this method compare to the parameter used in the rule above? The GGF rule above uses GC-content to define CpG Islands but simpleMixtureCGI does not. How would you change simpleMixtureCGI or segmentGenome to incorporate it?

An HMM for CpG Island finding

In this section you will extend your code above to implement an HMM model that captures dependence between neighboring genomic segments. The core idea is that we now formulate a probability model for state transitions, e.g., \(P(Y(s_{l+1}) = 1 | Y(s_{l})=1) = \pi_y^1\) and \(P(Y(s_{l+1})=1 | Y(s_l) = 0) = \pi_y^0\).

Our log-likelihood function expands to accommodate these transition probabilities:

\[ \begin{eqnarray} \log L(\pi_y,\pi_y^1,\pi_y^0,a_0,a_1; N_{CG}(s), Y(s)) & = & Y(s_0) \left[ \log{\pi_y} + \log{f(a_1; N_{CG}(s_0))} \right] \\ & + & (1-Y(s_0)) \left[ \log{(1-\pi_y)} + \log{f(a_0; N_{CG}(s_0))} \right] \\ & + & \sum_{l=1}^S \left\{ Y(s_l)Y(s_{l-1}) \left[ \log{\pi_y^1} + \log{f(a_1; N_{CG}(s_l))} \right] \right. \\ & & +\, (1-Y(s_l))Y(s_{l-1}) \left[ \log{(1-\pi_y^1)} + \log{f(a_0; N_{CG}(s_l))} \right] \\ & & +\, Y(s_l)(1-Y(s_{l-1})) \left[ \log{\pi_y^0} + \log{f(a_1; N_{CG}(s_l))} \right] \\ & & \left. +\, (1-Y(s_l))(1-Y(s_{l-1})) \left[ \log{(1-\pi_y^0)} + \log{f(a_0; N_{CG}(s_l))} \right] \right\} \end{eqnarray} \]

We still use the same framework: we set state variables \(Y(s)\) in step (a), and estimate model parameters in step (b). Consider step (b) first. You can show that the maximization problem to estimate parameters \(a_0\) and \(a_1\) is unchanged. To estimate the new parameters \(\pi_y^i\) we again use the observed proportion of times each transition occurs in the current setting of state variables \(Y(s)\).
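
For instance, a sketch of those transition estimates given current assignments y (names hypothetical):

prev <- y[-length(y)]
curr <- y[-1]
pi_y1 <- mean(curr[prev == 1])   # estimate of P(Y(s_{l+1}) = 1 | Y(s_l) = 1)
pi_y0 <- mean(curr[prev == 0])   # estimate of P(Y(s_{l+1}) = 1 | Y(s_l) = 0)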

To find the setting of state variables \(Y(s)\) we need to consider all possible state sequences \(\{Y(s_0),Y(s_1),\ldots,Y(s_S)\}\), which are no longer independent as they were in the model you used in the previous part. However, the log-likelihood function factorizes by segment and the \(\max\) operation distributes over the sum across segments, suggesting that a dynamic programming procedure can be used (in fact, this is the famous Viterbi algorithm). Construct a dynamic programming table \(d\) where entry \(d_{ij}\) corresponds to the maximum log-likelihood over all state sequences of length \(j+1\) that end in state \(Y(s_j) = i\). To initialize, set \(d_{10} = \log{\pi_y} + \log{f(a_1; N_{CG}(s_0))}\) and \(d_{00} = \log{(1-\pi_y)} + \log{f(a_0; N_{CG}(s_0))}\). Then calculate \[ d_{1,j+1} = \max_{i} \left\{ \log{ \pi_y^i } + \log{ f(a_1; N_{CG}(s_{j+1})) } + d_{ij} \right\}. \]

This corresponds to extending the maximum likelihood sequences of length \(j\) and choosing the extension that maximizes the resulting log-likelihood. The same holds for \(d_{0,j+1}\), using \(\log{(1-\pi_y^i)}\) and \(f(a_0; \cdot)\). To reconstruct the optimal state sequence you need to backtrack from the last column of table \(d\).
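
A minimal sketch of the table bookkeeping (details intentionally left to you; names are assumptions):

S <- length(ncg)
d <- matrix(-Inf, nrow=2, ncol=S)               # row 1: state 0, row 2: state 1
back <- matrix(NA_integer_, nrow=2, ncol=S)     # backtracking pointers
# fill column 1 from pi_y, columns 2..S with the recursion above;
# the optimal final state is which.max(d[, S]) - 1, and y is recovered by backtracking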

Write a function to implement this model as follows:

# implements an HMM model for CpG island finding
# arguments
#   ncg: vector of CG counts in segments (N_CG(s))
#   p: CG proportion in segments (p(s))
#   L: segment (window) length
#   maxit: maximum number of iterations
#
# returns a list with following components
#   y: state assignments for each segment
#   pi: estimated state probability 
#   a1: O/E ratio for "high" CpG rate regions
#   a0: O/E ratio for "baseline" CpG rate regions
hmmCGI <- function(ncg, p, L, maxit = 100) {
   # initialize parameters
   # repeat until convergence or maximum number of iterations is reached
     # assign variables y (using dynamic programming)
     # update parameter estimates

   # return final states y, and estimates pi, a1, a0        
}

Use the segmentGenome function from the previous part to again segment human chromosome 16 to find CpG Islands. Compare and discuss the islands resulting from the three models you have now implemented (the GGF rule, simpleMixtureCGI and hmmCGI). You should consider addressing the following in this discussion: length characteristics of the resulting islands, overlap between the different sets of islands, overlap between islands and promoter regions, and distance between the resulting islands and the nearest transcription start sites.

You may find the epivizr package (developed at CBCB) useful to guide your comparison. You can use it to visualize the resulting segmentation and the underlying data. For example, the following code would allow you to visualize your three sets of islands:

# remember to install epivizr with biocLite
# if it's not installed yet
require(epivizr)

# this requires internet access
mgr <- startEpiviz(chr="chr16", start=4000000, end=8000000)

ggfDev <- mgr$addDevice(GGFislands, "GGF Islands")

# simpleMixtureCGI and hmmCGI return state assignments;
# turn them into GRanges with segmentGenome before adding devices
mixtureFit <- simpleMixtureCGI(ncg, p, L, maxit=100)
mixtureIslands <- segmentGenome(mixtureFit$y, L)
mixtureDev <- mgr$addDevice(mixtureIslands, "Mixture Islands")

hmmFit <- hmmCGI(ncg, p, L, maxit=100)
hmmIslands <- segmentGenome(hmmFit$y, L)
hmmDev <- mgr$addDevice(hmmIslands, "HMM Islands")

mgr$service()
# you can navigate around chr16 on your web browser
# run this when done
mgr$stopServer()

You are welcome to include example screenshots from this as part of your analysis.

Handing in:

Part 1 will be graded based on your pull request to the course github repository.

For Part 2, submit files part2.Rmd (or part2.Rnw) and part2.pdf. The former contains code and your discussion for the questions and exercises in Part 2. The latter is the result of knitting your file to pdf (using knitr):

require(knitr)

# knit2pdf compiles .Rnw files; if you wrote part2.Rmd instead,
# knit("part2.Rmd") and convert the resulting .md to pdf (e.g., with pandoc)
knit2pdf("part2.Rnw")

You can optionally include any additional code to support your analysis in file part2.R.

For Part 3, submit the following files:

  1. simpleMixtureCGI.R: An R script containing only the code for the simple mixture model.
  2. segmentGenome.R: An R script containing only the genome segmentation function.
  3. hmmCGI.R: An R script containing only the code for the HMM model.
  4. part3.Rmd (or part3.Rnw): Analysis code and text for Part 3. This will include the code to define CpG Islands using the GGF rule.
  5. part3.pdf: The result of "knitting" the previous file.
  6. part3.R: Optional script containing any additional code to support your analysis.

Please submit all of these files to the Handin server: http://inclass.umiacs.umd.edu/perl/handin.pl?course=cmsc702. It will start accepting submissions in a day or so.

Recall that you can discuss and work on this project with your group partners, but you must submit your own code and analysis.