Due April 1

Posted: Mar 25, 2014
Last Update: Apr. 1, 2014

Empirical Bayes Methods

Consider the simple normal-normal model with prior mean equal to zero we saw in class:

\[ Y_g = \theta_g + \epsilon_g \\ \epsilon_g \sim N(0, \sigma_g^2) \\ \theta_g \sim N(0, \tau^2), \]

where \(\sigma_g^2\), and \(\tau^2\) are known. We proved in class that given a single observation \(y_g\), the posterior distribution \(\theta_g | y_g\) is \(\theta_g | y_g \sim N(\lambda_g y_g, \lambda_g \sigma_g^2)\) where

\[ \lambda_g = \frac{\tau^2}{\sigma_g^2 + \tau^2} \]

Question 1

Prove that given a set of \(m\) observations \(y_{ig}, \, i=1,\ldots,m\), the posterior distribution \(\theta_g | y_{g1},\ldots, y_{gm}\) is \(\theta_g | y_{g1},\ldots, y_{gm} \sim N(\lambda_g \bar{y}_g, \lambda_g \frac{\sigma^2_g}{m})\), where

\[ \lambda_g = \frac{m\tau^2}{\sigma_g^2+m\tau^2} \]

and \(\bar{y}_g=\frac{1}{m} \sum_i y_{gi}\) is the sample mean. Show your derivation. Comment on how the posterior distribution behaves as \(m\) increases (more observations). Note: You can assume that the data distribution is conditionally independent:

\[ \begin{eqnarray} p(y_{g1},\ldots,y_{gm} | \theta_g) & = & \prod_i p(y_{gi} | \theta_g) \\ {} & = & \prod_i \frac{1}{\sqrt{2\pi\sigma_g^2}} \exp{ \left\{ -\frac{1}{2}\frac{(y_{gi}-\theta_g)^2}{\sigma_g^2} \right\} } \end{eqnarray} \]

Question 2a

Now consider an extension of this simple normal-normal model:

\[ Y_g = \theta_g + \epsilon_g \\ \epsilon_g \sim N(0, \sigma_g^2) \\ \theta_g \sim N(\mu, \tau^2), \]

where \(\mu\), \(\sigma_g^2\), and \(\tau^2\) are known. Prove that given a single observation \(y_g\), the posterior distribution \(\theta_g | y_g\) can be written as \(\theta_g | y_g \sim N(\lambda_g y_g + (1-\lambda_g) \mu, \lambda_g \sigma_g^2)\). Show your derivation.

Question 2b

Same as 2a, but given a set of \(m\) observations \(y_{ig}, \, i=1,\ldots,m\), derive the posterior distribution.

Question 3a

Recall general properties of expectations \(E\left[ a+bX + cY \right]=a + b EX + c EY\) and variance \(Var \left[ a + bX + cY \right] = a^2 + b^2 Var \left[ X \right] + c^2 Var \left[ Y \right]\) for \(X\) and \(Y\) independent.

Show that the marginal distribution of \(\bar{Y}_g=\frac{1}{m}\sum_i Y_{ig}\) for the model in question 2b is \(\bar{Y}_g | \mu \sim N(\mu, \frac{\sigma_g^2}{m} + \tau^2)\).


  1. Compute \(E[\bar{Y}_g]\) and \(Var[\bar{Y}_g]\).
  2. For the latter you should use the law of total variance:

\[ Var[\bar{Y}_g] = E[ Var[\bar{Y}_g|\theta_g] ] + Var[ E[\bar{Y}_g|\theta_g] ] \]

  1. As before, you can assume that the \(Y_{ig}\) are conditionally independent given \(\theta_g\)

Question 3b

Using the above, show that an empirical estimate of \(\mu\) can be obtained as a weighted average of the gene-specific sample means:

\[ \hat{\mu} = \frac{\sum_g w_g \bar{y}_g}{\sum_g w_g}. \]

Derive \(w_g\) and explain how this weight behaves as gene-specific variance \(\sigma_g^2\) varies.

Question 3c

In count data, like RNA-seq, there is a relationship between mean and variance (in our notation, \(\theta_g\) and \(\sigma_g^2\) are not independent. Consider the case where \(Var \left[ Y_g | \theta_g \right] = \theta_g^2 \sigma_g^2\). Argue why the estimate for \(\mu\) derived in 3b is biased towards zero.

Question 4

Suppose you have estimated gene-specific means \(\bar{y}_g\), and, using the Empirical-Bayes method from 3b, have also estimated prior mean \(\hat{\mu}\). Now you want to test the evidence for \(E\bar{y}_g \neq 0\) using these estimates. Write down

\[ \log{\frac{p(\bar{y}_g | \mu \neq 0)}{p(\bar{y}_g | \mu = 0)}} \]

as a function of \(\bar{y}_g\), \(\hat{\mu}\) and \(w_g\) (from question 3b) and show that this is a linear function of \(\bar{y}_g\). Suppose \(\hat{\mu}>0\), for which values of \(\bar{y}_g\) do you have more evidence that \(E \bar{y}_g \neq 0\) than \(E \bar{y}_g = 0\)?

Isoform quantification from RNAseq data

Counting reads in one gene

Let's use our your knowledge about gene model annotations and counting to implement Jiang and Wong's method for isoform quantification from RNA-seq data.

The first thing I need to do is determine which genomic regions correspond to which transcript in a specifc gene. Let's start by looking at a specific gene using the data structures you used in HW 1. We first create a set of disjoint genomic regions for this gene, and also grab the gene's annotated transcripts:

txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
theGene <- "FBgn0000008"

exonsByGene <- exonsBy(txdb,by="gene")
theRegions <- disjoin(exonsByGene[[theGene]])

txsByGene <- transcriptsBy(txdb)
theTxsIds <- values(txsByGene[[theGene]])$tx_id

exonsByTranscript <- exonsBy(txdb,by="tx")
theTxs <- exonsByTranscript[theTxsIds]

Question 1 Write code to create a matrix C with number of rows equal to the length of object theRegions and number of columns equal to the length of object theTxs, defined as follows:

\[ C_{ij}=\begin{cases} 1 & \textrm{if transcript j contains region i} \\ 0 & \mathrm{o.w.} \end{cases} \]

Hint: see what the countOverlaps function does. Show the result of your code.

Download aligned read data for a Drosophila RNA-seq sample from http://cbcb.umd.edu/~hcorrada/CMSC858B/Data/thefile_sorted.bam. This file is in BAM format, a community standard format for storing aligned reads. Download also the read index http://cbcb.umd.edu/~hcorrada/CMSC858B/Data/thefile_sorted.bam.bai for efficient region querying.

Load the aligned reads for a region around this gene as follows:

# the entire genomic region overlapping the disjoint gene regions
theRegion <- range(theRegions)

# load reads in genomic region (extended by 100,000 bases to either side)
theBigRegion <- resize(theRegion,width=width(theRegion)+2e5,fix="center")

# there is a mismatch in chromosome names between the transcript annotation we're using
# and the aligned read data, so let's reconcile them
theBigRegion <- keepSeqlevels(theBigRegion,"chr2R")
theBigRegion <- renameSeqlevels(theBigRegion, c("chr2R"="2R"))

# now load aligned reads for this region 
# assuming the BAM file is in the current directory
aln <- readGAlignments("thefile_sorted.bam", param=ScanBamParam(which=theBigRegion))

# change the chromosome names in the aligned reads object
# to match the transcript annotation
chrNames <- seqlevels(aln)
renameVector <- paste("chr",chrNames,sep="")
names(renameVector) <- chrNames
aln <- renameSeqlevels(aln,renameVector)

# there is a slight disagreement on the chromosome lengths between the reference used to align
# reads and the transcript annotation. Let's drop the chrU and chrUextra chromosomes
namesToKeep <- seqlevels(aln)[!seqlevels(aln) %in% c("chrU","chrUextra")]
aln <- keepSeqlevels(aln, namesToKeep)
theRegions <- keepSeqlevels(theRegions, namesToKeep)

# last thing, strand is meaningless for this RNA-seq dataset, so let's make all the overlap
# computation functions ignore it by setting strand to "*"
strand(aln) <- "*"

Now you have an object containing alignments form ~19k reads over this genomic region. Our goal is to determine isoform-specific expression, so we need to count the number reads aligend to each of the regions defined previously :

olaps <- summarizeOverlaps(theRegions, aln)
counts <- assay(olaps)[,1]

The result is a vector of read counts per genomic region. Now, considering these counts and the C matrix you computed in Question 1:

Question 2 Are all three transcripts for this gene expressed? (Answer yes/no/can't tell for each one, and why).

Estimating transcript abundance for one gene

We are using the method of Jiang and Wong to estimate isoform expression. To do this we need to compute a number \(a_{ij}\) for each region \(i\) and transcript \(j\).

Question 3 How is \(a_{ij}\) defined in this model? Write code to compute this for our gene (you can take the reads we loaded as the entire set of reads, i.e. w=length(aln)). Show the output of your code.

Let's now fit the Jiang and Wong model to estimate isoform-specific expression.

Question 4 Write down the optimization problem to solve in terms of matrix \(A\) computed above. Use \(\theta\) as the parameters to estimate.

Now let's use a standard numerical optimization method to solve the problem accessible through the constrOptim function in R. Look at the documentation for this function.

Question 5 Write function loglik <- function(theta, counts, A){...} that takes a vector of parameters theta as argument and computes the log-likelihood of the count data using matrix \(A\) given current set of parameters theta.

Question 6 Write function loglikGrad <- function(theta, counts, A) {...} that takes a vector parameters theta as argument and computes the gradient of the log-likelihood at current value of parameters theta.

Question 7 Write a function estimateTheta <- function(counts, A) {..} that uses the loglik and loglikGrad functions to compute the Maximum Likelihood Estimate of theta. You need to set an initial set of parameters to pass to constrOptim. Run the estimateTheta function on the counts and \(A\) matrix computed for our gene. Does the result theta correspond to your intuition of how each transcript is expressed (Question 2), why or why not?

# note, to calculate abundance in RPKM (Reads per Kilobase per Million) scale use theta*10^9

Getting the read data

Ignore this if you want: this is how I got the aligned reads from modEncode:

#Download data from modEncode
$ wget ftp://data.modencode.org/all_files/dmel-signal-1/2030_Pupae_accepted_hits.sam.gz

#Change name work file
$ cp 2030_Pupae_accepted_hits.sam.gz thefile.sam.gz

#Convert to BAM
$ samtools view -bS thefile.sam.gz > thefile.bam

#Sort and index the file
$ samtools sort thefile.sam.gz thefile_sorted
$ samtools index thefile_sorted.bam

Handing in

Submit the following files through the submission server http://inclass.umiacs.umd.edu/perl/handin.pl?course=cmsc702: