*Due April 1*

Posted: Mar 25, 2014

Last Update: Apr. 1, 2014

Consider the simple normal-normal model with prior mean equal to zero we saw in class:

\[ Y_g = \theta_g + \epsilon_g \\ \epsilon_g \sim N(0, \sigma_g^2) \\ \theta_g \sim N(0, \tau^2), \]

where \(\sigma_g^2\), and \(\tau^2\) are *known*. We proved in class that given a single observation \(y_g\), the posterior distribution \(\theta_g | y_g\) is \(\theta_g | y_g \sim N(\lambda_g y_g, \lambda_g \sigma_g^2)\) where

\[ \lambda_g = \frac{\tau^2}{\sigma_g^2 + \tau^2} \]

**Question 1**

Prove that given a set of \(m\) observations \(y_{ig}, \, i=1,\ldots,m\), the posterior distribution \(\theta_g | y_{g1},\ldots, y_{gm}\) is \(\theta_g | y_{g1},\ldots, y_{gm} \sim N(\lambda_g \bar{y}_g, \lambda_g \frac{\sigma^2_g}{m})\), where

\[ \lambda_g = \frac{m\tau^2}{\sigma_g^2+m\tau^2} \]

and \(\bar{y}_g=\frac{1}{m} \sum_i y_{gi}\) is the sample mean. Show your derivation. Comment on how the posterior distribution behaves as \(m\) increases (more observations). Note: You can assume that the data distribution is conditionally independent:

\[ \begin{eqnarray} p(y_{g1},\ldots,y_{gm} | \theta_g) & = & \prod_i p(y_{gi} | \theta_g) \\ {} & = & \prod_i \frac{1}{\sqrt{2\pi\sigma_g^2}} \exp{ \left\{ -\frac{1}{2}\frac{(y_{gi}-\theta_g)^2}{\sigma_g^2} \right\} } \end{eqnarray} \]

**Question 2a**

Now consider an extension of this simple normal-normal model:

\[ Y_g = \theta_g + \epsilon_g \\ \epsilon_g \sim N(0, \sigma_g^2) \\ \theta_g \sim N(\mu, \tau^2), \]

where \(\mu\), \(\sigma_g^2\), and \(\tau^2\) are *known*. Prove that given a single observation \(y_g\), the posterior distribution \(\theta_g | y_g\) can be written as \(\theta_g | y_g \sim N(\lambda_g y_g + (1-\lambda_g) \mu, \lambda_g \sigma_g^2)\). Show your derivation.

**Question 2b**

Same as 2a, but given a set of \(m\) observations \(y_{ig}, \, i=1,\ldots,m\), derive the posterior distribution.

**Question 3a**

Recall general properties of expectations \(E\left[ a+bX + cY \right]=a + b EX + c EY\) and variance \(Var \left[ a + bX + cY \right] = a^2 + b^2 Var \left[ X \right] + c^2 Var \left[ Y \right]\) for \(X\) and \(Y\) independent.

Show that the *marginal* distribution of \(\bar{Y}_g=\frac{1}{m}\sum_i Y_{ig}\) for the model in question 2b is \(\bar{Y}_g | \mu \sim N(\mu, \frac{\sigma_g^2}{m} + \tau^2)\).

Hints:

- Compute \(E[\bar{Y}_g]\) and \(Var[\bar{Y}_g]\).
- For the latter you should use the law of total variance:

\[ Var[\bar{Y}_g] = E[ Var[\bar{Y}_g|\theta_g] ] + Var[ E[\bar{Y}_g|\theta_g] ] \]

- As before, you can assume that the \(Y_{ig}\) are conditionally independent given \(\theta_g\)

**Question 3b**

Using the above, show that an empirical estimate of \(\mu\) can be obtained as a weighted average of the gene-specific sample means:

\[ \hat{\mu} = \frac{\sum_g w_g \bar{y}_g}{\sum_g w_g}. \]

Derive \(w_g\) and explain how this weight behaves as gene-specific variance \(\sigma_g^2\) varies.

**Question 3c**

In count data, like RNA-seq, there is a relationship between mean and variance (in our notation, \(\theta_g\) and \(\sigma_g^2\) are not independent. Consider the case where \(Var \left[ Y_g | \theta_g \right] = \theta_g^2 \sigma_g^2\). Argue why the estimate for \(\mu\) derived in 3b is biased towards zero.

**Question 4**

Suppose you have estimated gene-specific means \(\bar{y}_g\), and, using the Empirical-Bayes method from 3b, have also estimated prior mean \(\hat{\mu}\). Now you want to test the evidence for \(E\bar{y}_g \neq 0\) using these estimates. Write down

\[ \log{\frac{p(\bar{y}_g | \mu \neq 0)}{p(\bar{y}_g | \mu = 0)}} \]

as a function of \(\bar{y}_g\), \(\hat{\mu}\) and \(w_g\) (from question 3b) and show that this is a linear function of \(\bar{y}_g\). Suppose \(\hat{\mu}>0\), for which values of \(\bar{y}_g\) do you have more evidence that \(E \bar{y}_g \neq 0\) than \(E \bar{y}_g = 0\)?

Let's use our your knowledge about gene model annotations and counting to implement Jiang and Wong's method for isoform quantification from RNA-seq data.

The first thing I need to do is determine which genomic regions correspond to which transcript in a specifc gene. Let's start by looking at a specific gene using the data structures you used in HW 1. We first create a set of disjoint genomic regions for this gene, and also grab the gene's annotated transcripts:

```
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
theGene <- "FBgn0000008"
exonsByGene <- exonsBy(txdb,by="gene")
theRegions <- disjoin(exonsByGene[[theGene]])
txsByGene <- transcriptsBy(txdb)
theTxsIds <- values(txsByGene[[theGene]])$tx_id
exonsByTranscript <- exonsBy(txdb,by="tx")
theTxs <- exonsByTranscript[theTxsIds]
```

**Question 1** Write code to create a matrix `C`

with number of rows equal to the length of object `theRegions`

and number of columns equal to the length of object `theTxs`

, defined as follows:

\[ C_{ij}=\begin{cases} 1 & \textrm{if transcript j contains region i} \\ 0 & \mathrm{o.w.} \end{cases} \]

Hint: see what the `countOverlaps`

function does. Show the result of your code.

Download aligned read data for a Drosophila RNA-seq sample from http://cbcb.umd.edu/~hcorrada/CMSC858B/Data/thefile_sorted.bam. This file is in BAM format, a community standard format for storing aligned reads. Download also the read index http://cbcb.umd.edu/~hcorrada/CMSC858B/Data/thefile_sorted.bam.bai for efficient region querying.

Load the aligned reads for a region around this gene as follows:

```
library(ShortRead)
# the entire genomic region overlapping the disjoint gene regions
theRegion <- range(theRegions)
# load reads in genomic region (extended by 100,000 bases to either side)
theBigRegion <- resize(theRegion,width=width(theRegion)+2e5,fix="center")
# there is a mismatch in chromosome names between the transcript annotation we're using
# and the aligned read data, so let's reconcile them
theBigRegion <- keepSeqlevels(theBigRegion,"chr2R")
theBigRegion <- renameSeqlevels(theBigRegion, c("chr2R"="2R"))
# now load aligned reads for this region
# assuming the BAM file is in the current directory
aln <- readGAlignments("thefile_sorted.bam", param=ScanBamParam(which=theBigRegion))
# change the chromosome names in the aligned reads object
# to match the transcript annotation
chrNames <- seqlevels(aln)
renameVector <- paste("chr",chrNames,sep="")
names(renameVector) <- chrNames
aln <- renameSeqlevels(aln,renameVector)
# there is a slight disagreement on the chromosome lengths between the reference used to align
# reads and the transcript annotation. Let's drop the chrU and chrUextra chromosomes
namesToKeep <- seqlevels(aln)[!seqlevels(aln) %in% c("chrU","chrUextra")]
aln <- keepSeqlevels(aln, namesToKeep)
theRegions <- keepSeqlevels(theRegions, namesToKeep)
# last thing, strand is meaningless for this RNA-seq dataset, so let's make all the overlap
# computation functions ignore it by setting strand to "*"
strand(aln) <- "*"
```

Now you have an object containing alignments form ~19k reads over this genomic region. Our goal is to determine isoform-specific expression, so we need to count the number reads aligend to each of the regions defined previously :

```
olaps <- summarizeOverlaps(theRegions, aln)
counts <- assay(olaps)[,1]
```

The result is a vector of read counts per genomic region. Now, considering these counts and the `C`

matrix you computed in Question 1:

**Question 2** Are all three transcripts for this gene expressed? (Answer yes/no/can't tell for each one, and why).

We are using the method of Jiang and Wong to estimate isoform expression. To do this we need to compute a number \(a_{ij}\) for each region \(i\) and transcript \(j\).

**Question 3** How is \(a_{ij}\) defined in this model? Write code to compute this for our gene (you can take the reads we loaded as the entire set of reads, i.e. `w=length(aln)`

). Show the output of your code.

Let's now fit the Jiang and Wong model to estimate isoform-specific expression.

**Question 4** Write down the optimization problem to solve in terms of matrix \(A\) computed above. Use \(\theta\) as the parameters to estimate.

Now let's use a standard numerical optimization method to solve the problem accessible through the `constrOptim`

function in R. Look at the documentation for this function.

**Question 5** Write function `loglik <- function(theta, counts, A){...}`

that takes a vector of parameters `theta`

as argument and computes the log-likelihood of the count data using matrix \(A\) given current set of parameters `theta`

.

**Question 6** Write function `loglikGrad <- function(theta, counts, A) {...}`

that takes a vector parameters `theta`

as argument and computes the gradient of the log-likelihood at current value of parameters `theta`

.

**Question 7** Write a function `estimateTheta <- function(counts, A) {..}`

that uses the `loglik`

and `loglikGrad`

functions to compute the Maximum Likelihood Estimate of `theta`

. You need to set an initial set of parameters to pass to `constrOptim`

. Run the `estimateTheta`

function on the counts and \(A\) matrix computed for our gene. Does the result `theta`

correspond to your intuition of how each transcript is expressed (Question 2), why or why not?

`# note, to calculate abundance in RPKM (Reads per Kilobase per Million) scale use theta*10^9`

Ignore this if you want: this is how I got the aligned reads from modEncode:

```
#Download data from modEncode
$ wget ftp://data.modencode.org/all_files/dmel-signal-1/2030_Pupae_accepted_hits.sam.gz
#Change name work file
$ cp 2030_Pupae_accepted_hits.sam.gz thefile.sam.gz
#Convert to BAM
$ samtools view -bS thefile.sam.gz > thefile.bam
#Sort and index the file
$ samtools sort thefile.sam.gz thefile_sorted
$ samtools index thefile_sorted.bam
```

Submit the following files through the submission server http://inclass.umiacs.umd.edu/perl/handin.pl?course=cmsc702:

- part1.pdf: Containing answers to part 1. This can be a scan of a handwritten answer.
- part2.Rmd: R markdown file containing code and text
- part2.pdf: The result of knitting your Rmd file