DESeq2 normalized counts for HIerarchical clustering and PCA

RV

unread,

Jul 26, 2016, 11:34:17 AM7/26/16

to OpenSequencing

Hi,

I have been using DESeq2 to perform Differential gene expression analysis and I use a counts matrix obtained from HTseq or feature counts software. I use the following code :

countdata <- read.delim("counts_all.txt", header=TRUE, row.names=1)
#convert to matrix
countdata <- as.matrix(countdata)
head(countdata)
# Assign condition 
condition <- factor(c(rep("Treated", 24), rep("Control", 3)), levels = c("Control","Treated"))
featureData <- colnames(countdata)


# Analysis with DESeq2 ----------------------------------------------------


library(DESeq2)
library(calibrate)
library(org.Mm.eg.db)
# Create a coldata frame and instantiate the DESeqDataSet. -----------> Step1 
coldata <- data.frame(row.names=colnames(countdata), condition)
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~condition)
dds
# Run the DESeq pipeline. The adjusted p-value < 0.1 -------------> Step2 
dds <- DESeq(dds)
# Get differential expression results
res <- results(dds)
# rlog transformation
 # This uses the dds post DESeq
rld<- rlogTransformation(dds, blind=TRUE)
library("pheatmap")
select <- order(rowMeans(counts(dds,normalized=TRUE)),
                decreasing=TRUE)[1:20]

pheatmap(assay(rld)[select,])

This is to do a pairwise comparison between the treatment and control. I have a time series experiment where I need I have 4 different time point, hence I have 24 treatments (replicates at different time points ) and the same 3 controls. I now need to do an hierarchical clustering and a PCA plot of the normalized counts. I know I need to use rlog transform of the counts. My question is , do I need to take the dds from Step 1 (pointed in the code above) or from step 2 ? I am just not able to completely understand the difference between what happens when the counts are converted to a DESeqDataSet and once it goes through the pipeline ? My understanding is once it goes through teh DESeq pipeline there is a contrast between treatment and control which can affect the normalization. Whereas all I need is normalized matrix to perform hierarchical clustering and PCA for genes. and not just samples.

Following the bioconductor notes, I understand why I need to use rlog transformed counts, but if you can simplify and point me to the pipeline it will be great.

Thanks ,

RV

Thomas Schwarzl

unread,

Jul 26, 2016, 12:15:05 PM7/26/16

to opensequencing

Hi there,

DESeq2 is programmed so it conveniently hides a lot of things what it does. E.g. when you run DESeq(dds) it executes a couple of different functions storing more information in the dds object like size factors (for library size normalization) or dispersion estimates. The original data - the count table - is not modified.

As far as I know, the rlog transformation can make use of the estimated dispersion (which is good). Therefore, I would recommend to estimate the dispersions first, then run rlog transformation.

So your workflow right now is just fine.

Best,

Tom

PS: Have a look at this the brilliant teachings from our statistician, Bernd Klaus. You will definitely find some useful code examples there:

http://www-huber.embl.de/users/klaus/teaching.html#differential_expression_analysis_of_rna-seq_–_predoc_course_2014

--
You received this message because you are subscribed to the Google Groups "OpenSequencing" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opensequencin...@googlegroups.com.
Visit this group at https://groups.google.com/group/opensequencing.
For more options, visit https://groups.google.com/d/optout.

RV

unread,

Jul 26, 2016, 1:15:12 PM7/26/16

to OpenSequencing

Hi Tom,

Thank you very much for validating and for the material.

Best,

Raghavee

Reply all

Reply to author

Forward