DESeq2 normalized counts for HIerarchical clustering and PCA

298 views
Skip to first unread message

RV

unread,
Jul 26, 2016, 11:34:17 AM7/26/16
to OpenSequencing
Hi, 

I have been using DESeq2 to perform Differential gene expression analysis and I use a counts matrix obtained from HTseq or feature counts software. I use the following code : 

countdata <- read.delim("counts_all.txt", header=TRUE, row.names=1)
#convert to matrix
countdata
<- as.matrix(countdata)
head
(countdata)
# Assign condition
condition
<- factor(c(rep("Treated", 24), rep("Control", 3)), levels = c("Control","Treated"))
featureData
<- colnames(countdata)


# Analysis with DESeq2 ----------------------------------------------------


library
(DESeq2)
library
(calibrate)
library
(org.Mm.eg.db)
# Create a coldata frame and instantiate the DESeqDataSet. -----------> Step1
coldata
<- data.frame(row.names=colnames(countdata), condition)
dds
<- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~condition)
dds
# Run the DESeq pipeline. The adjusted p-value < 0.1 -------------> Step2
dds
<- DESeq(dds)
# Get differential expression results
res
<- results(dds)
# rlog transformation
# This uses the dds post DESeq
rld<- rlogTransformation(dds, blind=TRUE)
library("pheatmap")
select <- order(rowMeans(counts(dds,normalized=TRUE)),
                decreasing=TRUE)[1:20]

pheatmap(assay(rld)[select,])


This is to do a pairwise comparison between the treatment and control. I have a time series experiment where I need I have 4 different time point, hence I have 24 treatments (replicates at different time points ) and the same 3 controls. I now need to do an hierarchical clustering and a PCA plot of the normalized counts. I know I need to use rlog transform of the counts. My question is , do I need to take the dds from Step 1 (pointed in the code above) or from step 2 ? I am just not able to completely understand the difference between what happens when the counts are converted to a DESeqDataSet and once it goes through the pipeline ? My understanding is once it goes through teh DESeq pipeline there is a contrast between treatment and control which can affect the normalization. Whereas all I need is normalized matrix to perform hierarchical clustering and PCA for genes. and not just samples. 

Following the bioconductor notes, I understand why I need to use rlog transformed counts, but if you can simplify and point me to the pipeline it will be great.

Thanks , 
RV


Thomas Schwarzl

unread,
Jul 26, 2016, 12:15:05 PM7/26/16
to opensequencing
Hi there,

DESeq2 is programmed so it conveniently hides a lot of things what it does. E.g. when you run DESeq(dds) it executes a couple of different functions storing more information in the dds object like size factors (for library size normalization) or dispersion estimates. The original data - the count table - is not modified.

As far as I know, the rlog transformation can make use of the estimated dispersion (which is good). Therefore,  I would recommend to estimate the dispersions first, then run rlog transformation.

So your workflow right now is just fine.

Best,
Tom

PS: Have a look at this the brilliant teachings from our statistician, Bernd Klaus. You will definitely find some useful code examples there:





--
You received this message because you are subscribed to the Google Groups "OpenSequencing" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opensequencin...@googlegroups.com.
Visit this group at https://groups.google.com/group/opensequencing.
For more options, visit https://groups.google.com/d/optout.

RV

unread,
Jul 26, 2016, 1:15:12 PM7/26/16
to OpenSequencing
Hi Tom, 
Thank you very much for validating and for the material. 
Best, 
Raghavee
Reply all
Reply to author
Forward
0 new messages