I have been using DESeq2 to perform Differential gene expression analysis and I use a counts matrix obtained from HTseq or feature counts software. I use the following code :
countdata <- read.delim("counts_all.txt", header=TRUE, row.names=1)
#convert to matrix
countdata <- as.matrix(countdata)
head(countdata)
# Assign condition
condition <- factor(c(rep("Treated", 24), rep("Control", 3)), levels = c("Control","Treated"))
featureData <- colnames(countdata)
# Analysis with DESeq2 ----------------------------------------------------
library(DESeq2)
library(calibrate)
library(org.Mm.eg.db)
# Create a coldata frame and instantiate the DESeqDataSet. -----------> Step1
coldata <- data.frame(row.names=colnames(countdata), condition)
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~condition)
dds
# Run the DESeq pipeline. The adjusted p-value < 0.1 -------------> Step2
dds <- DESeq(dds)
# Get differential expression results
res <- results(dds)
# rlog transformation
# This uses the dds post DESeq
rld<- rlogTransformation(dds, blind=TRUE)
library("pheatmap")
select <- order(rowMeans(counts(dds,normalized=TRUE)),
decreasing=TRUE)[1:20]
pheatmap(assay(rld)[select,])
This is to do a pairwise comparison between the treatment and control. I have a time series experiment where I need I have 4 different time point, hence I have 24 treatments (replicates at different time points ) and the same 3 controls. I now need to do an hierarchical clustering and a PCA plot of the normalized counts. I know I need to use rlog transform of the counts. My question is , do I need to take the dds from Step 1 (pointed in the code above) or from step 2 ? I am just not able to completely understand the difference between what happens when the counts are converted to a DESeqDataSet and once it goes through the pipeline ? My understanding is once it goes through teh DESeq pipeline there is a contrast between treatment and control which can affect the normalization. Whereas all I need is normalized matrix to perform hierarchical clustering and PCA for genes. and not just samples.
Following the bioconductor notes, I understand why I need to use rlog transformed counts, but if you can simplify and point me to the pipeline it will be great.
Thanks ,
RV