Pearson Correlation - Differential Expression Analysis

Neeraja Balasubrahmaniam

Nov 20, 2023, 9:52:23 AM

to trinityrnaseq-users

Hi Brian,

I have two questions about the Pearson correlation that is computed for the DE genes (based on the correlation plot and the .corr data). I am assuming it uses the TMM matrix after a log2 transformation and row centering.

1) Is there a reason why we do a log2 transformation on the already normalized TMM matrix?

2) I know row centering is done for PCA but how does doing this benefit Pearson correlations here?

Any insights are appreciated as always!

Best,

Neeraja

Brian Haas

Nov 20, 2023, 10:08:04 AM

to Neeraja Balasubrahmaniam, trinityrnaseq-users

Hi,

Responses below.

On Mon, Nov 20, 2023 at 9:52 AM Neeraja Balasubrahmaniam <neeraj...@gmail.com> wrote:

1) Is there a reason why we do a log2 transformation on the already normalized TMM matrix?

Expression values for transcripts are spread out from low to high across many orders of magnitude. The log2 transformation compresses that range so that the largest values don't dominate the calculations as much. A mathematician would have a better explanation for this.
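To make the point about dynamic range concrete, here is a minimal R sketch on made-up numbers. The gene names, sample names, and values are all hypothetical and are not taken from any Trinity output. It compares the Pearson correlation between two samples on the raw scale and after a log2(x+1) transformation, the same form that appears in the script quoted later in this thread.

```r
# Hypothetical toy data: 5 genes (rows) x 2 samples (columns).
# gene5 is expressed several orders of magnitude above the others.
expr <- matrix(c(
      1,     8,
      4,     2,
     16,     3,
      2,    32,
  10000, 12000
), ncol = 2, byrow = TRUE,
dimnames = list(paste0("gene", 1:5), c("sampleA", "sampleB")))

# Pearson correlation on the raw values: gene5 drives almost all of it.
raw_cor <- cor(expr[, "sampleA"], expr[, "sampleB"], method = "pearson")

# Same comparison without gene5: the remaining genes disagree between samples.
raw_cor_no_top <- cor(expr[1:4, "sampleA"], expr[1:4, "sampleB"], method = "pearson")

# Pearson correlation after log2(x + 1): the low-expression genes now
# contribute more to the result.
log_expr <- log2(expr + 1)
log_cor <- cor(log_expr[, "sampleA"], log_expr[, "sampleB"], method = "pearson")

print(c(raw = raw_cor, raw_without_gene5 = raw_cor_no_top, log2 = log_cor))
```

On the raw scale the single highly expressed gene makes the two samples look almost perfectly correlated, even though the other four genes are negatively correlated between them. After the log transformation that one gene still matters, but the disagreement among the lower-expressed genes pulls the correlation down.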

2) I know row centering is done for PCA but how does doing this benefit Pearson correlations here?

We don't center the data for the Pearson correlation, only the PCA.

Hope this helps,

B

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

Neeraja Balasubrahmaniam

Nov 21, 2023, 7:40:47 AM

to Brian Haas, trinityrnaseq-users

Thanks so much Brian!

Neeraja Balasubrahmaniam

Nov 21, 2023, 7:40:52 AM

to Brian Haas, trinityrnaseq-users

Hi Brian,

Sorry for the follow-up question, but I just looked at the diffExpr.P_C.matrix.R code, and it looks like it does use the centered data for the Pearson correlations too. Or am I misreading something? See the snippet below; the full script is attached:

***Start of diffExpr.P0.001_C2.matrix.R code part

initial_matrix = data # store before doing various data transformations
data = log2(data+1)
sample_factoring = colnames(data)
for (i in 1:nsamples) {
    sample_type = sample_types[i]
    replicates_want = sample_type_list[[sample_type]]
    sample_factoring[ colnames(data) %in% replicates_want ] = sample_type
}
sampleAnnotations = matrix(ncol=ncol(data),nrow=nsamples)
for (i in 1:nsamples) {
    sampleAnnotations[i,] = colnames(data) %in% sample_type_list[[sample_types[i]]]
}
sampleAnnotations = apply(sampleAnnotations, 1:2, function(x) as.logical(x))
sampleAnnotations = sample_matrix_to_color_assignments(sampleAnnotations, col=sample_colors)
rownames(sampleAnnotations) = as.vector(sample_types)
colnames(sampleAnnotations) = colnames(data)
data = as.matrix(data) # convert to matrix

# Centering rows
data = t(scale(t(data), scale=F))

write.table(data, file="diffExpr.P0.001_C2.matrix.log2.centered.dat", quote=F, sep=' ');
if (nrow(data) < 2) { stop("\n\n**** Sorry, at least two rows are required for this matrix.\n\n");}
if (ncol(data) < 2) { stop("\n\n**** Sorry, at least two columns are required for this matrix.\n\n");}
sample_cor = cor(data, method='pearson', use='pairwise.complete.obs')
write.table(sample_cor, file="diffExpr.P0.001_C2.matrix.log2.centered.sample_cor.dat", quote=F, sep=' ')
sample_dist = dist(t(data), method='euclidean')
hc_samples = hclust(sample_dist, method='complete')
pdf("diffExpr.P0.001_C2.matrix.log2.centered.sample_cor_matrix.pdf")
sample_cor_for_plot = sample_cor
if (is.null(hc_samples)) { RowV=NULL; ColV=NULL} else { RowV=as.dendrogram(hc_samples); ColV=RowV }
heatmap.3(sample_cor_for_plot, dendrogram='both', Rowv=RowV, Colv=ColV, col = myheatcol, scale='none', symm=TRUE, key=TRUE,density.info='none', trace='none', symkey=FALSE, symbreaks=F, margins=c(10,10), cexCol=1, cexRow=1, cex.main=0.75, main=paste("sample correlation matrix\n", "diffExpr.P0.001_C2.matrix.log2.centered") )
dev.off()
gene_cor = NULL

***End of diffExpr.P0.001_C2.matrix.R code part

Thanks!

Neeraja

diffExpr.P0.001_C2.matrix.R

Brian Haas

Nov 21, 2023, 9:42:30 AM

to Neeraja Balasubrahmaniam, trinityrnaseq-users

Oh, I see.

I thought you were running this:
https://github.com/trinityrnaseq/trinityrnaseq/wiki/QC-Samples-and-Biological-Replicates#compare-replicates-for-each-of-your-samples

Instead, when the analyse_diff_results script is run, it automatically generates a heatmap and a correlation plot based on just the DE features, and both are computed from the log2-transformed, row-centered expression values.

That's fine. For the correlation there, we're just interested in the correlation of expression profiles. Since this part is limited to those features that are detected as DE, it's just confirmatory.
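As a rough illustration of what row centering does to the sample correlations, here is a minimal R sketch on made-up log2 values. The sample names, gene names, and numbers are hypothetical. It uses the same t(scale(t(data), scale=F)) centering call as the script snippet quoted earlier in the thread and compares cor() before and after centering.

```r
# Hypothetical log2 expression values: 4 DE genes x 4 samples
# (two replicates each of two made-up conditions).
log_expr <- matrix(c(
  10.0, 10.2, 12.0, 12.1,
   2.0,  2.1,  4.0,  4.2,
   8.0,  7.9,  6.0,  6.1,
  14.0, 14.1, 12.0, 11.9
), nrow = 4, byrow = TRUE,
dimnames = list(paste0("gene", 1:4),
                c("condA_rep1", "condA_rep2", "condB_rep1", "condB_rep2")))

# Row centering: subtract each gene's mean across samples.
centered <- t(scale(t(log_expr), scale = FALSE))

# Sample-by-sample Pearson correlation, before and after centering.
cor_uncentered <- cor(log_expr, method = "pearson", use = "pairwise.complete.obs")
cor_centered   <- cor(centered, method = "pearson", use = "pairwise.complete.obs")

print(round(cor_uncentered, 3))
print(round(cor_centered, 3))
```

In this toy example the uncentered correlation between conditions is high because every sample shares the same per-gene baseline levels. After centering, replicates of the same condition stay positively correlated while samples from different conditions become negatively correlated. That is the sense in which the centered correlation reflects expression profiles rather than absolute expression levels.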