How best to do batch correction?

BigOmics Analytics Team

Mar 1, 2023, 11:18:42 AM

to Omics Playground

[From Marian M., email of 1.3.2023]

Basically, we performed mRNA sequencing of primary mouse mast cells, WT and with a gene-specific knockout. All samples are in quadruplicate and we would like to do differential gene expression analysis for both setups. The samples are labeled 1-4, according to the experiment batch. From the initial PCA, we noticed that our samples cluster to some extent by experiment 'batch'. So we would like to correct for this and see whether it improves the data.

I tried to do this in Omics Playground (a supervised correction based on the batch, plus an unsupervised correction with PCA) and it seems to have improved our data a lot. Please see the attached example. Since we are not sure exactly how the batch correction works, and I am not confident in my knowledge of this, we would like to ask how exactly it works and which parameters are best to use without introducing too much bias into our data. In addition to the supervised correction, should I add an unsupervised method such as PCA? And what is the difference between the available options (PCA, SVA, NNM)? I noticed that the clustering changes in some ways depending on which unsupervised correction I apply. I hope you can help me understand this function and make the most of our analysis. Thank you!

Best regards,

Marian

BigOmics Analytics Team

Mar 1, 2023, 11:26:02 AM

to Omics Playground

Hi Marian,

You're asking the right questions and already arriving at the right answers. It does seem that your experiments need batch correction. In general, be very careful with batch correction because it alters your data, but in many practical cases it is necessary. Be aware that with batch correction you are not creating bias; you are trying to remove it. In your case the data is already biased by design because of the batch effects. As a rule, correct as little as possible, but enough to rescue your data.

1. Start with supervised correction, using 1 batch variable and 1 model parameter.

2. Include more batch variables if needed. Try cell cycle, total counts, etc.

3. As a last resort, add a (semi-)unsupervised method. These are very strong methods and should be used with great care. PCA correction subtracts the higher-order principal components that are correlated with the batch vector. SVA (surrogate variable analysis, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3307112/) is somewhat magical but still an accepted method. NNM (nearest neighbour matching) is one of our own, as yet unpublished, methods; it tries to match control and reference in an unsupervised way.

Our supervised method is based on linear regression using removeBatchEffect (limma R package). Another commonly used method is ComBat, but it cannot remove continuous covariates, so we prefer limma. In practice the two perform very similarly.
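To make the regression idea a bit more concrete, here is a minimal Python sketch for a single gene. It is not limma's code and not what Omics Playground runs internally; the function name remove_batch_means and the toy numbers are made up for illustration. It assumes log-scale expression values and a balanced design (every batch contains the same mix of WT and KO samples), in which case regressing out a categorical batch factor should reduce to shifting each batch to a common level.

```python
from statistics import mean

def remove_batch_means(values, batches):
    """Shift each batch of one gene's values to a common level.

    Hypothetical helper for illustration only. For a balanced design this
    mirrors regressing out a categorical batch factor: each sample loses its
    batch mean and gains the mean of all batch means.
    """
    levels = sorted(set(batches))
    # Mean expression of this gene within each batch.
    batch_mean = {
        b: mean([v for v, vb in zip(values, batches) if vb == b])
        for b in levels
    }
    # Common level that every batch is shifted to.
    grand = mean(list(batch_mean.values()))
    return [v - batch_mean[b] + grand for v, b in zip(values, batches)]

# Toy gene: WT level 5, KO level 7, batch offsets 0, +1, -1, +2 (invented numbers).
genotype = ["WT"] * 4 + ["KO"] * 4
batches = [1, 2, 3, 4, 1, 2, 3, 4]
values = [5.0, 6.0, 4.0, 7.0, 7.0, 8.0, 6.0, 9.0]

corrected = remove_batch_means(values, batches)
for g, b, v in zip(genotype, batches, corrected):
    print(g, b, v)
# All WT samples become 5.5 and all KO samples 7.5:
# the batch offsets are gone and the KO-WT difference of 2 is kept.
```

Note that this only works cleanly because each batch holds one WT and one KO sample. If batch and genotype were confounded (for example, all KO samples in one batch), the same subtraction would also remove part of the biological signal, which is why the biological design should be given to the regression alongside the batch variable.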

Best

Ivo

BigOmics Team
