Hi Everyone,
I have used Salmon to quantify RNA-seq from two batches of ~100 samples each, collected a couple of years apart.
I wish to batch-correct these and use the two datasets in conjunction to call QTLs. What is the best way to do this?
From other posts on this forum (e.g.
https://groups.google.com/forum/#!searchin/sailfish-users/normalization%7Csort:date/sailfish-users/jBf9SGiH1AM/nR2ekF5ECwAJ) I see that people have suggested batch correction via the TMM methods incorporated in tools such as DESeq, or using sleuth to batch-correct.
However, my issue is that, as far as I can see, these tools just build models that let you account for batch effects when calculating something like differential expression. What I actually want is to output a new batch-corrected expression matrix that I can then input into other tools (i.e. QTL detection programmes).
I was thinking of using a tool like ComBat (
https://www.bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf), where you input an expression matrix, specify which samples belong to which batch, and get a corrected matrix out at the end. However, as I understand it, ComBat was built to carry out gene-level batch correction on metrics such as RPKM/FPKM. I wonder whether it would be applicable to transcript-level data, which would be sparser (i.e. the expression would be distributed over a greater number of features, and a larger proportion of features would have low-to-zero expression in the majority of samples).
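To make the "matrix in, corrected matrix out" idea concrete, here is a rough sketch of the kind of adjustment I have in mind. It is not ComBat (there is no empirical Bayes shrinkage and no covariates); it is only a per-transcript location/scale adjustment on log2(TPM + 1) values, written in plain Python. The function name `adjust_batches` and the input layout are made up for illustration.

```python
import math
from statistics import mean, pvariance


def adjust_batches(log_expr, batches):
    """Per-transcript location/scale adjustment (a simplified stand-in, NOT ComBat).

    log_expr: list of rows, one per transcript; each row holds one
              log2(TPM + 1) value per sample.
    batches:  list of batch labels, one per sample (same order as the columns).
    """
    labels = sorted(set(batches))
    # Column indices belonging to each batch.
    groups = {b: [i for i, lab in enumerate(batches) if lab == b] for b in labels}
    corrected = []
    for row in log_expr:
        grand_mean = mean(row)
        # Pooled within-batch variance, weighted by batch size.
        pooled_var = sum(
            len(idx) * pvariance([row[i] for i in idx]) for idx in groups.values()
        ) / len(row)
        pooled_sd = math.sqrt(pooled_var)
        new_row = list(row)
        for idx in groups.values():
            vals = [row[i] for i in idx]
            b_mean = mean(vals)
            b_sd = math.sqrt(pvariance(vals))
            for i in idx:
                if b_sd > 0:
                    # Standardise within the batch, then restore a common scale.
                    new_row[i] = (row[i] - b_mean) / b_sd * pooled_sd + grand_mean
                else:
                    # Flat (e.g. all-zero) transcript in this batch: only a shift applies.
                    new_row[i] = grand_mean
        corrected.append(new_row)
    return corrected


# Toy input: one transcript, six samples, two batches.
tpm = [[1.0, 3.0, 7.0, 15.0, 31.0, 63.0]]
log_expr = [[math.log2(t + 1.0) for t in row] for row in tpm]
corrected = adjust_batches(log_expr, ["a", "a", "a", "b", "b", "b"])
```

The `b_sd > 0` branch is where my sparsity worry shows up: a transcript that is zero in every sample of one batch has no within-batch spread to rescale. Also, values back-transformed with `2**x - 1` would no longer sum to one million per sample, so they would not strictly be TPM any more.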
(One option to reduce the sparsity would be to filter out transcripts of low expression (e.g. those with zero expression in more than 90% of samples). Filtering such as this is also discussed in this post (
https://groups.google.com/forum/#!searchin/sailfish-users/normalization%7Csort:date/sailfish-users/jBf9SGiH1AM/nR2ekF5ECwAJ), and the suggestion seems to be that one could pre-filter counts before batch correction, but that it is not appropriate to pre-filter TPM values: TPM values in each sample are scaled to sum to one million, so removing transcripts changes the scaling of those that remain. The issue here is that I think ComBat was designed to work with RPKM/FPKM rather than counts (and so I assume it would be more comfortable with a metric such as TPM). Ideally I want my output to be in the form of TPM, because I need expression to be scaled by transcript length when calling QTLs.)
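As a rough illustration of the "filter counts first, then compute TPM" route, here is a small plain-Python sketch. The function names, the 10% threshold and the toy numbers are my own assumptions. It also uses a single length per transcript for simplicity, whereas Salmon reports an EffectiveLength per sample in quant.sf.

```python
def keep_expressed(counts, min_fraction=0.1):
    """Return ids of transcripts with a non-zero count in at least
    min_fraction of samples (i.e. drop those at zero in >90% by default)."""
    return [
        tx
        for tx, row in counts.items()
        if sum(1 for c in row if c > 0) / len(row) >= min_fraction
    ]


def counts_to_tpm(counts, lengths, keep):
    """Compute TPM over the kept transcripts only.

    counts:  dict of transcript id -> list of per-sample read counts
    lengths: dict of transcript id -> length (one value per transcript;
             a simplifying assumption)
    keep:    transcript ids that passed the filter
    """
    n_samples = len(next(iter(counts.values())))
    tpm = {tx: [0.0] * n_samples for tx in keep}
    for s in range(n_samples):
        # Length-normalised read rate for each kept transcript in sample s.
        rates = {tx: counts[tx][s] / lengths[tx] for tx in keep}
        total = sum(rates.values())
        if total > 0:
            for tx in keep:
                # Scale so the kept transcripts sum to one million in each sample.
                tpm[tx][s] = rates[tx] / total * 1e6
    return tpm


# Toy input: three transcripts, two samples.
counts = {"t1": [10, 20], "t2": [0, 0], "t3": [30, 20]}
lengths = {"t1": 1000, "t2": 500, "t3": 2000}
kept = keep_expressed(counts, min_fraction=0.5)
tpm = counts_to_tpm(counts, lengths, kept)
```

Because the TPM values are computed after filtering, each sample still sums to one million over the retained transcripts, which is exactly what is lost if transcripts are dropped from an existing TPM table.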
In summary: I would like to find a tool that takes transcript-level expression and outputs batch-corrected TPM for input into QTL packages.
Many thanks for your assistance, and I am happy to add any extra detail my question may require.
All the best,
Toby