Batch effects in DE analysis

Matt Krosch

Oct 16, 2015, 2:16:05 AM

to trinityrnaseq-users

Hi all,

I'm interested in knowing whether anyone has experience with batch effects associated with RNA extraction batches (rather than sequencing, library prep, platform, etc.). I'm dealing with some RNA-seq data that has a hint of this sort of batch effect - or at least this batch effect may not be easy to differentiate from the true pattern because of somewhat poor sample design. In a nutshell, time-series samples were extracted sequentially rather than randomised (because the study ran over several months and this was considered better than risking RNA degradation). Extractions were all performed by the same person using the same methods, etc. It appears, however, that samples from one extraction batch in particular are quite different from the others, but this batch also corresponds to a time point that is expected to be different from the rest. There is no other obvious batch effect - i.e., not all extraction batches form separate tight clusters, there is no correlation with sequencing batch, etc. I'm wondering whether the group might have some advice on whether there is a way to differentiate between a putative batch effect and a real difference, or whether the data are essentially useless... Could exploring the sorts of genes that are differentially expressed (to see if they are associated with the different conditions encountered at the different time points) shed some light on whether this is 'real'?

Brian, is this a case that might benefit from some form of batch effect removal in edgeR, as per your conversation last year with Gordon Smyth on Bioconductor Support (https://support.bioconductor.org/p/60581/)? Or is this already taken care of, as best it can be, in the Trinity pipeline?

Thanks,

Matt

Brian Haas

Oct 16, 2015, 7:28:45 AM

to Matt Krosch, trinityrnaseq-users

Hi Matt

If batch effects correlate with the biological signal, then I'm not sure there's anything you can do.... It can be a huge problem, unfortunately.

However, it depends on how strong the batch effect is as compared to the biological signal. If you're looking for subtle differences in the biological variation, then proper design is going to be essential, so you can correct for it or avoid it to begin with. If you're doing a time course where there's a lot of bio variation, you'll still get meaningful results worth following up on for validation.
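To illustrate the point about batch correlating with the biological signal, here is a minimal sketch in Python (not part of the original thread, and not something Trinity or edgeR runs for you). It builds an additive design matrix of the form `~ condition + batch` and checks whether it has full column rank. The function names and the toy sample labels are hypothetical; the second layout below is only an assumed stand-in for the situation Matt describes, where one extraction batch coincides with one time point.

```python
import numpy as np

def one_hot(labels):
    """Indicator columns for every level except the first (treatment coding)."""
    levels = sorted(set(labels))
    return np.array([[1.0 if lab == lev else 0.0 for lev in levels[1:]]
                     for lab in labels])

def batch_is_estimable(condition, batch):
    """True if the additive model ~ condition + batch has full column rank,
    i.e. the batch effect can be separated from the condition effect."""
    n = len(condition)
    X = np.hstack([np.ones((n, 1)), one_hot(condition), one_hot(batch)])
    return bool(np.linalg.matrix_rank(X) == X.shape[1])

# Hypothetical layout 1: every time point appears in every batch (crossed).
crossed_condition = ["t1", "t2", "t3", "t1", "t2", "t3"]
crossed_batch     = ["b1", "b1", "b1", "b2", "b2", "b2"]

# Hypothetical layout 2: batch b2 contains only time point t3 and vice versa.
confounded_condition = ["t1", "t1", "t2", "t2", "t3", "t3"]
confounded_batch     = ["b1", "b1", "b1", "b1", "b2", "b2"]

print(batch_is_estimable(crossed_condition, crossed_batch))        # prints True
print(batch_is_estimable(confounded_condition, confounded_batch))  # prints False
```

When the check fails, the batch column is an exact copy (or linear combination) of condition columns, so no model can attribute the difference to one rather than the other. That is the sense in which a fully confounded batch cannot be corrected after the fact.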

I'm absolutely not an authority on this subject, though, and others' comments are welcome here :)

-Brian

(by iPhone)


Matt Krosch

Feb 4, 2016, 9:17:53 PM

to trinityrnaseq-users, kros...@gmail.com

Hi all,

Just to follow up on this from last year, I've done some more tests with the data and I'm confident the signal is real and not a batch effect.

To clarify - this initial dataset was from a species of non-model aquatic insect that occurs in sympatry with two congeners. I was testing differences between natural populations of this species from different locations. I also sequenced a couple of its congeners from each location with a view to looking into gene family expansion, etc., for various reasons that I won't go into here. In any case, when I put the expression data for all three species in a PCA, the samples cluster according to species (everything was assembled together, mapped back with RSEM and then QC'd with the PtR scripts). Moreover, these other species were extracted and sequenced as part of various batches of the original target species. This gives me heart that the putative batch effect I thought was clouding my biological pattern is not one after all.
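The PCA check described above can be made a little more quantitative. The following is a rough sketch in Python, not the PtR implementation and not run on the real data: it computes PCA scores from a samples-by-genes matrix and reports how much of the variance on the first two components lies between the groups of a given label (species or extraction batch). The expression matrix is simulated, and the function names, group sizes and effect sizes are all assumptions made for illustration.

```python
import numpy as np

def pca_scores(expr, n_components=2):
    """PCA scores for a samples x genes matrix of non-negative expression values."""
    logged = np.log2(expr + 1.0)              # dampen the effect of highly expressed genes
    centered = logged - logged.mean(axis=0)   # centre each gene across samples
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def variance_explained_by(scores, labels):
    """Fraction of total score variance that lies between the groups in `labels`."""
    labels = np.asarray(labels)
    grand = scores.mean(axis=0)
    total = ((scores - grand) ** 2).sum()
    between = sum(
        (labels == lev).sum() * ((scores[labels == lev].mean(axis=0) - grand) ** 2).sum()
        for lev in np.unique(labels)
    )
    return between / total

# Simulated toy data (assumption): 3 species, each split evenly across 2 batches.
rng = np.random.default_rng(0)
species = np.repeat(["sp1", "sp2", "sp3"], 4)
batch = np.tile(["b1", "b1", "b2", "b2"], 3)
log_fold = {sp: rng.normal(0.0, 1.0, size=50) for sp in ["sp1", "sp2", "sp3"]}
expr = np.array([rng.poisson(100.0 * 2.0 ** log_fold[sp]) for sp in species])

scores = pca_scores(expr)
print("between-species fraction:", variance_explained_by(scores, species))
print("between-batch fraction:  ", variance_explained_by(scores, batch))
```

If species explains most of the variance on the leading components while batch explains little, even though each batch contains several species, that is consistent with the reasoning in the post. Note that this only works because species and batch are not confounded with each other in this layout.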

The other thing I've done is to use several different filtering strategies to remove reads and reassemble with fewer data, to see if that has any impact on the biological pattern and to remove environmental contaminant reads. It seems that even the most stringent filter retains the biological pattern, and gene ontologies for differentially expressed transcripts all make sense based on expectations.

I'm not sure whether any of these approaches are going to hold up in court, so to speak, but they give me some degree of confidence in my data where other options appear very limited.

Cheers

Matt
