Hi all,
Just to follow up on this from last year, I've done some more tests with the data and I'm confident the signal is real and not a batch effect.
To clarify: this initial dataset was from a species of non-model aquatic insect that occurs in sympatry with two congeners. I was testing for differences between natural populations of this species from different locations. I also sequenced a couple of its congeners from each location, with a view to looking into gene family expansion, etc., for various reasons that I won't go into here. In any case, when I put the expression data for all three species into a PCA (assembled all together, mapped back with RSEM, then QC'd with the PtR scripts), the samples cluster by species. Moreover, these other species were extracted and sequenced as part of the various batches of the original target species, so a strong batch effect should have pulled samples together by batch rather than by species. This gives me heart that the putative batch effect I thought was clouding my biological pattern is not real after all.
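For anyone wanting to put a number on "clusters by species, not by batch", here is a rough sketch of the idea in plain Python. It is not the PtR workflow and not what produced my plots. The toy counts, the labels and the function names (log_center, first_pc_scores, variance_explained) are all made up for illustration. It takes the PC1 scores and asks what fraction of their variance sits between species versus between batches.

```python
import math
import random

def log_center(matrix):
    """log2(x + 1) transform, then centre each gene (column) across samples."""
    logged = [[math.log2(v + 1.0) for v in row] for row in matrix]
    n = len(logged)
    means = [sum(col) / n for col in zip(*logged)]
    return [[v - m for v, m in zip(row, means)] for row in logged]

def first_pc_scores(centered, iterations=200):
    """Sample scores on PC1, via power iteration on the sample-by-sample Gram matrix."""
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered] for ri in centered]
    vec = [1.0 / (i + 1) for i in range(len(centered))]  # arbitrary non-constant start
    for _ in range(iterations):
        new = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    # Rayleigh quotient gives the top eigenvalue; scores = eigenvector * singular value.
    eigval = sum(v * sum(g * w for g, w in zip(row, vec)) for row, v in zip(gram, vec))
    return [v * math.sqrt(eigval) for v in vec]

def variance_explained(scores, labels):
    """Fraction of score variance lying between label groups (eta-squared)."""
    grand = sum(scores) / len(scores)
    total = sum((s - grand) ** 2 for s in scores)
    groups = {}
    for s, lab in zip(scores, labels):
        groups.setdefault(lab, []).append(s)
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    return between / total

# Toy data (entirely invented): 3 species x 2 batches x 2 replicates, 30 genes.
rng = random.Random(0)
n_genes = 30
baseline = {sp: [rng.uniform(10, 1000) for _ in range(n_genes)] for sp in "ABC"}
species, batch, counts = [], [], []
for sp in "ABC":
    for b, batch_factor in (("batch1", 1.0), ("batch2", 1.1)):  # mild batch shift
        for _ in range(2):
            species.append(sp)
            batch.append(b)
            counts.append([x * batch_factor * (1 + rng.gauss(0, 0.05))
                           for x in baseline[sp]])

scores = first_pc_scores(log_center(counts))
species_r2 = variance_explained(scores, species)
batch_r2 = variance_explained(scores, batch)
print(f"PC1 variance between species: {species_r2:.3f}")
print(f"PC1 variance between batches: {batch_r2:.3f}")
```

If batch were driving the pattern, the batch fraction would be the large one. On real data you would want to look at more than PC1, and this only works as a check because each species is spread across the batches.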
The other thing I've done is to use several different filtering strategies to remove reads and reassemble with fewer data, to see if that has any impact on the biological pattern and to remove environmental contaminant reads. It seems that even the most stringent filter retains the biological pattern, and gene ontologies for differentially expressed transcripts all make sense based on expectations.
I'm not sure whether any of these approaches would hold up in court, so to speak, but they give me some degree of confidence in my data where other options appear very limited.
Cheers
Matt