batch effect and clone_filter

144 views
Skip to first unread message

Eloise Cave

unread,
Aug 1, 2023, 5:29:39 PM8/1/23
to Stacks
Hello, 
I've be combing through this group to learn about possible batch effect solutions. However, I haven't really found a good solution other than using a different software. 
I recently sequenced 3 plate using ddRAD-seq. Plate 1 was sequenced on a NextSeq and plates 2 and 3 were sequenced on a Novaseq. After going through the pipeline and filtering the SNPs I found myself seeing a batch effect between plate 1 (from Nextseq) and plates 2/3 (Novaseq). Is there something or some parameter I should be focusing on? I am very new to this so forgive me if I am too vague. 

Also, I've been trying to understand the proper way to remove PCR duplicates. I initially was trying clone_filter but came across a couple of threads mentioning that it is not appropriate for dd-RAD data. Is this correct? If so, is the -rm-pcr-duplicates in gstacks sufficient? 

Thanks!
Eloise 

Catchen, Julian

unread,
Aug 2, 2023, 4:48:44 PM8/2/23
to stacks...@googlegroups.com

Hi Eloise,

 

You have to be specifc about what you mean by “batch effect.” What specific problem are you trying to correct and how can you correct it?

 

For example, if your samples from plate 1 received half the coverage on the Nextseq versus the samples on the Novaseq, then that is a problem but what does it mean to correct it? If your coverage is much lower on plate 1, you will get fewer SNP calls and probably a slight bias towards calling more homozygotes to heterozygotes, but how do you want to “fix” that informatically? Certainly you could subsample your Novaseq results to have less coverage so it matches the nextseq data, but that is not going to help your analysis since now you have three bad libraries instead of one. If the libraries were prepared separately, then the DNA quality may have been significantly different between the different plates (maybe one library is from museum samples and the other libraries are from fresh DNA), and again, you can’t really correct this after the fact. You can make new libraries and sequence plate 1 again, or just sequence the existing library again to increase coverage. Importantly, in both of these cases changing software cannot fix these types of issues, however, without more details I don’t know if this actually fits your situation.

 

With respect to PCR duplicates (these will again be higher if the DNA quality of a particular library is lower), they cannot be detected in ddRAD data unless your adaptors have a degenerate oligo sequence embedded in them to label unique PCR products (e.g. 3RAD does this and then you can use clone_filter to remove them). This is because PCR duplicates are detected by the identical alignment positions of a pair of reads – since ddRAD data all have identical alignment positions for pairs of reads at a single locus (since both reads are anchored by a restriction cutsite) they are indistinguishable. If you apply this type of filter you will generally remove most of your data indiscriminately. (For regular paired-end RAD data, gstacks can remove PCR duplicates with the flag you mention.)

 

Best,

 

julian

Eloise Cave

unread,
Aug 2, 2023, 5:55:14 PM8/2/23
to Stacks
HI Julian, 
Thank you for your response. It has clarified a lot for me. I wanted to specify a little bit more on the batch effect. As mentioned before I have 3 plates, 1 sequenced on a Nextseq and the other 2 on a Novaseq. I see a batch effect when I run a PCA on filtered SNP, specifically the samples in plate 1 (nextseq) cluster seperately from plates 2 and 3 (both novaseq). I also see this clustering when running the filtered data through STRUCTURE. All of plate 1 samples are assigned to one cluster and samples in 2-3 to another. My samples are tiger shark tissue samples collected from the Gulf of Mexico, Northern Caribbean, and western Atlantic. They are not arranged specifically on the plates for it to be some kind of sample arrangement coincidence, they are randomly arranged individuals across all my sample locations somewhat evenly distributed. Biologically, samples across these large geographical ranges should not be similar for them to be assigned to the same population/cluster (we have some evidence of this with microsatellite data). The samples were carefully selected to be high quality and high concentration with potentially some lower quality and lower concentration ones sprinkled among all 3 plates randomly. 

The differences in sequencing depth definitely makes sense and it was something I had suspected all along though I wasn't sure if there was a way around it bioinfomatically. We may be able to sequence them again but it depends on the cost. If I can sequence them again, how can I ADD the new sequences to the existing sequences? 
Additionally, one solution that was brought to my attentions was to remove the loci or sites that might be causing the batch effect/differences between the two sequencing platforms. Is this a viable option? 
Also, If I am not able to re-sequence the individuals in plate 1 (Nextseq) would you suggest just not using the samples from plate 1 (Nextseq) given potential differences is sequencing depth? This was my last resort option. 

Really appreciate all the help. Curious to hear your suggestions.
Eloise 

Catchen, Julian

unread,
Aug 30, 2023, 5:23:17 PM8/30/23
to stacks...@googlegroups.com

Hi Eloise,

 

Sorry for the delay, I lost track of your email. It is good that you took the time to distribute the samples randomly across the plates. The first thing I would do (assuming you are still working on this particular problem) would be to quantify the sequencing depth of the samples on plate 1 vs. plates 2 and 3 (denovo_map.pl will give you a table of depths). I would assume you will see large differences in the depth of coverage. The most likely reason PCA/STRUCTURE are grouping plate 1 is because there is a large number of SNPs that appear on plates 2 and 3, but are missing on plate 1. One strategy would be to use the filtering in populations to reduce the whole analysis to SNPs that are available on all 3 plates (-r and -p filters in populations). This may or may not work, depending on what the coverage levels are for plate 1 and therefore how much room you have to filter.

 

If you can sequence again, it is easy to add the additional sequence – you just run process_radtags on the new data, and then concatenate the sample files from the old and new sequencing runs together.

 

Removing individual sites is hard to do, but in reality is very similar to filtering SNPs – the same caveat applies, depending on the depth for the first plate, you may or may not succeed in filtering.

 

Lastly, what happens if you look at the second or third dimensions of the PCA plot instead of the first? Do you see samples group biologically instead of by plate? (Similarly in STRUCTURE, if you increase K, do you see a different grouping?)

Reply all
Reply to author
Forward
0 new messages