Hi Eloise,
You have to be specific about what you mean by “batch effect.” What specific problem are you trying to correct, and what would correcting it actually mean?
For example, if your samples from plate 1 received half the coverage on the NextSeq versus the samples on the NovaSeq, then that is a problem, but what does it mean to correct it? If your coverage is much lower on plate 1, you will get fewer SNP calls and probably a slight bias towards calling homozygotes over heterozygotes, but how do you want to “fix” that informatically? Certainly you could subsample your NovaSeq results to match the lower NextSeq coverage, but that is not going to help your analysis, since now you have three bad libraries instead of one.
If the libraries were prepared separately, then the DNA quality may have been significantly different between the plates (maybe one library is from museum samples and the others are from fresh DNA), and again, you can’t really correct for this after the fact. You can make a new library and sequence plate 1 again, or just sequence the existing library again to increase coverage. Importantly, in both of these cases changing software cannot fix these types of issues; however, without more details I don’t know if this actually fits your situation.
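(For completeness, if you did want to test the effect of downsampling, a generic tool like seqtk can do it. The directory names and the 0.5 fraction below are made up; you would set them from your actual depth numbers.)

    # Downsample each NovaSeq sample to roughly half its reads to mimic the
    # NextSeq coverage. Fixing the seed (-s) keeps paired files in sync if
    # you run the same command on both reads of a pair.
    for fq in novaseq_samples/*.fq.gz; do
        seqtk sample -s42 "$fq" 0.5 | gzip > downsampled/"$(basename "$fq")"
    done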
With respect to PCR duplicates (these will again be higher if the DNA quality of a particular library is lower), they cannot be detected in ddRAD data unless your adaptors have a degenerate oligo sequence embedded in them to label unique PCR products (e.g. 3RAD does this, and then you can use clone_filter to remove them). This is because PCR duplicates are detected by the identical alignment positions of a pair of reads. Since in ddRAD data both reads are anchored by a restriction cutsite, all read pairs at a single locus have identical alignment positions, so true duplicates are indistinguishable from independent molecules. If you apply this type of filter you will generally remove most of your data indiscriminately. (For regular paired-end RAD data, where the second read comes from a randomly sheared end, gstacks can remove PCR duplicates with the flag you mention.)
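For reference, the two commands look something like this. The file names are placeholders, and the oligo position and length in the clone_filter call are assumptions that you would need to match to your actual adapter design:

    # Degenerate-oligo data (e.g. 3RAD): collapse PCR duplicates using the
    # random oligo, here assumed inline at the start of the single-end read
    # and 8bp long.
    clone_filter -1 sample.1.fq.gz -2 sample.2.fq.gz -i gzfastq \
        -o ./deduped/ --inline_null --oligo_len_1 8

    # Regular paired-end RAD (randomly sheared second read): gstacks can
    # identify duplicates from alignment positions instead.
    gstacks -P ./stacks/ -M ./popmap.tsv --rm-pcr-duplicates -t 8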
Best,
julian
Hi Eloise,
Sorry for the delay, I lost track of your email. It is good that you took the time to distribute the samples randomly across the plates. The first thing I would do (assuming you are still working on this particular problem) is quantify the sequencing depth of the samples on plate 1 versus plates 2 and 3 (denovo_map.pl will give you a table of depths in its log). I would expect you to see large differences in depth of coverage. The most likely reason PCA/STRUCTURE are grouping plate 1 together is that a large number of SNPs appear on plates 2 and 3 but are missing on plate 1. One strategy would be to use the filtering in populations to reduce the whole analysis to SNPs that are available on all 3 plates (the -r and -p filters in populations). This may or may not work, depending on what the coverage levels are for plate 1 and therefore how much room you have to filter.
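As a sketch, one way to set that up is to treat each plate as a “population” in the popmap, so the -p/-r filters operate on per-plate presence. The popmap contents and thresholds below are assumptions for illustration:

    # Check per-sample depth first. The exact wording of the log lines
    # varies by Stacks version, so just search for the coverage reports.
    grep -i "coverage" ./stacks/denovo_map.log

    # popmap_plates.tsv assigns each sample to its plate, e.g.:
    #   sample_01<TAB>plate1
    #   sample_25<TAB>plate2
    # Then keep only loci typed in all 3 plates (-p 3) and in at least
    # 80% of the samples within each plate (-r 0.8).
    populations -P ./stacks/ -M ./popmap_plates.tsv -p 3 -r 0.8 --vcf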
If you can sequence again, it is easy to add the additional sequence – you just run process_radtags on the new data, and then concatenate the sample files from the old and new sequencing runs together.
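Concretely, something like this (directory names are hypothetical); gzipped FASTQ files can be concatenated directly without decompressing:

    # After process_radtags has written the new run's samples into their
    # own directory, append each file onto the matching file from the
    # first run.
    for f in run2_samples/*.fq.gz; do
        cat "$f" >> run1_samples/"$(basename "$f")"
    done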
Removing individual sites is harder to do, but in reality it is very similar to filtering SNPs, and the same caveat applies: depending on the depth for the first plate, you may or may not succeed in filtering.
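If you do end up with a specific list of sites to drop, one way to do it is downstream of populations, e.g. with vcftools; here bad_sites.tsv is a hypothetical two-column (CHROM, POS) list of the sites to remove:

    # Export the SNPs to VCF, then remove the listed sites.
    # populations.snps.vcf is the default VCF name written by populations.
    populations -P ./stacks/ -M ./popmap.tsv -r 0.8 --vcf
    vcftools --vcf ./stacks/populations.snps.vcf \
        --exclude-positions bad_sites.tsv --recode --out snps_filtered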
Lastly, what happens if you look at the second or third dimensions of the PCA plot instead of the first? Do you see samples group biologically instead of by plate? (Similarly in STRUCTURE, if you increase K, do you see a different grouping?)