Hi all,
I am running the denovo pipeline (parameters m3M4n5) in Stacks 2 beta2 on a paired-end ddRAD dataset. I was surprised to find in the output of gstacks that a large number (~20%) of the reverse reads overlapped the forward reads, as seen below.
Attempted to assemble and align paired-end reads for 167527 loci:
4 loci had no or almost no paired-end reads (0.0%);
1198 loci had paired-end reads that couldn't be assembled into a contig (0.7%);
For the remaining 166325 loci (99.3%), a paired-end contig was assembled;
Average contig size was 129.0 bp;
39303 paired-end contigs overlapped the forward region (23.6%;
mean overlap: 46.9bp; mean size of overlapped loci after merging: 208.1)
Out of 49930774 paired-end reads in these loci (mean 287.0 reads per locus),
47730026 were successfuly aligned (95.6%);
mean insert length was 203.6, stdev: 48.3 (based on aligned reads in overlapped loci).
I size-selected the fragments for a range between 346 and 406: with a single 70bp adaptor, I would expect the actual fragment length to be at least 276 base pairs. My read length was 125bp, so while there may have been some fragments shorter than 250bp, 20% seems quite high. I used an outside program (NGMerge) to test-merge raw forward and reverse reads for one of my individuals, and found only ~1.5% of reads had an overlap of >10bp, which seems more in line with my expectations. Does anyone know of an element of the Stacks pipeline that might account for this discrepancy? Obviously raw reads and loci are different things, but it seems surprising to go from 1.5% to 23.6% overlapping reads.
Two other anomalies that may be associated: the average contig size for the reverse reads is given as 129.0, but the reverse reads are only 125bp, and 120bp without the adaptor. And in my populations runs, all samples are listed as 0 for “valid samples at locus,” “confounded samples at locus”, and “absent samples at locus.”
In addition to the paired-end run, I separately ran the pipeline with only forward reads and with only reverse reads, and got virtually the same number of SNPs (selecting one per locus) and the same neighbor-joining tree results (which fit with what we know about the geographic structure of the species) for all 3 datasets, so clearly whatever the issue is it’s not substantially affecting the results. Still, I’d love to hear if anyone has had any similar results with the new gstacks or has any explanations for what could be happening.
Thank you!
Best,
Eamon Corbett
Hi Liz,
We have noticed that the difference between the library structure
intended at the bench and the actual structure seen by the
sequencer is sometimes significant. This may or not be the case
for your library (and there is no way to know for sure).
If you are willing to send us some of your data, we could look at
a few of these overlap manually, to check that they at least don't
look wrong. If you could just send a dozen matches.bam files (e.g.
through dropbox/drive), it would be ideal.
Best,
Nicolas
--
Stacks website: http://catchenlab.life.illinois.edu/stacks/
---
You received this message because you are subscribed to the Google Groups "Stacks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stacks-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/stacks-users/681169cf-053c-45c9-b997-a25b313489a5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.