Unexpected gstacks overlapping results

Eamon Corbett

unread,

Dec 1, 2017, 2:06:49 PM12/1/17

to Stacks

Hi all,

I am running the denovo pipeline (parameters m3M4n5) in Stacks 2 beta2 on a paired-end ddRAD dataset. I was surprised to find in the output of gstacks that a large number (~20%) of the reverse reads overlapped the forward reads, as seen below.

Attempted to assemble and align paired-end reads for 167527 loci:

4 loci had no or almost no paired-end reads (0.0%);

1198 loci had paired-end reads that couldn't be assembled into a contig (0.7%);

For the remaining 166325 loci (99.3%), a paired-end contig was assembled;

Average contig size was 129.0 bp;

39303 paired-end contigs overlapped the forward region (23.6%;

mean overlap: 46.9bp; mean size of overlapped loci after merging: 208.1)

Out of 49930774 paired-end reads in these loci (mean 287.0 reads per locus),

47730026 were successfuly aligned (95.6%);

mean insert length was 203.6, stdev: 48.3 (based on aligned reads in overlapped loci).

I size-selected the fragments for a range between 346 and 406: with a single 70bp adaptor, I would expect the actual fragment length to be at least 276 base pairs. My read length was 125bp, so while there may have been some fragments shorter than 250bp, 20% seems quite high. I used an outside program (NGMerge) to test-merge raw forward and reverse reads for one of my individuals, and found only ~1.5% of reads had an overlap of >10bp, which seems more in line with my expectations. Does anyone know of an element of the Stacks pipeline that might account for this discrepancy? Obviously raw reads and loci are different things, but it seems surprising to go from 1.5% to 23.6% overlapping reads.

Two other anomalies that may be associated: the average contig size for the reverse reads is given as 129.0, but the reverse reads are only 125bp, and 120bp without the adaptor. And in my populations runs, all samples are listed as 0 for “valid samples at locus,” “confounded samples at locus”, and “absent samples at locus.”

In addition to the paired-end run, I separately ran the pipeline with only forward reads and with only reverse reads, and got virtually the same number of SNPs (selecting one per locus) and the same neighbor-joining tree results (which fit with what we know about the geographic structure of the species) for all 3 datasets, so clearly whatever the issue is it’s not substantially affecting the results. Still, I’d love to hear if anyone has had any similar results with the new gstacks or has any explanations for what could be happening.

Thank you!

Best,

Eamon Corbett

Julian Catchen

unread,

Dec 1, 2017, 3:35:45 PM12/1/17

to stacks...@googlegroups.com, eamonc...@gmail.com

Hi Eamon, you will want to upgrade to Beta5 as we have done a lot of
work on the overlapping algorithm. -julian

Elizabeth Milano

unread,

Oct 2, 2018, 3:16:44 PM10/2/18

to Stacks

I am also experiencing a higher than expected number of overlapping paired-end contigs. In my case, I size-selected ddRAD fragments between 300 and 400 bp where the final library size, after adapters, is ~440bp (measured on a bioanalyzer). I have 100bp paired-end reads so I don't expect any overlap, but am seeing 17.5% overlap after gstacks. Is there an option to drop these pairs or a way to make the matching more strict? I am running stacks 2.2, gstacks.log below.

Thanks,

Liz

gstacks v2.2, executed 2018-09-27 11:40:32

/.../2.2/bin/gstacks -P ./m3n2stacks -t 40 -M info/popmap

Locus/sample distributions will be written to './m3n2stacks/gstacks.log.distribs'.

Configuration for this run:

Input mode: denovo

Population map: 'info/popmap'

Input files: 8, e.g. './m3n2stacks/PIAL_FS_D4B2_01a.matches.bam'

Output to: './m3n2stacks/'

Model: marukilow (var_alpha: 0.05, gt_alpha: 0.05)

Reading BAM headers...

Processing all loci...

1%...

2%...

5%...

10%...

20%...

50%...

100%

Attempted to assemble and align paired-end reads for 111430 loci:

11 loci had no or almost no paired-end reads (0.0%);

59 loci had paired-end reads that couldn't be assembled into a contig (0.1%);

For the remaining 111360 loci (99.9%), a paired-end contig was assembled;

Average contig size was 101.0 bp;

19484 paired-end contigs overlapped the forward region (17.5%)

Mean overlap: 25.8bp; mean size of overlapped loci after merging: 171.3;

Out of 2639641 paired-end reads in these loci (mean 17.5 reads per locus),

1953309 were successfuly aligned (74.0%);

Mean insert length was 179.3, stdev: 16.2 (based on aligned reads in overlapped loci).

Genotyped 111430 loci:

effective per-sample coverage: mean=49.9x, stdev=15.0x, min=19.3x, max=73.9x

mean number of sites per locus: 165.2

a consistent phasing was found for 2510 of out 2813 (89.2%) diploid loci needing phasing

gstacks is done.

Nicolas Rochette

unread,

Oct 2, 2018, 3:35:00 PM10/2/18

to Stacks

Hi Liz,

We have noticed that the difference between the library structure intended at the bench and the actual structure seen by the sequencer is sometimes significant. This may or not be the case for your library (and there is no way to know for sure).

If you are willing to send us some of your data, we could look at a few of these overlap manually, to check that they at least don't look wrong. If you could just send a dozen matches.bam files (e.g. through dropbox/drive), it would be ideal.

Best,

Nicolas

--
Stacks website: http://catchenlab.life.illinois.edu/stacks/
---
You received this message because you are subscribed to the Google Groups "Stacks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stacks-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/stacks-users/681169cf-053c-45c9-b997-a25b313489a5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Elizabeth Milano

unread,

Oct 2, 2018, 8:12:06 PM10/2/18

to Stacks

Thanks, I emailed you a drive link.

Reply all

Reply to author

Forward