Using --disable-gapped improved de novo assembly

Geoffrey Walker

unread,

Oct 6, 2025, 1:44:29 PMOct 6

to Stacks

Hello,

I am currently assembling stacks de novo using 3RAD sequence data (150bp paired-end) that was generated across multiple sequencing runs. The resulting data had high missingness, which I originally attributed to a high proportion of removed PCR clones and chance differences in loci between sequencing runs. Some runs had higher missingness than others.

While troubleshooting an unrelated issue, I attempted rerunning the stacks denovo pipeline with the --disable-gapped feature for ustacks, cstacks, and sstacks. I was surprised to see the resulting dataset from stacks populations retained more SNPs, had dramatically lower missingness, and the uneven missingness across libraries disappeared.

Comparing sstacks logs for the same sample with gapped alignments enabled:

---

36984 sample loci compared against the catalog containing 240623 loci.

4708 matching loci, 175 contained no verified haplotypes.

7 loci matched more than one catalog locus and were excluded.

168 loci contained SNPs unaccounted for in the catalog and were excluded.

5077 total haplotypes examined from matching loci, 4887 verified.

Searching for gapped alignments...

Out of 36984 query loci, 32451 gapped alignments attempted.

29373 loci matched one catalog locus; 36906 total haplotypes examined, 32347 verified.

3078 loci matched no catalog locus;

1103 loci matched more than one catalog locus and were excluded.

576 loci contained SNPs unaccounted for in the catalog and were excluded.

0 loci had no verified haplotypes.

0 loci had inconsistent alignments to a catalog locus and were excluded.

---

Versus gapped alignments disabled:

---

37463 sample loci compared against the catalog containing 198918 loci.

37297 matching loci, 245 contained no verified haplotypes.

15 loci matched more than one catalog locus and were excluded.

230 loci contained SNPs unaccounted for in the catalog and were excluded.

41666 total haplotypes examined from matching loci, 41385 verified.

---

Shows more sample loci matching to a smaller catalog when gapped alignments are disabled.

I also attached the individual missingness profiles resulting from stacks populations (r=0.5, p= all pops) with gapped alignments enabled (7,489 SNPs, 21.97% dataset missingness) versus disabled (10,718 SNPs, 12.53% dataset missingness).

I’m wondering if there could have been indel sequencing errors causing messy gapped alignments and poor stacks assembly. I’m obviously very excited to see the much-improved data, but I want to make sure the improvement isn’t due to some problematic artificial consequence of disabling gapped alignments. I would think that disabling gapped alignments would be considered a stricter, more conservative approach to stack formation and should not be an issue. Is that accurate?

Thank you,
Geoffrey

gaps-disabled missingness.png

default missingness.png

Catchen, Julian

unread,

Oct 9, 2025, 8:08:20 PMOct 9

to stacks...@googlegroups.com

Hi Geoffrey,

If your data have a lot of segregating indels and you disable gapped alignments, the main effect will be those alleles with gaps between them will not be merged together, so they will be maintained as two, separate homozygous loci, instead of one heterozygous locus. So, it would result in a depressed rate of heterozygosity.

But it is strange that you have fewer loci in your catalog when you disable gaps then when you allow them. What version of Stacks did you use? It is hard to know what might be going on without more information. It might help to look at the ustacks output for one or two samples with and without gapped alignments.

One other consideration is that this set of loci that is variable between your gaps/no gaps runs might not be particularly useful, they may just be repetitive regions of the genome merging together.

Best,

Juilan

Geoffrey Walker

unread,

Oct 10, 2025, 11:27:35 AMOct 10

to Stacks

Hi Julian,

Thank you for your quick response. I certainly want to avoid downwardly biasing heterozygosity, so I will work to figure out how significant the impact is.

Following ustacks, there are more loci assembled in each individual with --disable-gapped, as expected (highlighted sections differ between methods):

With gapped alignment (default):

Sample 1

Loaded 946855 reads; formed:

49040 stacks representing 497773 primary reads (52.6%)

419890 secondary stacks representing 449082 secondary reads (47.4%)

Stack coverage: mean=10.15; stdev=11.65; max=486; n_reads=497773(52.6%)

Removing repetitive stacks: cov > 46 (mean+3*stdev)...

Blacklisted 954 stacks.

Coverage after repeat removal: mean=9.57; stdev=8.01; max=46; n_reads=460030(48.6%)

Assembling stacks (max. dist. M=1)...

Assembled 48086 stacks into 43322; blacklisted 61 stacks.

Coverage after assembling stacks: mean=10.63; stdev=9.09; max=89; n_reads=456094(48.2%)

Merging secondary stacks (max. dist. N=3 from consensus)...

Merged 258883 out of 449082 secondary reads (57.6%), 7120 merged with gapped alignments.

Coverage after merging secondary stacks: mean=16.67; stdev=14.28; max=147; n_reads=714977(75.5%)

Assembling stacks, allowing for gaps (min. match length 80.0%)...

Assembled 43322 stacks into 42688 stacks.

Coverage after gapped assembly: mean=16.92; stdev=14.48; max=147; n_reads=714977(75.5%)

Merging secondary stacks, allowing for gaps (min. match length 80.0%)...

Merged 5851 out of 190199 secondary reads (3.1%).

Coverage after merging gapped secondary stacks: mean=17.06; stdev=14.56; max=147; n_reads=720828(76.1%)

Final number of stacks built: 42688

Final coverage: mean=17.06; stdev=14.56; max=147; n_reads=720828(76.1%)

With --disable-gapped:

Sample 1

Loaded 946855 reads; formed:

49040 stacks representing 497773 primary reads (52.6%)

419890 secondary stacks representing 449082 secondary reads (47.4%)

Stack coverage: mean=10.15; stdev=11.65; max=486; n_reads=497773(52.6%)

Removing repetitive stacks: cov > 46 (mean+3*stdev)...

Blacklisted 954 stacks.

Coverage after repeat removal: mean=9.57; stdev=8.01; max=46; n_reads=460030(48.6%)

Assembling stacks (max. dist. M=1)...

Assembled 48086 stacks into 43322; blacklisted 61 stacks.

Coverage after assembling stacks: mean=10.63; stdev=9.09; max=89; n_reads=456094(48.2%)

Merging secondary stacks (max. dist. N=3 from consensus)...

Merged 258883 out of 449082 secondary reads (57.6%), 7120 merged with gapped alignments.

Coverage after merging secondary stacks: mean=16.67; stdev=14.28; max=147; n_reads=714977(75.5%)

Final number of stacks built: 43322

Final coverage: mean=16.67; stdev=14.28; max=147; n_reads=714977(75.5%)

However, when these samples are used to build the cstacks catalog, the --disabled-gapped loci have a much easier time matching with something previously added to the catalog than the loci with gaps. For example, here is the cstacks output for when the 150^th sample is added to the catalog:

With gapped alignment (default):

Searching for sequence matches...

233378 loci in the catalog, 30549227 kmers in the catalog hash.

Searching for gapped alignments...

Merging matches into catalog...

5397 loci were matched to a catalog locus.

26322 loci were matched to a catalog locus using gapped alignments.

973 loci were newly added to the catalog.

286 loci matched more than one catalog locus, linking them.

579 linked catalog loci were merged into 285 loci.

With --disable-gapped:

Searching for sequence matches...

196360 loci in the catalog, 19339485 kmers in the catalog hash.

Merging matches into catalog...

32786 loci were matched to a catalog locus.

0 loci were matched to a catalog locus using gapped alignments.

135 loci were newly added to the catalog.

17 loci matched more than one catalog locus, linking them.

34 linked catalog loci were merged into 17 loci.

The cumulative effect of this is many more loci being newly added to the catalog with the default gapped alignments enabled for ustacks and cstacks, followed by much more successful matching to the catalog by sstacks, as shown in my first post.

When analyzing the data downstream, I have seen evidence of likely technical sequencing errors which get cleared up during quality filtering, but may be interfering with stacks assembly and may account for the high proportion of secondary reads. The genome of this species is likely very large and repetitive as well.

Thanks,
Geoffrey

Reply all

Reply to author

Forward