gstacks removes almost all reads (BWA issue with mapping??)

Andrew Veale

unread,

Mar 19, 2020, 1:01:42 PM3/19/20

to Stacks

Hi All,

I have some GBS reads from DartSeq that I wanted to re-assemble now that I have a high quality genome. The genome is chromosome level quality so it shouldn't have any issues. So I took the de-multiplexed reads they gave me (FASTQ interleaved) and mapped them to the genome with BWA producing SAM files, then transformed these into sorted BAM files, then ran ref_map.pl with default settings.

It looks like basically all GBS reads were either unmapped, mapped with insufficient quality, or excessively soft clipped. I'm guessing this is a BWA issue? Should I use Bowtie2 instead without soft clipping?

Given that almost all loci appear in only 1 - 2 individuals the output is obviously unusable.

Any help would be appreciated!

log file results below:

/usr/local/bin/ref_map.pl -T 15 -o ./stacks --popmap Stoat_population_map1.txt --samples ./Alignments

gstacks

==========

/usr/local/bin/gstacks -I ./Alignments/ -M Stoat_population_map1.txt -O ./stacks -t 15

Logging to './stacks/gstacks.log'.

Locus/sample distributions will be written to './stacks/gstacks.log.distribs'.

Configuration for this run:

Input mode: reference-based

Population map: 'Stoat_population_map1.txt'

Input files: 86, e.g. './Alignments/10S.bam'

Output to: './stacks/'

Model: marukilow (var_alpha: 0.01, gt_alpha: 0.05)

Reading BAM headers...

Processing all loci...

1K...

2K...

5K...

10K...

20K...

50K...

100K...

done.

Read 203688022 BAM records:

kept 421469 primary alignments (0.2%), of which 0 reverse reads

skipped 30398146 primary alignments with insufficient mapping qualities (14.9%)

skipped 118010299 excessively soft-clipped primary alignments (57.9%)

skipped 54819320 unmapped reads (26.9%)

skipped some suboptimal (secondary/supplementary) alignment records

Per-sample stats (details in 'gstacks.log.distribs'):

read 2368465.4 records/sample (1892566-2624233)

kept 0.0%-1.2% of these

Built 108039 loci comprising 421469 forward reads and 0 matching paired-end reads; mean insert length was 0.0 (sd: nan).

Genotyped 108039 loci:

effective per-sample coverage: mean=3.7x, stdev=1.1x, min=1.0x, max=11.4x

mean number of sites per locus: 74.9

a consistent phasing was found for 94 of out 94 (100.0%) diploid loci needing phasing

gstacks is done.

populations

==========

/usr/local/bin/populations -P ./stacks -t 15 -M Stoat_population_map1.txt

Logging to './stacks/populations.log'.

Locus/sample distributions will be written to './stacks/populations.log.distribs'.

populations parameters selected:

Percent samples limit per population: 0

Locus Population limit: 1

Percent samples overall: 0

Minor allele frequency cutoff: 0

Maximum observed heterozygosity cutoff: 1

Applying Fst correction: none.

Pi/Fis kernel smoothing: off

Fstats kernel smoothing: off

Bootstrap resampling: off

Parsing population map...

The population map contained 86 samples, 1 population(s), 1 group(s).

Working on 86 samples.

Working on 1 population(s):

1: 10S, 11V, 11Z, 13P, 14F, 14G, 14V, 14Z, 1A, 1D, 1L, 1R, 1V, 1Xa, 1Xb, 20M, 22N, 22Q, 23Q, 24F, 24N, 24Q, 25L, 25Q, 25Y, 26C,

27C, 29C, 2D, 2P, 2W, 2X, 30L, 31L, 31R, 32R, 33A, 33N, 34C, 34M, 361, 362, 38A, 3D, 3W, 3X, 3Z, 473, 47A, 4G, 4N, 4S, 4W,

4X, 517, 5G, 5W, 68A, 69A, 6G, 6N, 6W, 7F, 7N, 7W, 7Y, 7Z, 8N, 8R, 8X, 9R, 9Z, Coromandel_rep, HB1a, HB1b, Hanmer_rep, Murchison_rep,

Otago_rep, RI1971, RI2731, SI355, SI617B, Taranaki_rep, Trounson_rep, Uruwera_rep, Whangarei_rep

Working on 1 group(s) of populations:

defaultgrp: 1

Raw haplotypes will be written to './stacks/populations.haplotypes.tsv'

Population-level summary statistics will be written to './stacks/populations.sumstats.tsv'

Population-level haplotype summary statistics will be written to './stacks/populations.hapstats.tsv'

Processing data in batches:

* load a batch of catalog loci and apply filters

* compute SNP- and haplotype-wise per-population statistics

* write the above statistics in the output files

* export the genotypes/haplotypes in specified format(s)

More details in './stacks/populations.log.distribs'.

Now processing...

scaffold_1|arrow

scaffold_2|arrow

scaffold_3|arrow

scaffold_4|arrow

scaffold_5|arrow

scaffold_6|arrow

scaffold_7|arrow

scaffold_8|arrow

scaffold_9|arrow

scaffold_10|arrow

scaffold_11|arrow

scaffold_12|arrow

scaffold_13|arrow

scaffold_14|arrow

scaffold_15|arrow

scaffold_16|arrow

scaffold_17|arrow

scaffold_18|arrow

scaffold_19|arrow

scaffold_20|arrow

scaffold_21|arrow

scaffold_22|arrow

scaffold_23|arrow

scaffold_24|arrow

scaffold_25|arrow

scaffold_26|arrow

scaffold_27|arrow

scaffold_28|arrow

scaffold_29|arrow

scaffold_30|arrow

scaffold_31|arrow

scaffold_32|arrow

scaffold_33|arrow

scaffold_34|arrow

scaffold_35|arrow

scaffold_36|arrow

scaffold_37|arrow

scaffold_38|arrow

scaffold_39|arrow

scaffold_40|arrow

scaffold_41|arrow

scaffold_44|arrow

scaffold_46|arrow

scaffold_47|arrow

scaffold_48|arrow

scaffold_49|arrow

scaffold_51|arrow

scaffold_53|arrow

scaffold_54|arrow

scaffold_60|arrow

scaffold_63|arrow

scaffold_64|arrow

scaffold_65|arrow

scaffold_68|arrow

scaffold_69|arrow

scaffold_70|arrow

scaffold_73|arrow

scaffold_79|arrow

scaffold_87|arrow

scaffold_89|arrow

Removed 0 loci that did not pass sample/population constraints from 108039 loci.

Kept 108039 loci, composed of 8101980 sites; 0 of those sites were filtered, 2959 variant sites remained.

6648392 genomic sites, of which 1235437 were covered by multiple loci (18.6%).

Mean genotyped sites per locus: 74.87bp (stderr 0.01).

Population summary statistics (more detail in populations.sumstats_summary.tsv):

1: 1.8401 samples per locus; pi: 0.72052; all/variant/polymorphic sites: 8089047/2959/2959; private alleles: 0

Populations is done.

ref_map.pl is done.

Julian Catchen

unread,

Mar 19, 2020, 3:29:53 PM3/19/20

to stacks...@googlegroups.com, Andrew Veale

Hi Andrew,

As you suggest, and as gstacks is reporting to you, your DartSeq reads
are not mapping to your reference genome. It is the wrong question to
ask, "why is BWA broken," or, "should I use a different mapper?" BWA is
a well-tested read mapper and it will find alignments if they are
present. Instead, you need to be asking, what is wrong with my data such
that my reads don't map as I think they should? Are your reads full of
adapter sequence, or some kind of bacterial contamination? Or, is your
reference genome not polished well enough to properly align Illumina
reads to it? Or some other factor... You could test some of those
questions by doing a de novo analysis and examining the results.

Andrew Veale wrote on 3/19/20 12:01 PM:

> --
> Stacks website: http://catchenlab.life.illinois.edu/stacks/
> ---
> You received this message because you are subscribed to the Google
> Groups "Stacks" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to stacks-users...@googlegroups.com
> <mailto:stacks-users...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/stacks-users/5075b43d-ce9a-461c-a11a-69119c7083c2%40googlegroups.com
> <https://groups.google.com/d/msgid/stacks-users/5075b43d-ce9a-461c-a11a-69119c7083c2%40googlegroups.com?utm_medium=email&utm_source=footer>.

--
Julian Catchen, Ph.D.
Assistant Professor
Department of Evolution, Ecology, and Behavior
Carl R. Woese Institute for Genomic Biology
University of Illinois, Urbana-Champaign
--
jcat...@illinois.edu; @jcatchen

Andrew Veale

unread,

Mar 21, 2020, 12:59:13 AM3/21/20

to Stacks

Hi Julian

Many thanks for the reply! Yes, you're right - they demultiplexed the files, but left the barcodes on and adapters on for some reason. Their pipeline is obviously odd - I hadn't thought to check since they were already demultiplexed. I'm now running all the files through process RADtags to clean them - I'm sure they will map much better shortly!

Thanks!

> an email to stacks...@googlegroups.com
> <mailto:stacks-users+unsub...@googlegroups.com>.

Reply all

Reply to author

Forward