gstacks removes almost all reads (BWA issue with mapping??)

118 views
Skip to first unread message

Andrew Veale

unread,
Mar 19, 2020, 1:01:42 PM3/19/20
to Stacks
Hi All,

I have some GBS reads from DartSeq that I wanted to re-assemble now that I have a high quality genome.  The genome is chromosome level quality so it shouldn't have any issues.  So I took the de-multiplexed reads they gave me (FASTQ interleaved) and mapped them to the genome with BWA producing SAM files, then transformed these into sorted BAM files, then ran ref_map.pl with default settings.  

It looks like basically all GBS reads were either unmapped, mapped with insufficient quality, or excessively soft clipped.  I'm guessing this is a BWA issue?  Should I use Bowtie2 instead without soft clipping?

Given that almost all loci appear in only 1 - 2 individuals the output is obviously unusable.

Any help would be appreciated!

log file results below:

/usr/local/bin/ref_map.pl -T 15 -o ./stacks --popmap Stoat_population_map1.txt --samples ./Alignments

gstacks
==========
/usr/local/bin/gstacks -I ./Alignments/ -M Stoat_population_map1.txt -O ./stacks -t 15

Logging to './stacks/gstacks.log'.
Locus/sample distributions will be written to './stacks/gstacks.log.distribs'.

Configuration for this run:
  Input mode: reference-based
  Population map: 'Stoat_population_map1.txt'
  Input files: 86, e.g. './Alignments/10S.bam'
  Output to: './stacks/'
  Model: marukilow (var_alpha: 0.01, gt_alpha: 0.05)

Reading BAM headers...
Processing all loci...
1K...
2K...
5K...
10K...
20K...
50K...
100K...
done.

Read 203688022 BAM records:
  kept 421469 primary alignments (0.2%), of which 0 reverse reads
  skipped 30398146 primary alignments with insufficient mapping qualities (14.9%)
  skipped 118010299 excessively soft-clipped primary alignments (57.9%)
  skipped 54819320 unmapped reads (26.9%)
  skipped some suboptimal (secondary/supplementary) alignment records

  Per-sample stats (details in 'gstacks.log.distribs'):
    read 2368465.4 records/sample (1892566-2624233)
    kept 0.0%-1.2% of these

Built 108039 loci comprising 421469 forward reads and 0 matching paired-end reads; mean insert length was 0.0 (sd: nan).

Genotyped 108039 loci:
  effective per-sample coverage: mean=3.7x, stdev=1.1x, min=1.0x, max=11.4x
  mean number of sites per locus: 74.9
  a consistent phasing was found for 94 of out 94 (100.0%) diploid loci needing phasing

gstacks is done.

populations
==========
/usr/local/bin/populations -P ./stacks -t 15 -M Stoat_population_map1.txt

Logging to './stacks/populations.log'.
Locus/sample distributions will be written to './stacks/populations.log.distribs'.
populations parameters selected:
  Percent samples limit per population: 0
  Locus Population limit: 1
  Percent samples overall: 0
  Minor allele frequency cutoff: 0
  Maximum observed heterozygosity cutoff: 1
  Applying Fst correction: none.
  Pi/Fis kernel smoothing: off
  Fstats kernel smoothing: off
  Bootstrap resampling: off

Parsing population map...
The population map contained 86 samples, 1 population(s), 1 group(s).
Working on 86 samples.
Working on 1 population(s):
    1: 10S, 11V, 11Z, 13P, 14F, 14G, 14V, 14Z, 1A, 1D, 1L, 1R, 1V, 1Xa, 1Xb, 20M, 22N, 22Q, 23Q, 24F, 24N, 24Q, 25L, 25Q, 25Y, 26C, 
       27C, 29C, 2D, 2P, 2W, 2X, 30L, 31L, 31R, 32R, 33A, 33N, 34C, 34M, 361, 362, 38A, 3D, 3W, 3X, 3Z, 473, 47A, 4G, 4N, 4S, 4W, 
       4X, 517, 5G, 5W, 68A, 69A, 6G, 6N, 6W, 7F, 7N, 7W, 7Y, 7Z, 8N, 8R, 8X, 9R, 9Z, Coromandel_rep, HB1a, HB1b, Hanmer_rep, Murchison_rep, 
       Otago_rep, RI1971, RI2731, SI355, SI617B, Taranaki_rep, Trounson_rep, Uruwera_rep, Whangarei_rep
Working on 1 group(s) of populations:
    defaultgrp: 1

Raw haplotypes will be written to './stacks/populations.haplotypes.tsv'
Population-level summary statistics will be written to './stacks/populations.sumstats.tsv'
Population-level haplotype summary statistics will be written to './stacks/populations.hapstats.tsv'

Processing data in batches:
  * load a batch of catalog loci and apply filters
  * compute SNP- and haplotype-wise per-population statistics
  * write the above statistics in the output files
  * export the genotypes/haplotypes in specified format(s)
More details in './stacks/populations.log.distribs'.
Now processing...
scaffold_1|arrow 
scaffold_2|arrow 
scaffold_3|arrow 
scaffold_4|arrow 
scaffold_5|arrow 
scaffold_6|arrow 
scaffold_7|arrow 
scaffold_8|arrow 
scaffold_9|arrow 
scaffold_10|arrow 
scaffold_11|arrow 
scaffold_12|arrow 
scaffold_13|arrow 
scaffold_14|arrow 
scaffold_15|arrow 
scaffold_16|arrow 
scaffold_17|arrow 
scaffold_18|arrow 
scaffold_19|arrow 
scaffold_20|arrow 
scaffold_21|arrow 
scaffold_22|arrow 
scaffold_23|arrow 
scaffold_24|arrow 
scaffold_25|arrow 
scaffold_26|arrow 
scaffold_27|arrow 
scaffold_28|arrow 
scaffold_29|arrow 
scaffold_30|arrow 
scaffold_31|arrow 
scaffold_32|arrow 
scaffold_33|arrow 
scaffold_34|arrow 
scaffold_35|arrow 
scaffold_36|arrow 
scaffold_37|arrow 
scaffold_38|arrow 
scaffold_39|arrow 
scaffold_40|arrow 
scaffold_41|arrow 
scaffold_44|arrow 
scaffold_46|arrow 
scaffold_47|arrow 
scaffold_48|arrow 
scaffold_49|arrow 
scaffold_51|arrow 
scaffold_53|arrow 
scaffold_54|arrow 
scaffold_60|arrow 
scaffold_63|arrow 
scaffold_64|arrow 
scaffold_65|arrow 
scaffold_68|arrow 
scaffold_69|arrow 
scaffold_70|arrow 
scaffold_73|arrow 
scaffold_79|arrow 
scaffold_87|arrow 
scaffold_89|arrow 

Removed 0 loci that did not pass sample/population constraints from 108039 loci.
Kept 108039 loci, composed of 8101980 sites; 0 of those sites were filtered, 2959 variant sites remained.
    6648392 genomic sites, of which 1235437 were covered by multiple loci (18.6%).
Mean genotyped sites per locus: 74.87bp (stderr 0.01).

Population summary statistics (more detail in populations.sumstats_summary.tsv):
  1: 1.8401 samples per locus; pi: 0.72052; all/variant/polymorphic sites: 8089047/2959/2959; private alleles: 0
Populations is done.
ref_map.pl is done.

Julian Catchen

unread,
Mar 19, 2020, 3:29:53 PM3/19/20
to stacks...@googlegroups.com, Andrew Veale
Hi Andrew,

As you suggest, and as gstacks is reporting to you, your DartSeq reads
are not mapping to your reference genome. It is the wrong question to
ask, "why is BWA broken," or, "should I use a different mapper?" BWA is
a well-tested read mapper and it will find alignments if they are
present. Instead, you need to be asking, what is wrong with my data such
that my reads don't map as I think they should? Are your reads full of
adapter sequence, or some kind of bacterial contamination? Or, is your
reference genome not polished well enough to properly align Illumina
reads to it? Or some other factor... You could test some of those
questions by doing a de novo analysis and examining the results.

Andrew Veale wrote on 3/19/20 12:01 PM:
> --
> Stacks website: http://catchenlab.life.illinois.edu/stacks/
> ---
> You received this message because you are subscribed to the Google
> Groups "Stacks" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to stacks-users...@googlegroups.com
> <mailto:stacks-users...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/stacks-users/5075b43d-ce9a-461c-a11a-69119c7083c2%40googlegroups.com
> <https://groups.google.com/d/msgid/stacks-users/5075b43d-ce9a-461c-a11a-69119c7083c2%40googlegroups.com?utm_medium=email&utm_source=footer>.


--
Julian Catchen, Ph.D.
Assistant Professor
Department of Evolution, Ecology, and Behavior
Carl R. Woese Institute for Genomic Biology
University of Illinois, Urbana-Champaign
--
jcat...@illinois.edu; @jcatchen

Andrew Veale

unread,
Mar 21, 2020, 12:59:13 AM3/21/20
to Stacks
Hi Julian

Many thanks for the reply!  Yes, you're right - they demultiplexed the files, but left the barcodes on and adapters on for some reason.  Their pipeline is obviously odd - I hadn't thought to check since they were already demultiplexed.  I'm now running all the files through process RADtags to clean them - I'm sure they will map much better shortly!

Thanks!
Reply all
Reply to author
Forward
0 new messages