Low coverage but high number of shared loci

601 views
Skip to first unread message

Brenna Levine

unread,
Jan 21, 2020, 10:21:53 AM1/21/20
to Stacks
Hi Julian,

A couple months ago I emailed you about a ddRAD library that we paired-end sequenced that ended up with a high number of reads that didn't stacks. We have since concluded that this was in part due to contamination with the DNA if an obligate endosymbiont. Your recommendation was to align these to the reference genome for the species and take the data through gstacks and populations using stacks version 2.41 to see coverage and the proportion of shared loci. It took me a while to get around to it because I had a baby! But now I'm back and I'm hoping that I could get your perspective on our results.

We aligned the reads for each individual to the reference genome of the species using BWA and took the alignments through gstacks. Here is the gstacks log:

gstacks v2.41, executed 2020-01-16 20:21:49 (zlib-1.2.11)
gstacks -I ./ -O ../stacks.bwa/ -M ../../../info/popmap.ORD.tsv -t 8
Locus/sample distributions will be written to '../stacks.bwa/gstacks.log.distribs'.

Configuration for this run:
  Input mode: reference-based
  Population map: '../../../info/popmap.ORD.tsv'
  Input files: 67, e.g. './ORD01-01-F2.bam'
  Output to: '../stacks.bwa/'
  Model: marukilow (var_alpha: 0.01, gt_alpha: 0.05)

Reading BAM headers...
Processing all loci...
1K...
2K...
5K...
10K...
20K...
50K...
100K...
200K...
500K...
1000K...
done.

Read 685861016 BAM records:
  kept 324434113 primary alignments (57.0%), of which 161804808 reverse reads
  skipped 78383461 primary alignments with insufficient mapping qualities (13.8%)
  skipped 129298499 excessively soft-clipped primary alignments (22.7%)
  skipped 37551369 unmapped reads (6.6%)
  skipped some suboptimal (secondary/supplementary) alignment records

  Per-sample stats (details in 'gstacks.log.distribs'):
    read 10236731.6 records/sample (3423793-19432142)
    kept 25.9%-70.5% of these

Built 1639926 loci comprising 162629305 forward reads and 96709526 matching paired-end reads; mean insert length was 284.5 (sd: 40.6).

Genotyped 1639926 loci:
  effective per-sample coverage: mean=7.0x, stdev=1.9x, min=3.0x, max=13.0x
  mean number of sites per locus: 221.0
  a consistent phasing was found for 1345641 of out 1412813 (95.2%) diploid loci needing phasing

gstacks is done.


As we expected, our coverage was low (mean = 7x). However, when we took the gstacks output through populations, we got a high number of shared loci when setting r=0.95. The following is our populations.sumstats_summary.tsv output:

# Variant positions
# Pop ID Private Num_Indv Var StdErr P Var StdErr Obs_Het Var StdErr Obs_Hom Var StdErr Exp_Het Var StdErr Exp_Hom Var StdErr Pi Var StdErr Fis Var StdErr
ORD 0 65.56448 1.29145 0.00562 0.91055 0.01488 0.00060 0.07347 0.00804 0.00044 0.92653 0.00804 0.00044 0.13314 0.02315 0.00075 0.86686 0.02315 0.00075 0.13416 0.02351 0.00076 0.29155 0.10627 0.00562
# All positions (variant and fixed)
# Pop ID Private Sites Variant_Sites Polymorphic_Sites %Polymorphic_Loci Num_Indv Var StdErr P Var StdErr Obs_Het Var StdErr Obs_Hom Var StdErr Exp_Het Var StdErr Exp_Hom Var StdErr Pi Var StdErr Fis Var StdErr
ORD 0 16352266 40919 40919 0.25023 65.89273 1.29439 0.00028 0.99978 0.00006 0.00000 0.00018 0.00003 0.00000 0.99982 0.00003 0.00000 0.00033 0.00010 0.00000 0.99967 0.00010 0.00000 0.00034 0.00010 0.00000 0.00073 0.00048 0.00028

And here is our populations log:

populations v2.41, executed 2020-01-17 14:10:48 (zlib-1.2.11)
populations -P ../stacks.bwa/ -O ../stacks.bwa/r95/ -M ../../../info/popmap.ORD.tsv -r 0.95 --hwe --structure --genepop --write-single-snp -t 8
Locus/sample distributions will be written to '../stacks.bwa/r95/populations.log.distribs'.
populations parameters selected:
  Percent samples limit per population: 0.95
  Locus Population limit: 1
  Percent samples overall: 0
  Minor allele frequency cutoff: 0
  Maximum observed heterozygosity cutoff: 1
  Applying Fst correction: none.
  Pi/Fis kernel smoothing: off
  Fstats kernel smoothing: off
  Bootstrap resampling: off

Parsing population map...
The population map contained 67 samples, 1 population(s), 1 group(s).
Working on 67 samples.
Working on 1 population(s):
    ORD: ORD01-01-F2, ORD01-02-M, ORD01-03-N, ORD01-04-F, ORD01-06-F, ORD01-07-F, ORD01-08-M, ORD01-10-N, ORD01-12-F, ORD01-14-F, 
         ORD01-15-N, ORD01-18-M, ORD01-21-N, ORD01-23-N, ORD03-01-M, ORD03-02-F, ORD03-04-F, ORD03-05-F, ORD04-01-N, ORD04-02-M, ORD04-03-M, 
         ORD04-05-F, ORD04-06-F, ORD04-09-F, ORD04-11-F, ORD04-13-M, ORD05-01-M, ORD05-02-N, ORD05-03-F, ORD05-04-M, ORD06-01-M, ORD06-04-F, 
         ORD06-05-F, ORD06-06-F, ORD06-07-F, ORD06-09-F, ORD06-10-N, ORD06-12-M, ORD06-13-M, ORD07-01-M, ORD07-02-F, ORD07-03-N, ORD08-01-F, 
         ORD08-02-F, ORD08-03-N, ORD08-04-N, ORD08-05-M, ORD08-06-M, ORD08-07-F, ORD08-08-N, ORD08-09-M, ORD08-10-F, ORD08-11-N, ORD08-12-F, 
         ORD08-13-M, ORD08-14-N, ORD08-15-F, ORD08-17-N, ORD08-19-N, ORD08-20-M, ORD08-21-N, ORD08-22-F, ORD08-23-F, ORD08-24-F, ORD08-25-F, 
         ORD08-26-N, ORD08-27-F
Working on 1 group(s) of populations:
    defaultgrp: ORD

Polymorphic sites in Structure format will be written to '../stacks.bwa/r95/populations.structure'
Polymorphic sites in GenePop format will be written to '../stacks.bwa/r95/populations.snps.genepop'
Warning: Genepop: The order in which samples appear was modified (as the input population map is not sorted).Genotyping markers will be written to '../stacks.bwa/r95/populations.markers.tsv'
Raw Genotypes/Haplotypes will be written to '../stacks.bwa/r95/populations.haplotypes.tsv'
Population-level summary statistics will be written to '../stacks.bwa/r95/populations.sumstats.tsv'
Population-level haplotype summary statistics will be written to '../stacks.bwa/r95/populations.hapstats.tsv'

Processing data in batches:
  * load a batch of catalog loci and apply filters
  * compute SNP- and haplotype-wise per-population statistics
  * compute SNP- and haplotype-wise deviation from HWE
  * write the above statistics in the output files
  * export the genotypes/haplotypes in specified format(s)
More details in '../stacks.bwa/r95/populations.log.distribs'.
Now processing...
JHUN02000516.2 
KZ486247.1 
KZ486248.1 
KZ486249.1 
KZ486250.1 
KZ486251.1 
..................
Removed 1585388 loci that did not pass sample/population constraints from 1639926 loci.
Kept 54538 loci, composed of 16514777 sites; 131190 of those sites were filtered, 40919 variant sites remained.
    16462355 genomic sites, of which 48949 were covered by multiple loci (0.3%).
Mean genotyped sites per locus: 299.83bp (stderr 0.12).

Population summary statistics (more detail in populations.sumstats_summary.tsv):
  ORD: 65.564 samples per locus; pi: 0.13416; all/variant/polymorphic sites: 16352266/40919/40919; private alleles: 0

Number of variable sites found to be significantly out of Hardy-Weinberg equilibrium (<0.05):
  ORD: 18384
Number of loci found to be significantly out of Hardy-Weinberg equilibrium (<0.05):
  ORD: 29090
(more detail in populations.sumstats.tsv and populations.hapstats.tsv)
Populations is done.

My question is: even though these loci have low coverage, can we have reasonable confidence that they are not spurious since each is shared among 95% of individuals in the population? We need to avoid additional sequencing if we can.

Thanks, and let me know if additional information would be helpful!

Brenna Levine
University of Tulsa

Catchen, Julian

unread,
Jan 21, 2020, 6:16:29 PM1/21/20
to stacks...@googlegroups.com, Brenna Levine
Hi Brenna,

Congratulations on the birth of your new child!

With loci shared among 95% of your samples, I would expect them to
belong to the species you want. Looking at gstacks, only 57% of the data
aligned well to your reference genome, so I would still be curious what
the other 43% aligns to? If you know the symbiont and there is sequence
out there, you could use BLAST to search your 95%-shared RAD loci
against it to see if they match. Or, you could do the opposite, and
align the 43% of reads that didn't go to your reference genome against
the symbiont genome to make sure you are filtering that stuff out (or
you could use a proxy genome).

The main downside of low coverage (besides allelic drop out, which you
appear to not have too much of), is a homozygous bias in your SNP calls.
Since the mean depth is 7x, there will be many loci with less coverage,
and the SNP model will fail to call the heterozygotes correctly at a
higher rate as coverage decreases (some of these will be miscalled as
homozygous, but only because one of the two true alleles was sequenced
due to low coverage). But, there is not much you can do about this
except include the caveat in your results and not base your particular
biological conclusions on it!

Best,

julian

Brenna Levine wrote on 1/21/20 9:21 AM:
> Hi Julian,
>
> A couple months ago I emailed you about a ddRAD library that we
> paired-end sequenced that ended up with a high number of reads that
> didn't stacks. We have since concluded that this was in part due to
> contamination with the DNA if an obligate endosymbiont. Your
> recommendation was to align these to the reference genome for the
> species and take the data through gstacks and populations using stacks
> version 2.41 to see coverage and the proportion of shared loci. It took
> me a while to get around to it because I had a baby! But now I'm back
> and I'm hoping that I could get your perspective on our results.
>
> We aligned the reads for each individual to the reference genome of the
> species using BWA and took the alignments through gstacks. Here is the
> gstacks log:
>
> /gstacks v2.41, executed 2020-01-16 20:21:49 (zlib-1.2.11)/
> /gstacks -I ./ -O ../stacks.bwa/ -M ../../../info/popmap.ORD.tsv -t 8/
> /Locus/sample distributions will be written to
> '../stacks.bwa/gstacks.log.distribs'./
> /
> /
> /Configuration for this run:/
> /  Input mode: reference-based/
> /  Population map: '../../../info/popmap.ORD.tsv'/
> /  Input files: 67, e.g. './ORD01-01-F2.bam'/
> /  Output to: '../stacks.bwa/'/
> /  Model: marukilow (var_alpha: 0.01, gt_alpha: 0.05)/
> /
> /
> /Reading BAM headers.../
> /Processing all loci.../
> /1K.../
> /2K.../
> /5K.../
> /10K.../
> /20K.../
> /50K.../
> /100K.../
> /200K.../
> /500K.../
> /1000K.../
> /done./
> /
> /
> /Read 685861016 BAM records:/
> /  kept 324434113 primary alignments (57.0%), of which 161804808 reverse
> reads/
> /  skipped 78383461 primary alignments with insufficient mapping
> qualities (13.8%)/
> /  skipped 129298499 excessively soft-clipped primary alignments (22.7%)/
> /  skipped 37551369 unmapped reads (6.6%)/
> /  skipped some suboptimal (secondary/supplementary) alignment records/
> /
> /
> /  Per-sample stats (details in 'gstacks.log.distribs'):/
> /    read 10236731.6 records/sample (3423793-19432142)/
> /    kept 25.9%-70.5% of these/
> /
> /
> /Built 1639926 loci comprising 162629305 forward reads and 96709526
> matching paired-end reads; mean insert length was 284.5 (sd: 40.6)./
> /
> /
> /Genotyped 1639926 loci:/
> /  effective per-sample coverage: mean=7.0x, stdev=1.9x, min=3.0x,
> max=13.0x/
> /  mean number of sites per locus: 221.0/
> /  a consistent phasing was found for 1345641 of out 1412813 (95.2%)
> diploid loci needing phasing/
> /
> /
> /gstacks is done./
>
>
> As we expected, our coverage was low (mean = 7x). However, when we took
> the gstacks output through populations, we got a high number of shared
> loci when setting r=0.95. The following is our
> populations.sumstats_summary.tsv output:
>
> /# Variant positions/
> /# Pop
> IDPrivateNum_IndvVarStdErrPVarStdErrObs_HetVarStdErrObs_HomVarStdErrExp_HetVarStdErrExp_HomVarStdErrPiVarStdErrFisVarStdErr/
> /ORD065.564481.291450.005620.910550.014880.000600.073470.008040.000440.926530.008040.000440.133140.023150.000750.866860.023150.000750.134160.023510.000760.291550.106270.00562/
> /# All positions (variant and fixed)/
> /# Pop
> IDPrivateSitesVariant_SitesPolymorphic_Sites%Polymorphic_LociNum_IndvVarStdErrPVarStdErrObs_HetVarStdErrObs_HomVarStdErrExp_HetVarStdErrExp_HomVarStdErrPiVarStdErrFisVarStdErr/
> /ORD01635226640919409190.2502365.892731.294390.000280.999780.000060.000000.000180.000030.000000.999820.000030.000000.000330.000100.000000.999670.000100.000000.000340.000100.000000.000730.000480.00028/
>
> And here is our populations log:
> /
> /
> /populations v2.41, executed 2020-01-17 14:10:48 (zlib-1.2.11)/
> /populations -P ../stacks.bwa/ -O ../stacks.bwa/r95/ -M
> ../../../info/popmap.ORD.tsv -r 0.95 --hwe --structure --genepop
> --write-single-snp -t 8/
> /Locus/sample distributions will be written to
> '../stacks.bwa/r95/populations.log.distribs'./
> /populations parameters selected:/
> /  Percent samples limit per population: 0.95/
> /  Locus Population limit: 1/
> /  Percent samples overall: 0/
> /  Minor allele frequency cutoff: 0/
> /  Maximum observed heterozygosity cutoff: 1/
> /  Applying Fst correction: none./
> /  Pi/Fis kernel smoothing: off/
> /  Fstats kernel smoothing: off/
> /  Bootstrap resampling: off/
> /
> /
> /Parsing population map.../
> /The population map contained 67 samples, 1 population(s), 1 group(s)./
> /Working on 67 samples./
> /Working on 1 population(s):/
> /    ORD: ORD01-01-F2, ORD01-02-M, ORD01-03-N, ORD01-04-F, ORD01-06-F,
> ORD01-07-F, ORD01-08-M, ORD01-10-N, ORD01-12-F, ORD01-14-F, /
> /         ORD01-15-N, ORD01-18-M, ORD01-21-N, ORD01-23-N, ORD03-01-M,
> ORD03-02-F, ORD03-04-F, ORD03-05-F, ORD04-01-N, ORD04-02-M, ORD04-03-M, /
> /         ORD04-05-F, ORD04-06-F, ORD04-09-F, ORD04-11-F, ORD04-13-M,
> ORD05-01-M, ORD05-02-N, ORD05-03-F, ORD05-04-M, ORD06-01-M, ORD06-04-F, /
> /         ORD06-05-F, ORD06-06-F, ORD06-07-F, ORD06-09-F, ORD06-10-N,
> ORD06-12-M, ORD06-13-M, ORD07-01-M, ORD07-02-F, ORD07-03-N, ORD08-01-F, /
> /         ORD08-02-F, ORD08-03-N, ORD08-04-N, ORD08-05-M, ORD08-06-M,
> ORD08-07-F, ORD08-08-N, ORD08-09-M, ORD08-10-F, ORD08-11-N, ORD08-12-F, /
> /         ORD08-13-M, ORD08-14-N, ORD08-15-F, ORD08-17-N, ORD08-19-N,
> ORD08-20-M, ORD08-21-N, ORD08-22-F, ORD08-23-F, ORD08-24-F, ORD08-25-F, /
> /         ORD08-26-N, ORD08-27-F/
> /Working on 1 group(s) of populations:/
> /    defaultgrp: ORD/
> /
> /
> /Polymorphic sites in Structure format will be written to
> '../stacks.bwa/r95/populations.structure'/
> /Polymorphic sites in GenePop format will be written to
> '../stacks.bwa/r95/populations.snps.genepop'/
> /Warning: Genepop: The order in which samples appear was modified (as
> the input population map is not sorted).Genotyping markers will be
> written to '../stacks.bwa/r95/populations.markers.tsv'/
> /Raw Genotypes/Haplotypes will be written to
> '../stacks.bwa/r95/populations.haplotypes.tsv'/
> /Population-level summary statistics will be written to
> '../stacks.bwa/r95/populations.sumstats.tsv'/
> /Population-level haplotype summary statistics will be written to
> '../stacks.bwa/r95/populations.hapstats.tsv'/
> /
> /
> /Processing data in batches:/
> /  * load a batch of catalog loci and apply filters/
> /  * compute SNP- and haplotype-wise per-population statistics/
> /  * compute SNP- and haplotype-wise deviation from HWE/
> /  * write the above statistics in the output files/
> /  * export the genotypes/haplotypes in specified format(s)/
> /More details in '../stacks.bwa/r95/populations.log.distribs'./
> /Now processing.../
> /JHUN02000516.2 /
> /KZ486247.1 /
> /KZ486248.1 /
> /KZ486249.1 /
> /KZ486250.1 /
> /KZ486251.1 /
> /................../
> /Removed 1585388 loci that did not pass sample/population constraints
> from 1639926 loci./
> /Kept 54538 loci, composed of 16514777 sites; 131190 of those sites were
> filtered, 40919 variant sites remained./
> /    16462355 genomic sites, of which 48949 were covered by multiple
> loci (0.3%)./
> /Mean genotyped sites per locus: 299.83bp (stderr 0.12)./
> /
> /
> /Population summary statistics (more detail in
> populations.sumstats_summary.tsv):/
> /  ORD: 65.564 samples per locus; pi: 0.13416; all/variant/polymorphic
> sites: 16352266/40919/40919; private alleles: 0/
> /
> /
> /Number of variable sites found to be significantly out of
> Hardy-Weinberg equilibrium (<0.05):/
> /  ORD: 18384/
> /Number of loci found to be significantly out of Hardy-Weinberg
> equilibrium (<0.05):/
> /  ORD: 29090/
> /(more detail in populations.sumstats.tsv and populations.hapstats.tsv)/
> /Populations is done./
> /
> /
> My question is: even though these loci have low coverage, can we have
> reasonable confidence that they are not spurious since each is shared
> among 95% of individuals in the population? We need to avoid additional
> sequencing if we can.
>
> Thanks, and let me know if additional information would be helpful!
>
> Brenna Levine
> University of Tulsa
>
> --
> Stacks website: http://catchenlab.life.illinois.edu/stacks/
> ---
> You received this message because you are subscribed to the Google
> Groups "Stacks" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to stacks-users...@googlegroups.com
> <mailto:stacks-users...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/stacks-users/038fff2f-b44b-4d13-a1fc-fc0fec81a6e8%40googlegroups.com
> <https://groups.google.com/d/msgid/stacks-users/038fff2f-b44b-4d13-a1fc-fc0fec81a6e8%40googlegroups.com?utm_medium=email&utm_source=footer>.


--
Julian M Catchen, Ph.D.
Assistant Professor
Department of Evolution, Ecology, and Behavior
Carl R. Woese Institute for Genomic Biology
University of Illinois, Urbana-Champaign
--
jcat...@illinois.edu; @jcatchen

Brenna Levine

unread,
Jan 24, 2020, 12:16:45 PM1/24/20
to Stacks
Hi Julian,

This is so helpful. Thanks a lot!

Brenna
Reply all
Reply to author
Forward
0 new messages