what does "phasing" in gstacks (log) mean?

toczyd...@wisc.edu

Mar 15, 2018, 3:30:22 PM
to Stacks
Hello,

I have a high level of first-hand experience with both Stacks and ipyrad. I am not sure how to interpret the phasing stats below from the gstacks log. Can someone explain how the term "phasing" is being used here? (I am running de novo assemblies with GBS data from a diploid organism.)

Thanks!
-RT

output from end of gstacks stdout:
Genotyped 1698556 loci:
  effective per-sample coverage: mean=20.1x, stdev=4.0x, min=10.4x, max=43.8x (per locus where sample is present)
  mean number of sites per locus: 89.0
  a consistent phasing was found for 377594 of out 599609 (63.0%) diploid loci needing phasing

Nicolas Rochette

Mar 15, 2018, 5:21:49 PM
to Stacks

Hi toczydlowski,

If a sample is heterozygous for two SNPs in a locus, say A/G at position 45 (counting from the cutsite) and C/T at position 70, the haplotypes (two, because the individual is diploid) could be ...A...C.../...G...T... or ...A...T.../...G...C... : phasing the hets is figuring out which haplotypes exist.

This fails when there's evidence for more than two haplotypes, i.e. for AC, GT and AT at the same time. This shouldn't happen for a diploid sample and indicates that there's a problem with the SNP calls for that sample.
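
To make the idea concrete, here is a minimal Python sketch of the consistency check described above. It is only an illustration, not the actual gstacks implementation: the function names (`candidate_phasings`, `consistent_phasings`) and the representation of haplotypes as short strings of alleles are assumptions made for this example. It lists every way the heterozygous SNPs of one sample could be split into two haplotypes, then keeps the splits that account for all haplotypes seen in the reads.

```python
from itertools import product


def candidate_phasings(het_sites):
    """Enumerate all possible haplotype pairs for one diploid sample.

    het_sites is a list of (allele1, allele2) tuples, one per heterozygous
    SNP in the locus, in positional order. (Hypothetical representation.)
    """
    first, rest = het_sites[0], het_sites[1:]
    phasings = []
    # Fix the orientation of the first SNP; each later SNP can be flipped or not.
    for flips in product([False, True], repeat=len(rest)):
        hap1, hap2 = first[0], first[1]
        for (a, b), flip in zip(rest, flips):
            if flip:
                a, b = b, a
            hap1 += a
            hap2 += b
        phasings.append(frozenset((hap1, hap2)))
    return phasings


def consistent_phasings(het_sites, observed_haplotypes):
    """Keep the phasings whose two haplotypes cover every observed haplotype."""
    observed = set(observed_haplotypes)
    return [p for p in candidate_phasings(het_sites) if observed <= p]


# The example from this message: A/G at position 45 and C/T at position 70.
hets = [("A", "G"), ("C", "T")]
print(len(consistent_phasings(hets, {"AC", "GT"})))        # 1: phasing succeeds
print(len(consistent_phasings(hets, {"AC", "GT", "AT"})))  # 0: three haplotypes, no consistent phasing
```

With evidence for only AC and GT, exactly one pairing fits; once AT also shows up, no pair of haplotypes can explain the observations, which is the failure case counted in the log.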

The 63% you get here is indeed fairly low. Could you send me the entire log file so I have an idea of your setup? (Note: removing PCR duplicates, if you have paired-end data, usually improves that figure by a lot.)

Best,

Nicolas

toczyd...@wisc.edu

Mar 15, 2018, 6:54:33 PM
to Stacks
Hi Nicolas,

Thanks for the response and info!

I forgot to specify earlier, I have single-end reads.

A few follow-up questions:
What happens to the ~37% of loci that could not be phased?
Is the ~600,000 loci value printed in the phasing line of the gstacks output (pasted previously) the number of loci times the number of individuals, or does each individual have ~600,000 loci at this step?
Does gstacks allow 2 alleles within diploid individuals but more than 2 across individuals? E.g., could A, T, and G all be segregating at a locus, with only two of those alleles within any given individual?
If I use --lnl_lim -50 in populations, does this "protect" me from whatever the phasing issue is?

I am running this on a cluster using a job scheduler, so I have many individual log files (one per sample and/or per Stacks module). Do you want them all, or were you asking just for the full log from gstacks? The params of main interest: ustacks m5, maxlocus2, M3, cstacks n3. These params were originally optimized on a subset of replicate samples using the Mastretta-Yanes et al. approach and Stacks 1.30.

Thanks,
Rachel 