Stacks Population Genetics Questions

1,679 views
Skip to first unread message

Paul Richards

unread,
Dec 4, 2012, 12:08:09 PM12/4/12
to stacks...@googlegroups.com
Hi everyone,

I have a few questions regarding different aspects of getting a good set of population genetics markers out of STACKS.

(1) Could anyone provide advise on what sort of thresholds they find suitable for the -M cluster distance in ustacks and the -n threshold in cstacks for creating population datasets? Previously I have used 5 and 3 respectively for a separate mapping project but realise it is particularly important to minimise paralogous loci for a population dataset. 

(2) The populations module is very useful, particularly with the options to output the different file formats. However, as far as I am aware I cannot choose loci with a specific number of alleles (e.g. bi-allelic markers would be preferable). Equally I would like to explore exporting loci with a maximum allele frequency specified as I have read in a recent cichlid paper that this can be another way to filter out paralogous loci (e.g. loci with a very high heterozygosity). 

Would running the populations module first, reimporting the results back into an sql database and then running the export.sql script be an option, or is the latter only suitable for linkage map data?

Alternatively, does anyone have any scripts for doing this using the files output by the populations module, that they would be able to share? My scripting skills are very limited at the moment so I wouldn't know where to start myself!

(3) Finally, (apologies if this is basic) but could someone just clarify what exactly are the markers included in the structure and phylip output files? I have presumed they are RAD loci haplotypes and not individual SNPs. Therefore in the case of the phylip files is just one 'token' SNP from each RAD loci output, rather than every SNP found?

Thanks, this group is always very helpful!

Paul

Emiliano

unread,
Dec 4, 2012, 7:28:50 PM12/4/12
to stacks...@googlegroups.com
Hi Paul, 

this is a really important question. My experience is that the fine-tuning of these parameters when analyzing a population with a certain degree of diversity and a complex structure is not really straightforward. I tried some combos and then I decided for a -M 3 and a -n 5 (7 when including a sibling species). 

My idea (but I'll appreciate some advice on it!) is that when -M is too low, two different loci can be created in the single heterozygote individual (exceeding number of matching parents per locus when building the catalog), if -M is too high a lower number of loci will be created per individual as paralogous loci that match together will be discarded (only two clusters of reads per locus are allowed in ustacks). 

When assembling the catalog (-n option), a too high threshold should cause paralogous loci to match the same locus (again exceeding number of matching parents per locus) while a too low one will prevent high-diversity loci to match together causing an overall reduction in the catalog overlap across multiple individuals.   

The right setting depends on the actual diversity that reflects population history in the individuals you want to analyze together. 

Which kind of loci you want to filter out for downstream applications? After exporting a summary table (csv or xls), biallelic loci can be easily filtered as you will see this information (10th column?). I am a script-beginner too but I can try to help if you tell me what you need.

Cheers,
Emiliano

Paul Richards

unread,
Dec 5, 2012, 1:34:56 PM12/5/12
to stacks...@googlegroups.com
Thank you for your advice Emiliano. If I am correct in my understanding that the -M and -n distances work by comparing each stack pairwise, then I guess no more than -M 2 -n  3 would be suitable, given that I am working with 50bp reads - if you or anyone else could clarify that my understanding is correct I would be grateful!

I see now that filtering out biallelic markers should be fairly straightforward, but thank you for the offer of help.

One other question I have - is there is any reason to have the --max_locus_stacks  parameter in ustacks set any higher than 2 in a diploid organism? I assume this refers to the resulting stacks (alleles) at a locus in an individual after multiple stacks have merged, as when I set this to 2 with -M 2 I have up to 4 SNPs for a given locus.  

Thanks, Paul




--
For more options or to unsubscribe: http://groups.google.com/group/stacks-users
Stacks website: http://creskolab.uoregon.edu/stacks/

Julian Catchen

unread,
Dec 5, 2012, 2:14:47 PM12/5/12
to stacks...@googlegroups.com
Hi Paul,

On 12/5/12 10:34 AM, Paul Richards wrote:
> One other question I have - is there is any reason to have the
> --max_locus_stacks parameter in ustacks set any higher than 2 in a diploid
> organism? I assume this refers to the resulting stacks (alleles) at a locus in
> an individual after multiple stacks have merged, as when I set this to 2 with -M
> 2 I have up to 4 SNPs for a given locus.

Intuitively, you are correct, you should not have more than 2 stacks per locus.
However, Illumina data are very noisy and full of error and frequently you will
get convergent errors in reads that will form small stacks. These are not a big
deal as the SNP model can handle the error. However, if you restrict
--max_locus_stacks to 2, you will blacklist these loci. By default,
--max_locus_stacks is set to 3. Of course, I would recommend running your data
with the parameter set to both values, you can then compare the effect.

Best,

julian

Paul Richards

unread,
Dec 5, 2012, 2:40:45 PM12/5/12
to stacks...@googlegroups.com
Thanks Julian. However, I am still unsure as to how I end up with up to 4 SNPs at a locus if I am only allowing a distance of two between a maximum of two stacks?
Furthermore, would you mind briefly describing the exact process by which mismatching works when building the catalog?

Would you also mind commenting on this original question I posted...What are the markers included in the structure and phylip output files? I have presumed they are RAD loci haplotypes and not individual SNPs that are variable between populations. Therefore in the case of the phylip files is just one 'token' SNP from each RAD loci output, rather than every SNP found?

Sorry to shower you with questions!

Cheers, Paul

P.S. Really appreciate the work you have done on the populations module of stacks - it has been very useful for us!

Julian Catchen

unread,
Dec 7, 2012, 2:38:38 AM12/7/12
to stacks...@googlegroups.com
Hi Paul,

On 12/5/12 11:40 AM, Paul Richards wrote:
> I am still unsure as to how I end up with up to 4 SNPs
> at a locus if I am only allowing a distance of two between a maximum of two stacks?

Are you ending up with 4 SNPs at a locus in an individual sample, or in the
locus in the catalog (which is the synthesis of the data in all samples)? If
it's the latter, then I assume you are setting -M 2 and -n 2. This allows two
differences within an individual (handled by ustacks), and then up to two fixed
differences between individuals (as determined by cstacks when making the catalog).

> Furthermore, would you mind briefly describing the exact process by which
> mismatching works when building the catalog?

Check out the original Stacks paper. We use the same k-mer matching algorithm in
both ustacks and cstacks.

> Would you also mind commenting on this original question I posted...What are the
> markers included in the structure and phylip output files? I have presumed they
> are RAD loci haplotypes and not individual SNPs that are variable between
> populations. Therefore in the case of the phylip files is just one 'token' SNP
> from each RAD loci output, rather than every SNP found?
>

The data in the Phylip file are individual SNPs that are fixed in individual
samples but vary across samples (we need fixed differences to satisfy
phylogenetic models when building trees). These SNPs are concatenated together
to construct the Phylip file (a log file is also generated that specifies which
loci/nucleotide each SNP is taken from).

The Structure file is the first SNP from each catalog locus (since any
subsequent SNPs at the locus are in linkage to the first SNP and STRUCTURE does
not want linked data). You could convert the batch_X.haplotypes.tsv file into a
STRUCTURE format without too much difficulty if you want haplotype data.

These issues are covered in detail in our upcoming Stacks 1.0 paper.

Best,

julian
Reply all
Reply to author
Forward
0 new messages