Filtered Genepop file?

709 views
Skip to first unread message

Michael MH

unread,
Jun 4, 2013, 5:11:57 PM6/4/13
to stacks...@googlegroups.com

Hi Julian et al., 

We are working with a species with lots of variation. There are many rare SNPs, and there are usually >1 SNP per RAD locus. For some of our analyses we would like to have a Genepop file filtered so we only have SNPs with MAF of, say > 0.1. Is there a smart way to achieve this? In populations we can choose to exclude SNPs below a certain MAF, but this only affects the Fst output, not the Genepop file. Of course, the Fst file could be used to create a whitelist that could subsequently be used to cut out the SNPs of interest from the Genepop file, but this seems tedious. Would it be possible in populations to add a filter to export only certain SNPs to the Genepop file or is there an alternative good approach?

Any help would be much appreciated.

Best regards,

Michael

Julian Catchen

unread,
Jun 5, 2013, 1:07:26 PM6/5/13
to stacks...@googlegroups.com
Hi Michael,

The current minor allele frequency filter acts when the summary statistics are
being calculated at a locus (other filters act earlier, after the genotypes are
initially loaded). If the MAF is below the threshold, the locus is not removed,
but the minor allele is removed and the locus is set to be homozygous.

These changes carry through to all the statistics calculated for the locus, but
are not carried through in the genepop output (the software goes back to the
original genotype calls and outputs the two alleles).

I can see two options:

1) change the behavior of the MAF filter so that if a locus has a MAF below the
threshold, we delete the locus, so it doesn't contribute at all to the results.
One question would be: do we delete it in one population, or in the whole dataset?

2) we change the GenePop/STRUCTURE/VCF outputs so that the locus is output as a
homozygote when it has a MAF below the threshold.

Beyond changing the behavior of Stacks, you could achieve your desired results
by using a simple script to parse the batch_1.sumstats.tsv file. This file will
tell you the frequency of the two alleles at each locus in each population. You
could create a BLACKLIST of loci you want to exclude based on a few rules and
then feed this back into populations using the -B flag. Then all loci < MAF
threshold will not be processed or output.

Best,

julian

Michael MH

unread,
Jun 5, 2013, 2:04:06 PM6/5/13
to stacks...@googlegroups.com

Hi Julian, 

Thanks for your excellent explanation and suggestions! I think option 1 would be preferable. The reason why I asked the question is that we want to make input files for BayeScan and similar methods and do not want a lot of SNPs that are rare across all populations. So, the criterion should be that MAF for the SNP should be < say 0.1 in all populations before it is excluded; if it is < 0.1 in all but one population and > 0.1 in a single population it might have a biological significance and it should be kept. To answer your question: for our purpose the SNP should be omitted from all samples.

Your suggestion to make a blacklist is very good. However, the particular species that we work with has an awful lot of variation, and most RAD loci have > 1 SNP. If I understand you correctly, we would exclude the whole RAD locus if it has a SNP below a certain MAF, even if there is another SNP with MAF above the threshold. In our specific case this would mean that we lose a lot of data. So, ideally we would want to omit a SNP with MAF below a certain threshold at a locus, but at the same time keep another SNP at the same RAD locus if MAF is above the threshold.

Best regards,

Michael

Julian Catchen

unread,
Jun 7, 2013, 5:25:23 PM6/7/13
to stacks...@googlegroups.com, m...@biology.au.dk
Hi Michael,

I will implement this in the near future. The main problem is that MAF is only
apparent after a lot of calculation and a lot of different structures have been
built internally to hold the data, so it becomes harder to filter since we have
to prune the data out from multiple places. So, I need a good strategy for the code.

Re: blacklists, SNPs within a single RAD locus should almost always be in
perfect linkage, so, if you are going to exclude one SNP due to a low frequency,
you would expect the other SNPs at that locus to have the same frequencies
within a population. But maybe your data indicates otherwise? Also, this could
be violated due to false-positive SNP calls, but we are working on fixing that
through a more general corrections mechanism to double check SNP calls at the
population level.

julian

Michael MH

unread,
Jun 10, 2013, 10:49:45 AM6/10/13
to stacks...@googlegroups.com

Hi Julian,

Thanks a lot! It will make life much easier for us when this is implemented. I am sure other people working with species showing very high levels of variation will also appreciate this. I hope it will not be too technically complicated. 

The species we work with presumably has a very high Ne, and that underlies the high amount of variation. So, in this case there could be both SNPs with relatively high MAF and others with very low MAF at the same RAD locus; due to the high Ne a lot of new mutations are retained within the population. We also have RAD data from other organisms, where there are more “normal” levels of variation, so we do not think the high level of variation we see in this particular species is due to technical artifacts (and we have of course done our best to filter for quality).

Best regards,

Michael


On Tuesday, June 4, 2013 11:11:57 PM UTC+2, Michael MH wrote:

Anna

unread,
Apr 16, 2014, 5:28:17 PM4/16/14
to stacks...@googlegroups.com
Hi Julian and Michael,
I'm going through the same problem. Have you come up with a script, also independent from Stacks is fine, to make a list of loci to discard? 
Thank you
Anna

Julian Catchen

unread,
Apr 19, 2014, 5:50:31 PM4/19/14
to stacks...@googlegroups.com, annat...@gmail.com
Hi Anna,

The minor allele frequency filter implemented in the populations program should apply to all file exports of the program, including GenePop. The code checks the allele frequency of a particular site, and if it is below the threshold a flag is set internally marking the site as "filtered." This flag is then respected in all the file export code. Please let me know if you experience any difficulty with it.

Best,

julian

Anna Tigano

unread,
Apr 22, 2014, 10:53:11 PM4/22/14
to Julian Catchen, stacks...@googlegroups.com
Hi Julian, 
thank you for your reply. I'm currently using v.1.06, is this filtering implemented in this version already?
Best
Anna
--
Anna Tigano, M.Sc.
Ph.D. Candidate
Friesen Lab
Queen's University, Department of Biology
Biosciences Complex, 4441
116 Barrie Street
Kingston, ON K7L 3N6
Canada

Julian Catchen

unread,
Apr 22, 2014, 11:06:17 PM4/22/14
to Anna Tigano, stacks...@googlegroups.com
Hi Anna,

Yes, it should be. Run the populations program without the filter and identify several loci in the batch_X.sumstats.tsv file that have a minor allele frequency below your threshold. Then run populations again using the filter and check that the locus was filtered in the sumstats file. Let me know if it isn't working.

julian

Anna Tigano

unread,
May 1, 2014, 6:22:00 PM5/1/14
to Julian Catchen, stacks-users
Hi Julian,
sorry for the late reply. I thought it had worked but I just realized that the sumstats files are filtered but not the input files for genepop and structure.
Thanks in advance
Anna

Julian Catchen

unread,
May 1, 2014, 6:45:49 PM5/1/14
to stacks...@googlegroups.com, annat...@gmail.com
Hi Anna,

Can you please provide an example of this behavior.

julian

Anna Tigano wrote:
> Hi Julian,
> --
> Stacks website: http://creskolab.uoregon.edu/stacks/
> ---
> You received this message because you are subscribed to the Google
> Groups "Stacks" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to stacks-users...@googlegroups.com
> <mailto:stacks-users...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

--
Julian M Catchen, Ph.D.
Institute of Ecology and Evolution
University of Oregon
--
jcat...@uoregon.edu
http://www.uoregon.edu/~jcatchen/

Anna Tigano

unread,
May 2, 2014, 9:28:15 AM5/2/14
to Julian Catchen, stacks-users
Hi Julian,
here they are. Let me know if you need more files. Also, maybe you've already corrected this in newer versions (I'm using 1.06) but I noticed two little problems in the genepop input files. 
1. loci names should be divided by comma and space (instead of only comma)
2. loci names tend to merge together towards the end of the line, basically commas tend to disappear among the last ~10 loci




For more options, visit https://groups.google.com/d/optout.

--
Julian M Catchen, Ph.D.
Institute of Ecology and Evolution
University of Oregon
--
jcat...@uoregon.edu
http://www.uoregon.edu/~jcatchen/
batch_2.fst_1-2.tsv
batch_2.genepop

Anna Tigano

unread,
May 2, 2014, 10:00:39 AM5/2/14
to Julian Catchen, stacks-users
Maybe this is better...
batch_2.sumstats.tsv

Anna Tigano

unread,
May 2, 2014, 10:07:04 AM5/2/14
to Julian Catchen, stacks-users
I'm really sorry, I figured out what the problem is. I just got confused among files.
Thank you for your help anyway!
Best
Anna
Reply all
Reply to author
Forward
0 new messages