SNP filtering after STACKS with vcftools

99 views
Skip to first unread message

Laura Maria Eslava

unread,
Jan 3, 2023, 9:06:55 AM1/3/23
to Stacks
Hi everyone, 

I did all of my stacks process for ddRAD data and obtained a SNPs file with vcf format, I'm trying to filter some of the SNPs in this file using vcftools but every filter I apply will result in no snps being discarded. I think my vcf file is not in the format vcftools requires, particularly the order of the FORMAT column which goes:
GT:DP:AD:GQ:GL

Genotype : Total Depth : Total Depth for Each Allele : Genotype Quality : Genotype Likelihood

I'm not sure if this format is the problem, is there any way to change this format to one that vcftools can filter? Or any other filtering program I could use specific for the STACKS output? 

I also made a file of SNPs I would like to keep due to not being able to filter them initially but I'm not sure on how to input this file to filter them out of the original vcf.

Please help, I'm new to this and I can't find how to do it properly. 

Thanks,
Laura

Catchen, Julian

unread,
Jan 3, 2023, 11:19:46 AM1/3/23
to stacks...@googlegroups.com

Hi Laura,

 

The VCF format specifies that the author of the file has to define the fields that are included in the header. Stacks does this, by default including the following fields:

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allele Depth">

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">

##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">

##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

 

And, in each entry of the file, the included fields are listed in column 9, e.g. “GT:DP:AD:GQ:GL”.

 

So, you can verify the format of your VCF file directly. When you say that your specified filters “result in no snps being discarded” you can manually check the filtering. That is, if you look at any of the outputs unfiltered (I recommend the populations.sumstats.tsv file), you should be able to identify specific SNPs that should be filtered based on some criteria you are interested in. If you would like specific help on a filtering problem, or think you have found a bug, then please find an example of a SNP you think should be filtered but is not filtered and send that to us.

 

I would also mention that the populations program can do a lot of filtering itself (though not by allele depth, which we think is not a great filter).

 

If you made a ‘whitelist’ of SNPs, please tell us how you did it, what it looks like, and why it not working (in principles, you can use a whitelist by just specifying --whitelist filepath to the populations program).

 

Best,

 

julian

Nguyen Vu

unread,
Jan 3, 2023, 6:34:25 PM1/3/23
to stacks...@googlegroups.com
Hi Laura,

Bcftools could be the one that you are looking for, for instance.
bcftools filter -e 'MEAN(FORMAT/DP)<=2 ||  MEAN(FORMAT/GQ) <= 20' original.vcf > filte.vcf
MEAN(FORMAT/DP)<=2 calculates mean of DP over genotypes and exclude (-e) the SNP with mean DP <= 2 while FORMAT/DP<=2 will remove the site if any sample with DP <=2.
Hope it helps.
Cheers
Vu

--
Stacks website: http://catchenlab.life.illinois.edu/stacks/
---
You received this message because you are subscribed to the Google Groups "Stacks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stacks-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/stacks-users/SN6PR11MB2557165E6C478E81146F8BBFA7F49%40SN6PR11MB2557.namprd11.prod.outlook.com.
Reply all
Reply to author
Forward
0 new messages