warning message while using a vcf file as input file for "populations" fstat analyses


Carol

Oct 2, 2018, 4:49:06 AM
to Stacks
Hi all,

Has anyone tried to estimate fstats with the populations module using a vcf file as input?

This is the command I used:
/home/buitracn/RADseq/tools/stacks-2.0/populations --in_vcf ./spis.296.all.filtered.markers.recode.vcf  -M pop_map_spis296 -O ./ --fstats --fst_correction p_value --vcf_haplotypes -t 40

While populations was loading the info from the vcf file I got several Error and Warning messages such as:

Error: AD field is missing.
Warning: Malformed sample field '0/0:17:.:13:-0.06,-0.96,-1.61'.
Warning: Discarding VCF SNP record 'Spis.scaffold645|size163153:117567'.

At the end of the analysis, I got this:

Found 21272 SNP records in file './spis.296.all.filtered.markers.recode.vcf'. (Skipped 0 already filtered-out SNPs and 0 non-SNP records ; more with --verbose.)

Removed 0 loci that did not pass sample/population constraints from 21272 loci.
Kept 21133 loci, composed of 21133 sites; 0 of those sites were filtered, 21133 variant sites remained.
    21133 genomic sites, of which 0 were covered by multiple loci (0.0%).
Mean genotyped sites per locus: 1.00bp (stderr 0.00).


What happened to those 139 SNPs/loci that are not reported?

Can someone advise on how to interpret these results?

Sincerely,

Carol


Han Xiao

Feb 22, 2020, 10:47:09 AM
to Stacks
Hi Carol,

I just got the same error, and I am wondering whether you found a cause or a solution for it. Thank you very much!

Best regards,
Han Xiao

On Tuesday, October 2, 2018 at 8:49:06 AM UTC, Carol wrote:

Julian Catchen

Feb 24, 2020, 11:09:48 AM
to stacks...@googlegroups.com, Han Xiao, carol.b...@gmail.com
Hi Carol and Han Xiao,

The message is to be taken literally. The input VCF file appears to have
malformed entries that do not include the "AD" (allele depth) field for
that SNP, which populations expects to be present. I am only guessing
that some external software that was applied to the VCF changed or broke
these fields.

Without knowing your process to get the original VCF file, and without
the VCF file to see how the header was encoded, versus the individual
fields (some of which appear to be missing), I can't say more.
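
To see which entries are affected, a minimal Python sketch along these lines can list the sample fields that have a called genotype but no usable AD value. It is only an illustration and not how populations itself validates records. It assumes a plain-text, tab-separated VCF with a #CHROM header line, and the function name records_missing_ad is hypothetical.

```python
def records_missing_ad(vcf_lines):
    """Yield (chrom, pos, sample, sample_field) for called genotypes without AD.

    vcf_lines is any iterable of VCF text lines, e.g. an open file handle.
    """
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]  # sample names follow the FORMAT column
            continue
        fmt = cols[8].split(":")
        gt_idx = fmt.index("GT") if "GT" in fmt else None
        ad_idx = fmt.index("AD") if "AD" in fmt else None
        for name, field in zip(samples, cols[9:]):
            parts = field.split(":")
            gt = parts[gt_idx] if gt_idx is not None and gt_idx < len(parts) else "."
            if "." in gt:
                continue  # uncalled genotype, a missing AD is expected here
            # AD counts as missing if FORMAT lacks it, the sample field is
            # truncated before it, or its value is "." or empty.
            ad = parts[ad_idx] if ad_idx is not None and ad_idx < len(parts) else "."
            if ad in (".", ""):
                yield cols[0], cols[1], name, field
```

Passing an open file, for example `for hit in records_missing_ad(open("input.vcf")): print(hit)`, prints one tuple per affected sample field (input.vcf is a placeholder name).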

julian

Han Xiao wrote on 2/22/20 9:47 AM:

Han Xiao

Feb 24, 2020, 12:22:29 PM
to Stacks
Hi Julian,

Thanks for your reply!

In my case, I first got a VCF file from populations and then filtered it in vcftools for individual missingness. Then I fed the recoded VCF output back into populations to generate a genepop file. By the way, surprisingly, after applying -r and -p to filter the loci in populations, I still had certain individuals with very high missingness that were kept in the dataset. For example, one sample has only two loci, but both of them were sequenced well.

Cheers,
Han

On Monday, February 24, 2020 at 4:09:48 PM UTC, Julian Catchen wrote:

Julian Catchen

Feb 24, 2020, 2:23:52 PM
to stacks...@googlegroups.com, Han Xiao
Hi Han,

VCFtools anecdotally seems to drop some of the fields from the VCF file.
Another approach is to see which loci VCFtools would exclude and instead
of filtering them, just create a black- or whitelist to tell populations
which loci you explicitly want to exclude.
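
As a rough sketch of that approach, the Python below compares the original VCF with the filtered one and lists the loci that no longer have any site in the filtered file, one ID per line, as a blacklist would need. It assumes the ID column of a Stacks-written VCF starts with the locus ID followed by a colon (for example "12:34:+"); please check that against your own file. The function names are hypothetical.

```python
def locus_ids(vcf_lines):
    """Collect locus IDs from the ID column (third column) of VCF records."""
    ids = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        id_field = line.rstrip("\n").split("\t")[2]
        # Assumption: the locus ID is the part before the first colon.
        ids.add(id_field.split(":")[0])
    return ids


def loci_to_blacklist(original_lines, filtered_lines):
    """Return locus IDs present in the original VCF but absent after filtering."""
    removed = locus_ids(original_lines) - locus_ids(filtered_lines)
    # Sorting by (length, text) orders purely numeric IDs numerically.
    return sorted(removed, key=lambda s: (len(s), s))
```

Writing the returned IDs one per line to a text file gives something that could be passed to populations as a blacklist in a run on the original Stacks directory. A locus that lost only some of its sites is not listed by this sketch.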

From your message, it is not clear to me why you can't filter on
'missingness' with populations directly. Perhaps you can explain your
dataset and the filters you are applying that don't give the result you
expect.

julian

Han Xiao wrote on 2/24/20 11:22 AM:

Han Xiao

Feb 25, 2020, 5:46:42 AM
to Stacks
Hi Julian,

Thanks a lot for your quick reply, and yes, I can also make a black/whitelist to do so.

To make the filtering and missingness part clearer: I did filter my samples with -r and -p, and both work. But my point is that these options apply to loci, not to individuals. It happens to me (and I guess to others, if they check) that some individuals have very high missingness, and we cannot use Stacks to check for it.

In my case, I have four morphs of Arctic charr in the same lake, treated as 4 populations. I chose -r 0.66 and -p 4, which all works fine. However, when I checked my dataset in vcftools, I found that 1 individual contains only 2 loci instead of 2000+. These 2 loci apparently passed my filtering, and their coverage is very good, so I didn't filter this individual out when I checked the coverage in the beginning.

I have read suggestions that RADseq tends to have higher missingness than WGS, but that 25% missingness is normally fine. However, no one seems to have tested the impact of different levels of individual missingness. It would be nice to have your opinions!

Best regards,
Han Xiao

On Monday, February 24, 2020 at 7:23:52 PM UTC, Julian Catchen wrote:

Julian Catchen

Feb 25, 2020, 10:23:11 AM
to stacks...@googlegroups.com, Han Xiao
In this case, just drop the individuals you want to remove from the
analysis from the population map and skip re-exporting/importing the VCF
file. -julian
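
One way to decide which individuals to drop is to compute per-sample missingness from the exported VCF and then write a reduced population map. The Python below is a minimal sketch under stated assumptions: a tab-separated VCF with GT as the first FORMAT entry, a tab-separated population map with the sample name in the first column, and a cutoff chosen by the user. The function names and the max_missing parameter are hypothetical.

```python
def sample_missingness(vcf_lines):
    """Return {sample: fraction of sites with an uncalled genotype}."""
    samples, missing, n_sites = [], [], 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]
            missing = [0] * len(samples)
            continue
        if not samples:
            raise ValueError("no #CHROM header line found before the records")
        n_sites += 1
        for i, field in enumerate(cols[9:]):
            # Assumption: GT is the first colon-separated entry of the field.
            if "." in field.split(":")[0]:
                missing[i] += 1
    if n_sites == 0:
        return {}
    return {name: count / n_sites for name, count in zip(samples, missing)}


def filter_popmap(popmap_lines, missingness, max_missing):
    """Keep population-map lines whose sample is at or below max_missing."""
    kept = []
    for line in popmap_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        name = line.split("\t")[0]
        # Samples that are absent from the VCF are kept unchanged.
        if missingness.get(name, 0.0) <= max_missing:
            kept.append(line)
    return kept
```

The kept lines can be written to a new population map file and passed to populations with -M in place of the original one.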

Han Xiao wrote on 2/25/20 4:46 AM:

Linda Lait

Mar 1, 2023, 6:49:57 PM
to Stacks
Sorry to revive this issue. I am trying to use a VCF file as input for populations and am getting the same error as here. This happens even if I use the populations.snps.vcf file that was output by populations (I tested this to make sure it wasn't just vcftools introducing the error). I am trying to do an iterative filter: remove individuals and SNPs, then whitelist the SNPs with the highest MAF per locus, and then use populations to calculate fstats and create a structure file. I see a possible way to do this by identifying the SNPs of interest and whitelisting the individuals, but is there some reason why the VCF file is doing this? It wasn't doing this a month ago, and I don't think I've changed anything except for splitting up the alignment, gstacks, and populations into different runs. I have tried this with Stacks 2.53 and 2.60, and both produce the same error.

error:
Error: AD field is missing.
Warning: Malformed sample field '0/0:1:.:29:-0.00,-2.28,-2.38'.
Warning: Discarding VCF SNP record 'CM022157.1:56064'.

script (for gstacks and populations)
initial run:
gstacks -I /scratch/llait/chickadees/all4/aligned -O /scratch/llait/chickadees/all4/refstacks -M /scratch/llait/chickadees/all4/allpops.txt -t 4
populations -P /scratch/llait/chickadees/all4/refstacks -M /scratch/llait/chickadees/all4/allpops.txt -R 0.5 --vcf --structure --fstats --hwe -t 4

for second round:
populations -V /scratch/llait/chickadees/all4/refstacks/populations.snps.vcf -M /scratch/llait/chickadees/all4/allpops.txt -O /scratch/llait/chickadees/all4/refstacks/filtered -W whitelist.txt --vcf --genepop --structure --plink --phylip --fstats --hwe -t 4

Any suggestions would be appreciated!

Julian Catchen

Mar 6, 2023, 6:08:22 PM
to Stacks

Hi,

It took a very long time to track down this bug, but I think I have a fix for it in Stacks 2.64, which I just released over the weekend. Give it a try and see if the error message is gone.

That said, these SNPs appear to end up in the VCF in datasets with low coverage in some regions, or with a very poor fit to the reference genome, which results in sites with few (if any) reads to support a particular genotype call. If your dataset falls into this category, you might consider tightening the genotyping alpha threshold in gstacks to preemptively remove these calls from the dataset (--gt-alpha 0.01; this requires more reads/evidence to make a particular genotype call), though you will have fewer SNPs in the resulting dataset.

Please let me know if this error message is removed with Stacks 2.64.

Best,

julian

Linda Lait

Mar 12, 2023, 1:06:08 AM
to Stacks
Hi Julian.

Thanks for looking into this and fixing it so quickly!

I just ran my files with stacks/2.64 and it seems to work now (I ran populations both on the VCF file produced directly by populations and on one produced by vcftools; no more error).

Thank you again for your help!

Best,
Linda