How to convert a multi-sample VCF to STRUCTURE file format?

geek_y

Feb 7, 2015, 4:45:42 AM
to structure...@googlegroups.com
I have a VCF file for multiple samples from a population. How can I use this file in STRUCTURE? What would be the best workflow from NGS raw reads to STRUCTURE? I have used Stacks, but I have a few concerns about how it writes the STRUCTURE file from SNP data.

Vikram Chhatre

Feb 7, 2015, 6:48:53 AM
to structure-software
Great question!  There are two ways to do this:

1. Use vcftools to export to .ped format with the --plink switch. Then convert the .ped to .bed using PLINK. fastStructure will read the .bed format files (there are three files for each project: .bed, .bim and .fam).

2. Use PGDSpider to convert from VCF to STRUCTURE or fastStructure format. Make sure to run the --maf filter in vcftools before converting the files, to get rid of any monomorphic loci. The --min-alleles and --max-alleles filters, even when both are set to 2, do not get rid of monomorphic sites. The --maf filter (keep the threshold very low) will accomplish this. If monomorphic sites are retained, PGDSpider is unable to convert the data properly.
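To illustrate why an allele-count filter and a frequency filter can disagree, here is a minimal Python sketch. It assumes that --min-alleles/--max-alleles count the alleles listed in the REF and ALT columns, while --maf is computed from the genotypes actually called. The function names and toy records are hypothetical, and this is not vcftools' own code.

```python
def listed_allele_count(ref, alt):
    """Number of alleles named in the REF and ALT columns of one record."""
    alts = [a for a in alt.split(",") if a != "."]
    return 1 + len(alts)


def minor_allele_frequency(genotypes):
    """MAF of a biallelic site from GT strings such as '0/1' or './.'."""
    alleles = []
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":  # skip missing calls
                alleles.append(allele)
    if not alleles:
        return 0.0
    alt_freq = sum(1 for a in alleles if a != "0") / len(alleles)
    return min(alt_freq, 1.0 - alt_freq)


# Toy records (hypothetical): name -> (REF, ALT, genotypes)
sites = {
    "polymorphic": ("A", "G", ["0/0", "0/1", "1/1"]),
    # ALT is listed, but no sample in this subset carries it
    "monomorphic": ("C", "T", ["0/0", "0/0", "./."]),
}

for name, (ref, alt, gts) in sites.items():
    print(name, listed_allele_count(ref, alt), minor_allele_frequency(gts))
```

Both toy sites list two alleles, so an allele-count filter set to 2 would keep both, but the second site has a minor allele frequency of 0 and would be dropped by any --maf threshold above zero.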

PGDSpider needs a popfile containing sample ID and population ID columns, and an SPID file, which is a simple text parameter file that you can generate from the GUI.

I prefer option 2. Let me know if you have more questions.

V
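The command-line steps described in this reply could be scripted roughly as follows. This is only a sketch: the file names (samples.vcf, filtered, project) and the MAF threshold of 0.01 are placeholder assumptions, not values from the thread, and the script only builds and prints the commands rather than running them.

```python
import shlex


def vcftools_maf_filter(vcf_in, out_prefix, maf):
    """vcftools call that drops sites below the given minor allele frequency.

    With --recode, vcftools writes the result to <out_prefix>.recode.vcf.
    """
    return ["vcftools", "--vcf", vcf_in, "--maf", str(maf),
            "--recode", "--recode-INFO-all", "--out", out_prefix]


def vcftools_to_ped(vcf_in, out_prefix):
    """vcftools call that exports <out_prefix>.ped and <out_prefix>.map."""
    return ["vcftools", "--vcf", vcf_in, "--plink", "--out", out_prefix]


def plink_ped_to_bed(prefix):
    """PLINK call that converts .ped/.map to .bed/.bim/.fam."""
    return ["plink", "--file", prefix, "--make-bed", "--out", prefix]


# Placeholder names and threshold (assumptions for illustration only)
commands = [
    vcftools_maf_filter("samples.vcf", "filtered", 0.01),
    vcftools_to_ped("filtered.recode.vcf", "project"),
    plink_ped_to_bed("project"),
]

for cmd in commands:
    # Print instead of executing; pass cmd to subprocess.run to execute it
    print(shlex.join(cmd))
```

For option 2, the filtered VCF from the first command would instead go to PGDSpider together with the popfile and SPID file.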

--
You received this message because you are subscribed to the Google Groups "structure-software" group.
To unsubscribe from this group and stop receiving emails from it, send an email to structure-softw...@googlegroups.com.
To post to this group, send email to structure...@googlegroups.com.
Visit this group at http://groups.google.com/group/structure-software.
For more options, visit https://groups.google.com/d/optout.

geek_y

Feb 7, 2015, 8:42:00 AM
to structure...@googlegroups.com
Thanks for the reply, Dr. Chhatre. The advantage of Stacks, as I see it, is that it can filter data by percentage of missing data and write the data to different file formats.

But as I am working with a model organism, I am interested in using a GATK-based pipeline. Is there any way to filter a multi-sample VCF file so that we end up with SNPs that are present in at least n samples? And how should indels be handled when converting a VCF file?


geek_y

Feb 8, 2015, 11:37:08 PM
to structure...@googlegroups.com
Older versions of vcftools have an option to keep SNPs that are present in at least 'n' samples. This is used in the dDocent pipeline. I will look into it. I am still confused about indels.

Vikram Chhatre

Feb 9, 2015, 12:05:57 PM
to structure-software
vcftools has site missingness and individual missingness filters, which should help with what you want to do. Also, a Google search for 'vcf indels' brings up some information you may be interested in (first couple of hits).

V

geek_y

Feb 10, 2015, 7:54:32 AM
to structure...@googlegroups.com
Yes, got it. vcftools has an option '--max-missing' that sets the threshold for missing data per site. I am using freebayes to call SNPs, and it has an option '--no-indels'. Thanks.
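For anyone wondering how "present in at least n samples" relates to a missingness threshold, here is a rough Python sketch of the per-site logic. It assumes a site is kept when the fraction of samples with a called genotype is at least the threshold (1 meaning no missing data allowed), and it treats any variant that changes the length of the REF allele as an indel. The function names and toy genotypes are hypothetical, and this is not how vcftools or freebayes implement it internally.

```python
def call_rate(genotypes):
    """Fraction of samples with a fully called genotype (no '.' allele)."""
    called = sum(1 for gt in genotypes if "." not in gt)
    return called / len(genotypes)


def is_indel(ref, alt):
    """True if any ALT allele has a different length than REF."""
    return any(len(a) != len(ref)
               for a in alt.split(",") if a not in (".", "*"))


def keep_site(ref, alt, genotypes, min_call_rate):
    """Keep sites that are not indels and have enough called samples."""
    return not is_indel(ref, alt) and call_rate(genotypes) >= min_call_rate


# "Present in at least n of N samples" expressed as a call-rate threshold
n_required, n_samples = 3, 4
threshold = n_required / n_samples

print(keep_site("A", "G", ["0/0", "./.", "0/1", "1/1"], threshold))   # 3 of 4 called
print(keep_site("A", "G", ["0/0", "./.", "./.", "1/1"], threshold))   # 2 of 4 called
print(keep_site("AT", "A", ["0/0", "0/0", "0/1", "1/1"], threshold))  # deletion
```

Only the first toy site passes: the second has too few called samples, and the third is an indel.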

Jennifer Gruhn

Feb 12, 2015, 12:34:11 AM
to structure...@googlegroups.com
Your Option #2 worked so well!!! My eyes had been glazing over all week at the scripts used to convert VCF into Structure files.

Myself and whitebark pine thank you haha =) Saved me days...

Megan Aylward

Jul 6, 2016, 11:52:37 PM
to structure-software, goutha...@gmail.com
I have tried using PGDSpider to convert a VCF file, but I get a warning message that bases have been converted to integers. So my file contains -9 for missing data and only the integers 1-4 for each genotype. Can STRUCTURE accept this format, or do you know why this error may be occurring?

Thanks

Vikram Chhatre

Jul 7, 2016, 8:42:59 AM
to structure...@googlegroups.com
This is not an error. Structure will work with integers. 

V
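As a rough illustration of what integer-coded genotypes look like, the Python sketch below turns diploid SNP genotypes into two rows per individual, with -9 for missing data. The A/C/G/T to 1-4 mapping, the function names, and the sample data are assumptions for illustration; PGDSpider's actual mapping may differ, and the missing-data value has to match whatever is set in the STRUCTURE parameters.

```python
BASE_CODES = {"A": 1, "C": 2, "G": 3, "T": 4}  # hypothetical mapping
MISSING = -9


def encode_genotype(ref, alt, gt):
    """Return two integer allele codes for one diploid SNP genotype."""
    alleles = [ref] + alt.split(",")
    codes = []
    for idx in gt.replace("|", "/").split("/"):
        if idx == ".":
            codes.append(MISSING)
        else:
            codes.append(BASE_CODES[alleles[int(idx)]])
    return codes


def structure_rows(sample, pop, site_genotypes):
    """Build two rows per individual, one for each allele copy."""
    pairs = [encode_genotype(r, a, g) for r, a, g in site_genotypes]
    rows = []
    for copy in (0, 1):
        fields = [sample, str(pop)] + [str(pair[copy]) for pair in pairs]
        rows.append(" ".join(fields))
    return rows


# Toy individual with three SNP sites: (REF, ALT, GT)
genotypes = [("A", "G", "0/1"), ("C", "T", "./."), ("G", "T", "1/1")]
for row in structure_rows("ind1", 1, genotypes):
    print(row)
```

The allele codes are just labels, so the particular integers used for each base do not matter as long as they are consistent within a locus.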


Megan Aylward

Jul 7, 2016, 9:25:18 AM
to structure-software
Ok great, thanks for clarifying.