SNP data formatting

Marc Beer

Apr 21, 2020, 5:27:10 PM
to migrate-support
Hi,

I am attempting to reformat diploid SNP data into an acceptable format for Migrate. I have tried converting a number of file formats (e.g., vcf, genepop) to Migrate "N" format using PGDSpider. Irrespective of input format, the returned file has homozygotes coded using a single allele (e.g., G/G individuals are coded as G for a given locus), while heterozygotes are coded using IUPAC genotype codes (e.g.,  A/G individuals are coded as R). Migrate subsequently identifies R as an additional allele instead of correctly interpreting it as an A/G heterozygote. I have additionally tried the R function vcfR2migrate (package vcfR), but this function removes loci with missing data, which is not compatible with my dataset (ddRADseq data), unless I were to impute missing genotypes.
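The recoding issue described above can be illustrated with a minimal Python sketch that expands a one-letter genotype code back into two explicit alleles. The mapping uses the standard IUPAC two-base ambiguity codes; the function name `expand_genotype` and the `?` missing-data symbol are assumptions made for illustration, not part of PGDSpider or Migrate.

```python
# Standard IUPAC two-base ambiguity codes and the allele pair each one represents.
IUPAC_HET = {
    "R": ("A", "G"),
    "Y": ("C", "T"),
    "S": ("C", "G"),
    "W": ("A", "T"),
    "K": ("G", "T"),
    "M": ("A", "C"),
}


def expand_genotype(code, missing="?"):
    """Return the two alleles of a diploid genotype given a one-letter code.

    The `missing` symbol is a placeholder chosen for this sketch.
    """
    code = code.upper()
    if len(code) == 1 and code in "ACGT":
        return (code, code)          # homozygote, e.g. G -> G/G
    if code in IUPAC_HET:
        return IUPAC_HET[code]       # heterozygote, e.g. R -> A/G
    return (missing, missing)        # N, gaps, or anything else


if __name__ == "__main__":
    for code in ["G", "R", "N"]:
        print(code, expand_genotype(code))
```

Unrecognized codes are treated as missing rather than dropped, so loci with missing genotypes stay in the data set.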

Does anyone know of additional file format converters that correctly generate Migrate format files while allowing missing data?

Best wishes,
Marc Beer

Peter Beerli

Apr 21, 2020, 5:36:11 PM
to migrate...@googlegroups.com
Marc,
I have a Python converter, VCF2migrate, that is close to being finished (it will surely be deficient in many respects, but despite these shortcomings I guess this is what you want).
It will need a reference sequence (or several, for different loci/chromosomes) and a VCF file to convert haploids or diploids to a migrate output file (it will not be able to understand all the details of the full VCF specification).
I can finalize the code soon using some “foreign” examples, so if you have a VCF file with some samples in it, I would be happy to test. I would need a VCF file and a reference sequence; if the files are huge (e.g., like those for the 1000 human genomes), it will break. I hope to release the combo mig2vcf.py and vcf2mig.py in a few weeks.

Peter



--
You received this message because you are subscribed to the Google Groups "migrate-support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to migrate-suppo...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/migrate-support/7de1ccb3-a646-4f0a-9749-88786152edf4%40googlegroups.com.

Marc Beer

Apr 21, 2020, 5:54:16 PM
to migrate-support
Hi Peter,

Thank you for the incredibly fast response. Is there an assumption that the reference belongs to the same species as the data in the vcf? I developed RAD loci [+ called SNPs] using Stacks and a somewhat divergent reference genome. I am relatively new to handling NGS data, so perhaps this is not a potential issue.

Best,
Marc

Peter Beerli

Apr 21, 2020, 6:08:25 PM
to migrate...@googlegroups.com
Marc,
I will use the reference to reconstitute complete sequences and then add the variants from the VCF file. I would assume that the reference should be part of the original data, but since the code does not know that, it will just use the reference as a base. This is probably good enough (but one would need to do some tests to see whether this considerably distorts results).
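As a rough illustration of the idea of rebuilding a full sequence from a reference plus VCF variants, here is a minimal Python sketch. It is not the converter discussed in this thread: the function name `apply_snps` and the tuple layout are assumptions, it handles only single-base substitutions, and it uses 1-based positions as VCF records do.

```python
def apply_snps(reference, variants):
    """Return a copy of `reference` with single-base substitutions applied.

    variants: iterable of (pos, ref, alt) tuples with 1-based positions.
    Indels and multi-base records are skipped in this sketch.
    """
    seq = list(reference)
    for pos, ref, alt in variants:
        if len(ref) != 1 or len(alt) != 1:
            continue  # not a simple SNP
        if seq[pos - 1].upper() != ref.upper():
            # The REF allele disagrees with the supplied reference sequence.
            raise ValueError(f"reference mismatch at position {pos}")
        seq[pos - 1] = alt
    return "".join(seq)


if __name__ == "__main__":
    print(apply_snps("ACGTACGT", [(2, "C", "T"), (8, "T", "A")]))
```

The mismatch check is a design choice for the sketch: it flags cases where the reference sequence and the VCF REF column do not describe the same base.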

After rereading your message, I realize that you want to extract only SNPs from a VCF (I guess that is what PGDSpider is doing).
I do not believe that SNPs are great data for evaluating population size and gene flow, because they lack the invariant sites that have an impact on the population size estimates. If we exclude the invariants (as with SNPs or site frequency spectra), we need other information to convey the frequency of variable sites versus invariants. This is one of the reasons why I want to write this converter.
Peter





Pedro Henrique Pezzi

Apr 22, 2020, 3:28:00 PM
to migrate-support
Hello Marc and Peter!

I believe I have some similar questions, and I would appreciate it if you could help me in some way. I converted my vcf to migrate format in PGDSpider (part of the infile is attached), but when I run migrate, it crashes and does not create the PDF file. I believe this is a problem with the infile. I have 4 populations and ~21,000 SNPs. I am also attaching the output file from a run with default settings so you can take a look. I understand that this kind of data is not the best, so I am looking forward to using the new Python converter you are creating.

Thank you so much for your help!

Best,
Pedro
outfile
infile.txt

Marc Beer

Apr 28, 2020, 6:54:25 AM
to migrate-support
Hi Peter and Pedro,

Peter - sorry for the delay in my response. The reference genome I used for developing RAD loci in Stacks is quite large (28Gb; an Ambystomid), so I am not sure it would present a favorable scenario to test your Python file converter.

Pedro - I think Peter will have to look at your files, since I am too inexperienced with Migrate to quickly spot any issues. However, I ended up making a bash script that calls on vcftools to create a Migrate "H" infile from a vcf, and the resulting infile seems to work. The script is not quite generalized yet, but I'd be happy to send it your way in a few days, if you are still looking for a solution.
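The attached script itself is not reproduced in this thread. As a hedged illustration of the kind of per-population allele counting that a VCF-to-allele-count conversion involves, here is a small Python sketch that tolerates missing genotypes. The function names and the dictionary inputs are assumptions for illustration only; this does not reproduce the vcftools-based script or the exact Migrate "H" layout.

```python
from collections import Counter


def allele_counts(genotypes):
    """Count alleles across VCF-style GT strings such as '0/1', '1|1', './.'.

    Missing alleles ('.') are skipped, so a locus with missing data is kept.
    """
    counts = Counter()
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                counts[allele] += 1
    return counts


def counts_by_population(sample_genotypes, sample_to_pop):
    """Group genotypes at one locus by population and count alleles in each."""
    per_pop = {}
    for sample, gt in sample_genotypes.items():
        per_pop.setdefault(sample_to_pop[sample], []).append(gt)
    return {pop: allele_counts(gts) for pop, gts in per_pop.items()}


if __name__ == "__main__":
    gts = {"s1": "0/0", "s2": "0/1", "s3": "./.", "s4": "1|1"}
    pops = {"s1": "popA", "s2": "popA", "s3": "popB", "s4": "popB"}
    print(counts_by_population(gts, pops))
```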

Cheers,
Marc

Pedro Henrique Pezzi

Apr 28, 2020, 8:41:38 AM
to migrate-support
Hello Marc!

Thank you for your reply! I would appreciate it if you could send me the script whenever you can, it would be great! Thank you for your help!

Best,
Pedro

Marc Beer

Apr 28, 2020, 7:38:13 PM
to migrate-support

Hi Pedro,

Attached is the bash script. Most of the changes you will have to make are to file names, and there is a final, manual step not included in the script, which is to add the first line with "H #pops #loci" (tab-separated). Let me know if it gives you any trouble!
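The manual final step described above (adding a first line with "H #pops #loci", tab-separated) could be automated along the following lines. This is only a sketch based on the description in this message; the function name `add_migrate_header` is hypothetical, and any further header fields that Migrate may accept are not covered.

```python
def add_migrate_header(body_text, n_pops, n_loci):
    """Prepend a tab-separated 'H <pops> <loci>' line to the infile body."""
    header = f"H\t{n_pops}\t{n_loci}\n"
    return header + body_text


if __name__ == "__main__":
    # Hypothetical body text; a real body would come from the script's output.
    body = "population_1\n"
    print(add_migrate_header(body, n_pops=4, n_loci=21000), end="")
```

To apply it to a file, read the script's output into a string, pass it through the function, and write the result to a new infile.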

Best,
Marc
vcf_to_migrate.sh

Amaranta Fontcuberta

Apr 29, 2020, 1:28:04 PM
to migrate-support
Dear Peter,

I am jumping into the conversation. It is great news that you are developing a pipeline to convert RADseq data into a better format for migrate than SNPs.
I work with data on one ant species.
If you are willing to try the pipeline with "foreign" data, I would be happy to send you a vcf of ~12,000 SNPs and the reference genome (300Mb), or a piece of it (one of 27 seven chromosomes).

All the best

amaranta.

Peter Beerli

Apr 29, 2020, 8:18:48 PM
to migrate...@googlegroups.com
Pedro,
I finally checked your file. You use single SNPs, but you have lines like this
P.infl08..WTTA
in your datafile.
This will lead to a frameshift and a failure to read your data.
Trying to use Marc’s vcf2hapmap-type script is probably better for you anyway.
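A quick way to look for lines like the one quoted above is to check the length of the data field on each individual's line. The Python sketch below assumes that the individual name occupies the first 10 columns (as in the quoted line, where `P.infl08..` is 10 characters) and that each single-SNP locus should carry exactly one character. Both are assumptions for illustration, and the function name `find_bad_lines` is hypothetical.

```python
NAME_WIDTH = 10  # assumption: individual names fill the first 10 columns


def find_bad_lines(lines, sites_per_locus=1, name_width=NAME_WIDTH):
    """Return (line_number, line) pairs whose data field has an unexpected length.

    `lines` should contain only individual data lines, not header or
    population lines.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        data = text[name_width:].strip()
        if len(data) != sites_per_locus:
            bad.append((number, text))
    return bad


if __name__ == "__main__":
    # The first line is a made-up well-formed example; the second is from the thread.
    sample = ["P.infl01..A", "P.infl08..WTTA"]
    for number, text in find_bad_lines(sample):
        print(f"line {number}: {text}")
```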

Peter




Peter Beerli

Apr 29, 2020, 8:28:22 PM
to migrate...@googlegroups.com
Marc,
down the road, I am happy to try your large data set, but currently it will simply break the reading routine of migrate. Gigabytes are not yet great input data for migrate but, in principle, could be made to work. I will write a small grant this summer, because I can envision being able to read and analyze datasets of any size within a year or so. The only problem then will be that you will need a cluster to analyze things, because the underlying machinery will still not be very fast, but this would finally allow comparing the migrate approach to the fast simulation and expectation systems of momi and other site frequency spectra programs.

Peter

 


Peter Beerli

Apr 29, 2020, 8:28:23 PM
to migrate...@googlegroups.com
Amaranta,
I would be happy to try the full data.
Please send a link to download the data to bee...@fsu.edu
thanks
Peter



Peter Beerli

Apr 29, 2020, 8:28:45 PM
to migrate...@googlegroups.com
Marc,
I will try your script, and if it works fine with other vcf data, I would be happy to put it into the contribution folder that I distribute with the program
[we would need to discuss a few things off the migrate support list; please send me a direct email at bee...@fsu.edu]

Peter




Pedro Henrique Pezzi

Apr 30, 2020, 12:53:40 PM
to migrate-support
Thank you, Peter, for your help!
I will try using Marc's script, and I look forward to using the one you are going to create!

Best,
Pedro