VCF haploid to diploid


Elisabet Sjokvist

Oct 17, 2017, 6:17:22 AM
to ashworth-c...@googlegroups.com
Hi all,

I have a VCF file for a haploid organism, but want to try out the R package PopGenome which works on diploid organisms. Does anyone know of a way to transform a haploid VCF to diploid?

Cheers,
Elisabet

Kashyap Chhatbar

Oct 17, 2017, 6:31:29 AM
to ashworth-code-monkeys
Hi Elisabet,

People usually ask me how to make a diploid VCF haploid, so this is the reverse. Here is what the PopGenome reference manual says; I don't know whether it will work as desired. Hope this helps, and if it works you won't need to convert anything.

The readVCF function has the following details:

When approx=TRUE, the algorithm will apply a logical OR to the GT-field: (0|0=0,1|0=1,0|1=1,1|1=1). Note, this is an approximation for diploid data, which will speed up calculations. In case of haploid data, approx should be switched to TRUE. If approx=FALSE, the full diploid information will be considered.
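
To make the logical OR described above concrete, here is a minimal base-R sketch of that collapse. It does not call PopGenome, and `collapse_gt` is a hypothetical helper name, not part of the package. It assumes biallelic genotypes coded as 0/1.

```r
# Hypothetical helper (not a PopGenome function): collapse a diploid GT
# string to a single 0/1 value with a logical OR, as the manual describes
# for approx=TRUE. Assumes biallelic 0/1 calls separated by "|" or "/".
collapse_gt <- function(gt) {
  alleles <- strsplit(gt, "[|/]")
  # 1 if either allele is the alternate allele, otherwise 0
  vapply(alleles, function(a) as.integer(any(a == "1")), integer(1))
}

collapsed <- collapse_gt(c("0|0", "1|0", "0|1", "1|1"))
print(collapsed)
```

The four inputs are exactly the cases listed in the manual text (0|0=0, 1|0=1, 0|1=1, 1|1=1).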

--
The wiki is at:
https://www.wiki.ed.ac.uk/display/AshCodes/Ashworth+Codemonkeys
The mailing list archive is at:
https://groups.google.com/forum/?fromgroups#!forum/ashworth-code-monkeys
If you have trouble editing the wiki or emailing the group, let me know: sujai...@ed.ac.uk
---
You received this message because you are subscribed to the Google Groups "Ashworth Codemonkeys" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ashworth-code-monkeys+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--

Kashyap Chhatbar

Elisabet Sjokvist

Oct 18, 2017, 8:20:52 AM
to ashworth-c...@googlegroups.com
Hi Kashyap,

Here's what I find on readVCF:

"The readVCF function expects a tabixed VCF file with a diploid GT field.
In case of haploid data, the GT field has to be transformed to a pseudo-diploid
field (such as 0 -> 0|0). An alternative is to use readData(..., format="VCF"),
which can read non-tabixed haploid and any kind of polyploid VCFs directly.
When approx=TRUE, the algorithm will apply a logical OR to the GT-field:
(0|0=0,1|0=1,0|1=1,1|1=1). Note, this is an approximation for diploid data, which will
speed up calculations. In case of haploid data, approx should be switched to TRUE.
If approx=FALSE, the full diploid information will be considered.
The ff-package PopGenome uses to store the SNP information limits total data size to
individuals * (number of SNPs) <= .Machine$integer.max
In case of very large data sets, the bigmemory package will be used;
this will slow down calculations (e.g. this package have to be installed first !!!).
Use the function vcf_handle <-.Call("VCF_open", filename)
to open a VCF-file and .Call("VCF_getSampleNames",vcf_handle)
to get and define the individuals which should be considered in the analysis.
See also readData(..., format="VCF") !"
 
So even though readVCF can collapse the data to pseudo-haploid, the input has to be diploid to begin with, which is quite silly. The other option is to use readData, but that seems less flexible, and it also results in a crash.

Thanks for the input though.

Elisabet
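
For reference, the pseudo-diploid transformation mentioned in the manual text (0 -> 0|0) could be sketched in base R as below. This is only a sketch under stated assumptions: `haploid_to_pseudodiploid` is a hypothetical helper, the file names in the comments are placeholders, GT is assumed to be the first colon-separated subfield of each sample column, and sample columns are assumed to start at column 10. Per the manual, the result would still need to be tabixed before readVCF can read it.

```r
# Hypothetical helper: turn each haploid GT call into a phased
# pseudo-diploid call, e.g. "0" -> "0|0" and "1:9" -> "1|1:9".
# Header lines (starting with "#") and already-diploid calls are untouched.
haploid_to_pseudodiploid <- function(lines) {
  is_data <- !startsWith(lines, "#") & nzchar(lines)
  lines[is_data] <- vapply(lines[is_data], function(line) {
    fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
    if (length(fields) > 9) {
      samples <- fields[10:length(fields)]
      gt <- sub(":.*$", "", samples)              # GT assumed to come first
      rest <- substring(samples, nchar(gt) + 1)   # remaining subfields, if any
      haploid <- !grepl("[|/]", gt)
      gt[haploid] <- paste(gt[haploid], gt[haploid], sep = "|")
      fields[10:length(fields)] <- paste0(gt, rest)
    }
    paste(fields, collapse = "\t")
  }, character(1), USE.NAMES = FALSE)
  lines
}

# Small in-memory example with two haploid samples
example_vcf <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
  "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0:12\t1:9"
)
converted <- haploid_to_pseudodiploid(example_vcf)
cat(converted, sep = "\n")

# On a real file (placeholder names):
# writeLines(haploid_to_pseudodiploid(readLines("haploid.vcf")), "pseudodiploid.vcf")
```

Reading the whole file with readLines keeps the sketch short but may be slow for very large VCFs; a streaming tool would be a reasonable alternative there.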





Chen-Tau Lin

Feb 23, 2019, 3:04:23 AM
to Ashworth Codemonkeys

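library(PopGenome)
# Split the VCF by scaffold into "directory", then read that folder with readData
VCF_split_into_scaffolds("input.vcf", "directory")

df <- readData("directory", format="VCF", include.unknown=TRUE)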

On Wednesday, October 18, 2017 at 8:20:52 PM UTC+8, Elisabet Sjokvist wrote:
