Thanks Kevin, this is very helpful.
I am interested in this library because it uses Cython and should therefore be faster than a pure-Python library when parsing VCF files with _many_ samples. However, I am currently encountering two issues with files that I can normally work with using the PyVcf library [1].
I am wondering if the pysam developers could comment on these, as I imagine they reflect deliberate design decisions.
(1) It seems that the genotype parsing does not tolerate cases where the FORMAT field is GT:AD:DP:GQ:PL, yet the VCF producer writes null genotypes as merely "./.". This poses a bit of a problem for me, as many of my VCF files were generated by GATK, which often violates this rule. Has anyone considered a "relaxed" option that tolerates this common case?
My code is as follows:
#!/usr/bin/env python
import sys
import pysam
vcf = pysam.VCF()
vcf.connect(sys.argv[1])
for var in vcf.fetch():
    print var.contig, var.pos, ",".join([str(var[s]['GT']) for s in var.samples])
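To illustrate the kind of "relaxed" behaviour I have in mind for (1), here is a rough sketch in plain Python. It is not pysam code: the function name parse_sample and the choice of "." as the fill value are my own assumptions, and it only shows padding a truncated sample column out to the length of the FORMAT field.

```python
def parse_sample(format_keys, sample_field, missing="."):
    """Split one sample column, padding any trailing fields that were omitted.

    Hypothetical helper for illustration only; not part of pysam.
    """
    values = sample_field.split(":")
    # When the producer wrote only "./.", fill the remaining keys with the
    # missing-value marker instead of raising an error.
    values += [missing] * (len(format_keys) - len(values))
    return dict(zip(format_keys, values))


keys = "GT:AD:DP:GQ:PL".split(":")
print(parse_sample(keys, "./."))
print(parse_sample(keys, "0/1:10,5:15:99:120,0,200"))
```

With something like this, a truncated "./." sample would simply report "." for AD, DP, GQ and PL rather than stopping the parse.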
(2) The pysam VCF parser seems to have an issue with the header FORMAT definitions of VCF files generated by both GATK and FreeBayes.
I have tested the above code on two files:
The errors I receive are:
Line 2: '##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">'
Error BADLY_FORMATTED_FORMAT_STRING: Formatting error in the format string
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    vcf.connect(sys.argv[1])
  File "cvcf.pyx", line 982, in cvcf.VCF.connect (pysam/cvcf.c:19709)
  File "cvcf.pyx", line 856, in cvcf.VCF._parse_header (pysam/cvcf.c:17303)
  File "cvcf.pyx", line 511, in cvcf.VCF.parse_header (pysam/cvcf.c:9401)
  File "cvcf.pyx", line 394, in cvcf.VCF.parse_format (pysam/cvcf.c:6414)
  File "cvcf.pyx", line 334, in cvcf.VCF.error (pysam/cvcf.c:5021)
ValueError: Formatting error in the format string
and
Line 6: '##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">'
Error BADLY_FORMATTED_FORMAT_STRING: Formatting error in the format string
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    vcf.connect(sys.argv[1])
  File "cvcf.pyx", line 982, in cvcf.VCF.connect (pysam/cvcf.c:19709)
  File "cvcf.pyx", line 856, in cvcf.VCF._parse_header (pysam/cvcf.c:17303)
  File "cvcf.pyx", line 511, in cvcf.VCF.parse_header (pysam/cvcf.c:9401)
  File "cvcf.pyx", line 394, in cvcf.VCF.parse_format (pysam/cvcf.c:6414)
  File "cvcf.pyx", line 334, in cvcf.VCF.error (pysam/cvcf.c:5021)
ValueError: Formatting error in the format string
Looking at the spec [2], I am unable to tell what is wrong with these lines. Could one of the developers comment on this? I am sure it is something obvious that I am missing.
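For what it is worth, here is a minimal sketch of the check I would have expected to pass for (2). As far as I can tell, VCF 4.1 allows Number to be an integer, "A", "G" or ".", and a pattern written that way accepts both header lines. The regular expression and the name parse_format_line are my own illustration and say nothing about how pysam actually parses the header.

```python
import re

# Hypothetical pattern: ID, Number (integer, A, G or .), Type, quoted Description.
FORMAT_RE = re.compile(
    r'^##FORMAT=<ID=([^,]+),Number=(\d+|[AG.]),'
    r'Type=(Integer|Float|Character|String),Description="(.*)">$'
)


def parse_format_line(line):
    """Return the four fields of a ##FORMAT header line as a dict."""
    m = FORMAT_RE.match(line.strip())
    if m is None:
        raise ValueError("Formatting error in the format string: %r" % line)
    return dict(zip(("ID", "Number", "Type", "Description"), m.groups()))


ao = parse_format_line('##FORMAT=<ID=AO,Number=A,Type=Integer,'
                       'Description="Alternate allele observation count">')
pl = parse_format_line('##FORMAT=<ID=PL,Number=G,Type=Integer,'
                       'Description="Normalized, Phred-scaled likelihoods for '
                       'genotypes as defined in the VCF specification">')
print(ao["ID"], ao["Number"])
print(pl["ID"], pl["Number"])
```

If the parser only accepts integer values for Number, that might explain both errors, but that is a guess on my part.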
All the best,
Aaron