Using pysam.VCF

Aaron Quinlan

unread,

Feb 22, 2012, 9:42:14 AM2/22/12

to pysam-us...@googlegroups.com

Hi all,

Does anyone have a working example that uses the pysam.VCF class to iterate through every record in a VCF file? I am struggling a bit to understand the API.

Best,
Aaron

Kevin Jacobs <jacobs@bioinformed.com>

unread,

Feb 22, 2012, 11:09:04 AM2/22/12

to pysam-us...@googlegroups.com

Hi Aaron,

Here is a working example of using the pysam.VCF directly on a HapMap VCF.gz (w/ Tabix index) file:

from pysam import VCF

vcf = VCF()
vcf.connect('EUR.UMICH.20100804.genotypes.vcf.gz')

for record in vcf.fetch('5'):

print record

If you want something simpler, take a look at my basic VCF parser at:

http://code.google.com/p/glu-genetics/source/browse/glu/lib/seqlib/vcf.py

-Kevin

Aaron Quinlan

unread,

Feb 22, 2012, 11:38:57 AM2/22/12

to pysam-us...@googlegroups.com

Thanks for the quick response Kevin,

The connect method was the missing link for me. How does one access the individual genotypes for the samples in each record?

Regards,

Aaron

Kevin Jacobs <jacobs@bioinformed.com>

unread,

Feb 22, 2012, 12:18:25 PM2/22/12

to pysam-us...@googlegroups.com

Samples are available via:

vcf.getsamples()

and genotype records via a dict-like interface:

>>> print record['NA06984']

{'GD': [0.95], 'GT': [[0, '|', 1]], 'GQ': [25], 'DP': [1], 'PL': [27, 0, 50]}

GT=genotype, DP=depth, etc...

-Kevin

Aaron Quinlan

unread,

Feb 22, 2012, 2:11:32 PM2/22/12

to pysam-us...@googlegroups.com

Thanks Kevin, this is very helpful.

I am interested in this library because it uses Cython and thus has greater speed than a pure Python library for parsing VCF files with _many_ samples. However, I am currently encountering two issues with the files that I can typically work with using the the PyVcf library [1].

I am wondering if the pysam developers could comment on these, as I imagine they are rational decisions.

(1) It seems that the parsing of the genotypes does not tolerate cases where the format is GT:AD:DP:GQ:PL, yet the VCF producer provides null genotypes as merely "./.". This poses a bit of a problem for me, as many of my VCF files have been generated by GATK, which often violate this rule. Has anyone considered a "relaxed" option that allows this common issue?

My code is as follows:

#!/usr/bin/env python

import sys

import pysam

vcf = pysam.VCF()

vcf.connect(sys.argv[1])

for var in vcf.fetch():

print var.contig, var.pos, ",".join([str(var[s]['GT']) for s in var.samples])

The error returned is: Error ERROR_FORMAT_NOT_NUMERICAL: Expected integer or float in formatted field; got [None], which comes from line 608 of http://code.google.com/p/pysam/source/browse/pysam/cvcf.pyx

(2) The pysam VCF parser seems to have an issue with the header FORMAT definitions of VCF files generated by both GATK and FreeBayes.

I have tested the above code on two files:

https://github.com/jamescasbon/PyVCF/blob/master/test/freebayes.vcf

https://github.com/jamescasbon/PyVCF/blob/master/test/gatk.vcf

The errors I receive are:

Line 2: '##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">'

Error BADLY_FORMATTED_FORMAT_STRING: Formatting error in the format string

Traceback (most recent call last):

File "test.py", line 7, in <module>

vcf.connect(sys.argv[1])

File "cvcf.pyx", line 982, in cvcf.VCF.connect (pysam/cvcf.c:19709)

File "cvcf.pyx", line 856, in cvcf.VCF._parse_header (pysam/cvcf.c:17303)

File "cvcf.pyx", line 511, in cvcf.VCF.parse_header (pysam/cvcf.c:9401)

File "cvcf.pyx", line 394, in cvcf.VCF.parse_format (pysam/cvcf.c:6414)

File "cvcf.pyx", line 334, in cvcf.VCF.error (pysam/cvcf.c:5021)

ValueError: Formatting error in the format string

and

Line 6: '##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">'

Error BADLY_FORMATTED_FORMAT_STRING: Formatting error in the format string

Traceback (most recent call last):

File "test.py", line 7, in <module>

vcf.connect(sys.argv[1])

File "cvcf.pyx", line 982, in cvcf.VCF.connect (pysam/cvcf.c:19709)

File "cvcf.pyx", line 856, in cvcf.VCF._parse_header (pysam/cvcf.c:17303)

File "cvcf.pyx", line 511, in cvcf.VCF.parse_header (pysam/cvcf.c:9401)

File "cvcf.pyx", line 394, in cvcf.VCF.parse_format (pysam/cvcf.c:6414)

File "cvcf.pyx", line 334, in cvcf.VCF.error (pysam/cvcf.c:5021)

ValueError: Formatting error in the format string

Looking at the spec [2], I am unable to tell what is wrong with these lines. Could one of the developers comment on this? I am sure it is something obvious that I am missing.

All the best,

Aaron

[1] https://github.com/jamescasbon/PyVCF

[2] http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

Kevin Jacobs <jacobs@bioinformed.com>

unread,

Feb 22, 2012, 3:51:15 PM2/22/12

to pysam-us...@googlegroups.com

On Wed, Feb 22, 2012 at 2:11 PM, Aaron Quinlan <aaronq...@gmail.com> wrote:

Thanks Kevin, this is very helpful.

I am interested in this library because it uses Cython and thus has greater speed than a pure Python library for parsing VCF files with _many_ samples. However, I am currently encountering two issues with the files that I can typically work with using the the PyVcf library [1].

Hi Aaron,

My own very simple VCF library is pretty fast, makes very few assumptions about the VCF data itself, even for files with large numbers of samples. So long as you don't need Tabix indexing, you may want to give it a try, as the record parsing is very lightweight and shouldn't be too far behind the pysam.VCF implementation. I also know that it works with GATK and FreeBayes output, since I use them frequently. If you need tabix indexing, I can add that functionality in about 10 minutes.

-Kevin

Andreas

unread,

Feb 24, 2012, 3:03:09 AM2/24/12

to Pysam User group

Hi all,

VCF support in pysam is still in its early days. It is a collaboration
between us (CGAT) and
Andy Rimmer/Gerton Lunter at the WTCHG (UK). From our end we brought
random access
and lazy parsing through the tabix module. The lazy parsing is useful
if you have a VCF file
with 100s of columns and you are only interested in a subset.

Andy and Gerton have written a validating VCF parser which supports
several but possibly not
all VCF versions and dialects.

I see that you have submitted a bug report - thanks - I will need some
time to go through this.

Thanks,
Andreas

On Feb 22, 8:51 pm, "Kevin Jacobs <jac...@bioinformed.com>"
<bioinfor...@gmail.com> wrote:

Aaron Quinlan

unread,

Feb 28, 2012, 9:11:27 AM2/28/12

to pysam-us...@googlegroups.com

Thanks Kevin.

I will give it a try and see how it scales for files with 1000s of samples.

Best,

Aaron

Reply all

Reply to author

Forward