CAVA failing for VCFs

Victor

unread,

Jul 9, 2015, 3:51:11 PM7/9/15

to cava-us...@googlegroups.com

I have implemented CAVA and been testing it using several VCFs but CAVA fails for each VCF. The human reference FASTA being used works fine for all other variant callers. Log file is included below. Thank you for your assistance.

Input file contains 1555409 records to annotate.

Annotating variants ... 23.0%Traceback (most recent call last):
File "./cava.py", line 184, in <module>
record.annotate(ensembl,dbsnp,reference,impactdir)
File "/home/cava-v1.0.0/basics.py", line 247, in annotate
variant.annotate(ensembl,dbsnp,reference,impactdir)
File "/home/cava-v1.0.0/basics.py", line 88, in annotate
if not ensembl is None: self=ensembl.annotate(self,reference,impactdir)
File "/home/cava-v1.0.0/data.py", line 199, in annotate
variant_plus=variant.alignOnPlusStrand(reference)
File "/home/cava-v1.0.0/basics.py", line 100, in alignOnPlusStrand
seq1=reference.getReference(self.chrom,self.pos,self.pos+len(self.ref)-1+100)
File "/home/cava-v1.0.0/data.py", line 500, in getReference
seq = self.fastafile.fetch(goodchrom,start-1,end)
File "pysam/cfaidx.pyx", line 182, in pysam.cfaidx.FastaFile.fetch (pysam/cfaidx.c:3371)
ValueError: invalid region: start (198043100) > end (198022430)

Márton Münz

unread,

Jul 9, 2015, 4:00:12 PM7/9/15

to cava-us...@googlegroups.com, thed...@existencehealth.com

Hi Victor,

Could you please attach an example VCF file that fails or send it to me to marto...@well.ox.ac.uk ?

Many thanks.

Best wishes,

Marton

Victor

unread,

Jul 10, 2015, 5:28:01 AM7/10/15

to cava-us...@googlegroups.com, thed...@existencehealth.com

Hi Marton - thank you for your response. One of the VCF files can be downloaded from this url: https://www.dropbox.com/s/vschxlrnhc688hh/2003_1.Ann.vcf.gz?dl=0

Thought the log from a subsequent run (that failed) may also be helpful - i included it below. Thank you for all your assistance.

here is some another run (log is a bit of different):

Input file contains 1555409 records to annotate.

Annotating variants ... 74.6%Traceback (most recent call last):

File "./cava.py", line 184, in <module>

record.annotate(ensembl,dbsnp,reference,impactdir)

File "/home/cava-v1.0.0/basics.py", line 247, in annotate

variant.annotate(ensembl,dbsnp,reference,impactdir)

File "/home/cava-v1.0.0/basics.py", line 88, in annotate

if not ensembl is None: self=ensembl.annotate(self,reference,impactdir)

File "/home/cava-v1.0.0/data.py", line 284, in annotate

csn_plus=csn.getAnnotation(variant_plus,transcript,reference)

File "/home/cava-v1.0.0/csn.py", line 68, in getAnnotation

protein = makeProteinString(variant,transcript,reference)

File "/home/cava-v1.0.0/csn.py", line 172, in makeProteinString

prot=transcript.getProteinSequence(reference,None)

File "/home/cava-v1.0.0/basics.py", line 443, in getProteinSequence

ret=Sequence(self.getCodingSequence(reference,variant)).translate(1)

File "/home/cava-v1.0.0/basics.py", line 414, in getCodingSequence

ret+=reference.getReference(self.chrom,exon.start+1,exon.end)

File "/home/cava-v1.0.0/data.py", line 500, in getReference

seq = self.fastafile.fetch(goodchrom,start-1,end)

File "pysam/cfaidx.pyx", line 182, in pysam.cfaidx.FastaFile.fetch (pysam/cfaidx.c:3371)

ValueError: invalid region: start (114426046) > end (114364328)

Márton Münz

unread,

Jul 10, 2015, 9:51:08 AM7/10/15

to cava-us...@googlegroups.com, thed...@existencehealth.com

Hi Victor,

Thanks for sending me the VCF!

In this particular VCF, the problem seems to be caused by the variant at the site chr3:198043089

(chr3 198043089 . GATAAATAAATA GATAAATAAATAAATA).

The total length of chromosome 3 in the GRCh37/hg19 reference genome is 198022430 bases, so I'm wondering how is it possible that your variant is located in a position that is not on the chromosome (i.e. more than 20kb 'downstream' the last base of chr3)? This variant causes CAVA to return the error message you got.

Please let me know what you think.

Thanks! Best wishes,

Marton

Victor

unread,

Jul 10, 2015, 2:50:19 PM7/10/15

to cava-us...@googlegroups.com, thed...@existencehealth.com

Hi Marton,

Thank you for identifying this. I'm actually using GRCh38.p3 as the reference genome. I see now that the default provided is @ensembl = exome_65_GRCh37.gz.

Are there existing files that can be obtained for GRCh38.gz (and also for GRCh37.gz) that can be used for the @emsembl configuration option for hg38 or hg17 reference genomes?

Please let me know if any other configuration changes are required when switching the configuration to use a different human reference genome.

Thank you again for your all your assistance, it is greatly appreciated!

Márton Münz

unread,

Jul 11, 2015, 5:58:52 AM7/11/15

to cava-us...@googlegroups.com, thed...@existencehealth.com

Hi Victor,

Yes, CAVA uses GRCh37 as default. However you can create any arbitrary transcript database using the included dbprep tool which allows you to specify the Ensembl release and reference genome.

For example, to create a transcript database using Ensembl release 67 and reference genome GRCh37:

python dbprep.py -e 67 -o output

To create a transcript database based on Ensembl release 78 and reference GRCh38:

python dbprep.py -e 78 -g GRCh38 -o output

An output file output.gz (and its index file output.gz.tbi) will be generated that you can directly include at the @ensembl configuration option.

The above commands will retrieve all transcripts from the Ensembl database. Alternatively, you can include only a custom list of transcripts by using option -i. For example:

python dbprep.py -i transcripts.txt -e 78 -g GRCh38 -o output

where transcripts.txt contains your custom list of Ensembl IDs (each ID in different line) - please see examples in the documentation (cava-v1.0.0-doc.pdf).

Note that CAVA was tested only for GRCh37 but I believe it should work with GRCh38 as well.

If you have further question, please let me know!

Best wishes,

Marton

Victor

unread,

Jul 14, 2015, 12:41:00 PM7/14/15

to cava-us...@googlegroups.com, thed...@existencehealth.com

Thank you. CAVA is now working without issue.

Márton Münz

unread,

Jul 14, 2015, 1:45:57 PM7/14/15

to cava-us...@googlegroups.com, thed...@existencehealth.com

Great. Thanks for the feedback.

Victor

unread,

Jul 15, 2015, 1:09:13 PM7/15/15

to cava-us...@googlegroups.com, thed...@existencehealth.com

Hi Marton,

One last question - I attempted to configure CAVA to work with hg18 but had a difficult time finding dbsnp 129. I did locate files such as dbsnp_137.hg18.excluding_sites_after_129.vcf.gz, which should be the same as dbsnp 129, but when I attempted to use this file I received the following error from CAVA:

root@8810f65bd417:/mnt/data/refData# /home/cava-v1.0.0/dbprep.py -g hg18 -s 129 -d /mnt/data/refData/dbsnp_137.hg18.excluding_sites_after_129.vcf -o cavaDbSNP129NCBI36

-----------------------------------------------------------------

CAVA v1.0.0 dbprep (database preparation script) is now running.

Started: 2015-07-14 17:05:17.227135

dbSNP build: 129

All SNPs are to be retrieved!

Database: /mnt/data/refData/dbsnp_137.hg18.excluding_sites_after_129.vcf

Searching database...

Traceback (most recent call last):

File "/home/cava-v1.0.0/dbprep.py", line 403, in <module>

createFile(options)

File "/home/cava-v1.0.0/dbprep.py", line 31, in createFile

generateSNPDB(options)

File "/home/cava-v1.0.0/dbprep.py", line 224, in generateSNPDB

build=info[:info.index(';')]

ValueError: substring not found

The dbsnp files can be found here:

https://test.galaxyproject.org/library_common/download_dataset_from_folder?library_id=9c6348e21b1b5727&show_deleted=False&cntrller=library&use_panels=False&id=b853402cd242cbb5

ftp://gsapubftp...@ftp.broadinstitute.org/bundle/2.5/hg18/dbsnp_137.hg18.excluding_sites_after_129.vcf.gz

Thank you!

V

Márton Münz

unread,

Jul 20, 2015, 6:13:21 PM7/20/15

to cava-us...@googlegroups.com, thed...@existencehealth.com

Hi Victor,

Sorry for the delay!

The input file you want to use has a slightly different format than the format the dbprep tool was made for. However, it was very easy to tweak the code so that now it can read your file. Please use the attached dbprep.py tool and simply run:

"python dbprep.py -s 129 -d dbsnp_137.hg18.excluding_sites_after_129.vcf -o out"

This should generate the output you want.

Note this code will only be able to handle the particular file you attached.

Best,

Marton

On Thursday, July 9, 2015 at 9:51:11 PM UTC+2, Victor wrote:

dbprep.py

Victor

unread,

Jul 24, 2015, 2:48:40 PM7/24/15

to CAVA User Group, munz...@gmail.com

Marton - thank you again, this worked perfectly!

Reply all

Reply to author

Forward