CAVA failing for VCFs

75 views
Skip to first unread message

Victor

unread,
Jul 9, 2015, 3:51:11 PM7/9/15
to cava-us...@googlegroups.com

I have implemented CAVA and been testing it using several VCFs but CAVA fails for each VCF. The human reference FASTA being used works fine for all other variant callers. Log file is included below. Thank you for your assistance.


Input file contains 1555409 records to annotate.


Annotating variants ... 23.0%Traceback (most recent call last):
File "./cava.py", line 184, in <module>
record.annotate(ensembl,dbsnp,reference,impactdir)
File "/home/cava-v1.0.0/basics.py", line 247, in annotate
variant.annotate(ensembl,dbsnp,reference,impactdir)
File "/home/cava-v1.0.0/basics.py", line 88, in annotate
if not ensembl is None: self=ensembl.annotate(self,reference,impactdir)
File "/home/cava-v1.0.0/data.py", line 199, in annotate
variant_plus=variant.alignOnPlusStrand(reference)
File "/home/cava-v1.0.0/basics.py", line 100, in alignOnPlusStrand
seq1=reference.getReference(self.chrom,self.pos,self.pos+len(self.ref)-1+100)
File "/home/cava-v1.0.0/data.py", line 500, in getReference
seq = self.fastafile.fetch(goodchrom,start-1,end)
File "pysam/cfaidx.pyx", line 182, in pysam.cfaidx.FastaFile.fetch (pysam/cfaidx.c:3371)
ValueError: invalid region: start (198043100) > end (198022430)

Márton Münz

unread,
Jul 9, 2015, 4:00:12 PM7/9/15
to cava-us...@googlegroups.com, thed...@existencehealth.com
Hi Victor, 
Could you please attach an example VCF file that fails or send it to me to marto...@well.ox.ac.uk ?
Many thanks.
Best wishes,
Marton

Victor

unread,
Jul 10, 2015, 5:28:01 AM7/10/15
to cava-us...@googlegroups.com, thed...@existencehealth.com
Hi Marton - thank you for your response. One of the VCF files can be downloaded from this url: https://www.dropbox.com/s/vschxlrnhc688hh/2003_1.Ann.vcf.gz?dl=0

Thought the log from a subsequent run (that failed) may also be helpful - i included it below. Thank you for all your assistance.

here is some another run (log is a bit of different):


Input file contains 1555409 records to annotate.


Annotating variants ... 74.6%Traceback (most recent call last):

  File "./cava.py", line 184, in <module>

    record.annotate(ensembl,dbsnp,reference,impactdir)

  File "/home/cava-v1.0.0/basics.py", line 247, in annotate

    variant.annotate(ensembl,dbsnp,reference,impactdir)

  File "/home/cava-v1.0.0/basics.py", line 88, in annotate

    if not ensembl is None: self=ensembl.annotate(self,reference,impactdir)

  File "/home/cava-v1.0.0/data.py", line 284, in annotate

    csn_plus=csn.getAnnotation(variant_plus,transcript,reference)

  File "/home/cava-v1.0.0/csn.py", line 68, in getAnnotation

    protein = makeProteinString(variant,transcript,reference)

  File "/home/cava-v1.0.0/csn.py", line 172, in makeProteinString

    prot=transcript.getProteinSequence(reference,None)

  File "/home/cava-v1.0.0/basics.py", line 443, in getProteinSequence

    ret=Sequence(self.getCodingSequence(reference,variant)).translate(1)

  File "/home/cava-v1.0.0/basics.py", line 414, in getCodingSequence

    ret+=reference.getReference(self.chrom,exon.start+1,exon.end)

  File "/home/cava-v1.0.0/data.py", line 500, in getReference

    seq = self.fastafile.fetch(goodchrom,start-1,end)

  File "pysam/cfaidx.pyx", line 182, in pysam.cfaidx.FastaFile.fetch (pysam/cfaidx.c:3371)

ValueError: invalid region: start (114426046) > end (114364328)

Márton Münz

unread,
Jul 10, 2015, 9:51:08 AM7/10/15
to cava-us...@googlegroups.com, thed...@existencehealth.com
Hi Victor,

Thanks for sending me the VCF! 
In this particular VCF, the problem seems to be caused by the variant at the site chr3:198043089
(chr3  198043089  .  GATAAATAAATA  GATAAATAAATAAATA).
The total length of chromosome 3 in the GRCh37/hg19 reference genome is 198022430 bases, so I'm wondering how is it possible that your variant is located in a position that is not on the chromosome (i.e. more than 20kb 'downstream' the last base of chr3)? This variant causes CAVA to return the error message you got.
Please let me know what you think.
Thanks! Best wishes,
Marton

Victor

unread,
Jul 10, 2015, 2:50:19 PM7/10/15
to cava-us...@googlegroups.com, thed...@existencehealth.com
Hi Marton,

Thank you for identifying this. I'm actually using GRCh38.p3 as the reference genome. I see now that the default provided is @ensembl = exome_65_GRCh37.gz.

Are there existing files that can be obtained for GRCh38.gz (and also for GRCh37.gz) that can be used for the @emsembl configuration option for hg38 or hg17 reference genomes?

Please let me know if any other configuration changes are required when switching the configuration to use a different human reference genome.

Thank you again for your all your assistance, it is greatly appreciated!

Márton Münz

unread,
Jul 11, 2015, 5:58:52 AM7/11/15
to cava-us...@googlegroups.com, thed...@existencehealth.com
Hi Victor,
Yes, CAVA uses GRCh37 as default. However you can create any arbitrary transcript database using the included dbprep tool which allows you to specify the Ensembl release and reference genome. 
For example, to create a transcript database using Ensembl release 67 and reference genome GRCh37:
python dbprep.py -e 67 -o output
To create a transcript database based on Ensembl release 78 and reference GRCh38:
python dbprep.py -e 78 -g GRCh38 -o output

An output file output.gz (and its index file output.gz.tbi) will be generated that you can directly include at the @ensembl configuration option.

The above commands will retrieve all transcripts from the Ensembl database. Alternatively, you can include only a custom list of transcripts by using option -i. For example:
python dbprep.py -i transcripts.txt -e 78 -g GRCh38 -o output
where transcripts.txt contains your custom list of Ensembl IDs (each ID in different line) - please see examples in the documentation (cava-v1.0.0-doc.pdf).

Note that CAVA was tested only for GRCh37 but I believe it should work with GRCh38 as well.
If you have further question, please let me know!
Best wishes,
Marton

Victor

unread,
Jul 14, 2015, 12:41:00 PM7/14/15
to cava-us...@googlegroups.com, thed...@existencehealth.com
Thank you. CAVA is now working without issue.

Márton Münz

unread,
Jul 14, 2015, 1:45:57 PM7/14/15
to cava-us...@googlegroups.com, thed...@existencehealth.com
Great. Thanks for the feedback.

Victor

unread,
Jul 15, 2015, 1:09:13 PM7/15/15
to cava-us...@googlegroups.com, thed...@existencehealth.com
Hi Marton,

One last question - I attempted to configure CAVA to work with hg18 but had a difficult time finding dbsnp 129. I did locate files such as dbsnp_137.hg18.excluding_sites_after_129.vcf.gz, which should be the same as dbsnp 129, but when I attempted to use this file I received the following error from CAVA:

root@8810f65bd417:/mnt/data/refData# /home/cava-v1.0.0/dbprep.py -g hg18 -s 129 -d /mnt/data/refData/dbsnp_137.hg18.excluding_sites_after_129.vcf -o cavaDbSNP129NCBI36

-----------------------------------------------------------------
CAVA v1.0.0 dbprep (database preparation script) is now running.
Started:  2015-07-14 17:05:17.227135 

dbSNP build: 129
All SNPs are to be retrieved!

Database: /mnt/data/refData/dbsnp_137.hg18.excluding_sites_after_129.vcf
Searching database...

Traceback (most recent call last):
  File "/home/cava-v1.0.0/dbprep.py", line 403, in <module>
    createFile(options)
  File "/home/cava-v1.0.0/dbprep.py", line 31, in createFile
    generateSNPDB(options)
  File "/home/cava-v1.0.0/dbprep.py", line 224, in generateSNPDB
    build=info[:info.index(';')]
ValueError: substring not found

The dbsnp files can be found here:

Márton Münz

unread,
Jul 20, 2015, 6:13:21 PM7/20/15
to cava-us...@googlegroups.com, thed...@existencehealth.com
Hi Victor,
Sorry for the delay!
The input file you want to use has a slightly different format than the format the dbprep tool was made for. However, it was very easy to tweak the code so that now it can read your file. Please use the attached dbprep.py tool and simply run:
"python dbprep.py -s 129 -d dbsnp_137.hg18.excluding_sites_after_129.vcf -o out"
This should generate the output you want.
Note this code will only be able to handle the particular file you attached.
Best,
Marton

On Thursday, July 9, 2015 at 9:51:11 PM UTC+2, Victor wrote:
dbprep.py

Victor

unread,
Jul 24, 2015, 2:48:40 PM7/24/15
to CAVA User Group, munz...@gmail.com
Marton - thank you again, this worked perfectly!
Reply all
Reply to author
Forward
0 new messages