A few more unfixed bugs in the display of samtools mpileup VCF 4.2 output

54 views
Skip to first unread message

Open Genomes

unread,
Aug 31, 2016, 1:09:17 PM8/31/16
to UCSC Genome Browser List
Hello,

There is still a major bug in the support of the VCF output of samtools mpileup:

##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=DPR,Number=R,Type=Integer,Description="Number of high-quality bases observed for each allele">
##INFO=<ID=DPR,Number=R,Type=Integer,Description="Number of high-quality bases observed for each allele">


This throws the following bad format error:


The Number=R is not supported by the regular expression, which only supports digits, A, G, and ".".

"R" signifies the number of alleles plus the Reference. 

See here:

As part of the VCF 4.2 specification, bcftools norm uses "R" when splitting multiallelic variants into multiple biallelics. With the '.' this isn't done correctly, but with the 'R' it is.



Non-fatal errors:

If the VCF output of samtools merge is used directly (with the above correction for Number="R") without being piped through 
bcftools call --ploidy GRCh37 -c
to normalize the VCF, the VCF displays but the SNPs appear as 

  -|
<x>|


The VCF line that causes it:

1 5335785 rs16840442 G <X> 0 . DP=6;I16=4,2,0,0,222,8246,0,0,222,8214,0,0,74,1130,0,0;QS=1,0;MQSB=1;MQ0F=0;DPR=6,0 PL:DP:DV:DPR 0,18,177:6:0:6,0

When the variant is clicked on to get the detail following error is shown at the top of the next page:

Warning/Error(s):
  • FORMAT column should begin with "GT" but begins with "PL"
The genotype is:
Reference allele: (G)-
Alternate allele(s): (G)<X> (Represents allele(s) other than observed.)

Genotype count: 1 (0 phased)
Alleles: (G)-: 2 (100.000%); (G)<X>: 0 (0.000%)

Sample IDGenotypePhased?PLDPDVDPR
SM(G)-/(G)-n0, 18, 177606, 0

screenshot:

Another such VCF line, this one with a known ALT allele:

1 5335794 rs2986188 G A,<X> 0 . DP=6;I16=2,1,1,2,110,4034,115,4419,111,4107,111,4107,39,795,58,1166;QS=0.497738,0.502262,0;VDB=0.131948;SGB=-0.511536;RPB=0.28717;MQB=0.861511;MQSB=0.861511;BQB=0.574341;MQ0F=0;DPR=3,3,0 PL:DP:DV:DPR 87,0,86,97,95,179:6:3:3,3,0


the resulting variant detail page, with the same error, but a somewhat different result.

The Alternate allele should be "A":

Reference allele: (G)-
Alternate allele(s): (G)-, (G)<X> (Represents allele(s) other than observed.)

Genotype count: 1 (0 phased)
Alleles: (G)-: 2 (100.000%); (G)-: 0 (0.000%)

Sample IDGenotypePhased?PLDPDVDPR
SM(G)-/(G)-n87, 0, 86, 97, 95, 179633, 3, 0

screenshot:


It would seem that only the first "<X>" (unknown, non-Reference) allele is shown, while the second allele is ignored. It may be possible to normalize the alleles using the QS field with Number=R.



A minor error:

The ##INFO line for this field is parsed incorrectly. The regex for the string between double quotes is greedy and erroneously goes to the last double quote, and includes the name and value of the next field:


VDB:0.131948  Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3


Thank you, and hopefully these last fixes will enable the direct support of VCF 4.2 generated by samtools mpileup, and the direct display of these files from Galaxy.

Ted Kandell

Matthew Speir

unread,
Sep 2, 2016, 5:56:23 PM9/2/16
to Open Genomes, UCSC Genome Browser List
Hi Ted,

Thank you for your question about VCF 4.2 support in the UCSC Genome Browser.

We are looking into support for this new version of the VCF format and hope to implement it into a future release of our software. However, we can't provide an estimate of when that may be at this time.

If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group
--


Reply all
Reply to author
Forward
0 new messages