Hello,
There is still a major bug in the support of the VCF output of samtools mpileup:
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=DPR,Number=R,Type=Integer,Description="Number of high-quality bases observed for each allele">
##INFO=<ID=DPR,Number=R,Type=Integer,Description="Number of high-quality bases observed for each allele">
This throws the following bad format error:
The Number=R is not supported by the regular expression, which only supports digits, A, G, and ".".
"R" signifies the number of alleles plus the Reference.
See here:
As part of the VCF 4.2 specification, bcftools norm uses "R" when splitting multiallelic variants into multiple biallelics. With the '.' this isn't done correctly, but with the 'R' it is.
Non-fatal errors:
If the VCF output of samtools merge is used directly (with the above correction for Number="R") without being piped through
bcftools call --ploidy GRCh37 -c
to normalize the VCF, the VCF displays but the SNPs appear as
-|
<x>|
The VCF line that causes it:
1 5335785 rs16840442 G <X> 0 . DP=6;I16=4,2,0,0,222,8246,0,0,222,8214,0,0,74,1130,0,0;QS=1,0;MQSB=1;MQ0F=0;DPR=6,0 PL:DP:DV:DPR 0,18,177:6:0:6,0
When the variant is clicked on to get the detail following error is shown at the top of the next page:
Warning/Error(s):- FORMAT column should begin with "GT" but begins with "PL"
The genotype is:
Reference allele: (G)-
Alternate allele(s): (G)<X> (Represents allele(s) other than observed.)
Genotype count: 1 (0 phased)
Alleles: (G)-: 2 (100.000%); (G)<X>: 0 (0.000%)
Sample ID | Genotype | Phased? | PL | DP | DV | DPR |
---|
SM | (G)-/(G)- | n | 0, 18, 177 | 6 | 0 | 6, 0
|
|
|
screenshot:
Another such VCF line, this one with a known ALT allele:
1 5335794 rs2986188 G A,<X> 0 . DP=6;I16=2,1,1,2,110,4034,115,4419,111,4107,111,4107,39,795,58,1166;QS=0.497738,0.502262,0;VDB=0.131948;SGB=-0.511536;RPB=0.28717;MQB=0.861511;MQSB=0.861511;BQB=0.574341;MQ0F=0;DPR=3,3,0 PL:DP:DV:DPR 87,0,86,97,95,179:6:3:3,3,0
the resulting variant detail page, with the same error, but a somewhat different result.
The Alternate allele should be "A":
Reference allele: (G)-
Alternate allele(s): (G)-, (G)<X> (Represents allele(s) other than observed.)
Genotype count: 1 (0 phased)
Alleles: (G)-: 2 (100.000%); (G)-: 0 (0.000%)
Sample ID | Genotype | Phased? | PL | DP | DV | DPR |
---|
SM | (G)-/(G)- | n | 87, 0, 86, 97, 95, 179 | 6 | 3 | 3, 3, 0 |
screenshot:
It would seem that only the first "<X>" (unknown, non-Reference) allele is shown, while the second allele is ignored. It may be possible to normalize the alleles using the QS field with Number=R.
The ##INFO line for this field is parsed incorrectly. The regex for the string between double quotes is greedy and erroneously goes to the last double quote, and includes the name and value of the next field:
VDB: | 0.131948 | Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3 |
Thank you, and hopefully these last fixes will enable the direct support of VCF 4.2 generated by samtools mpileup, and the direct display of these files from Galaxy.
Ted Kandell