VCF Eval separate ROC curves for indels and SNPs

Felix Jackson

Feb 8, 2017, 4:22:00 PM

to RTG Users

Hi,

I'd like to generate separate precision-recall curves for indels and SNPs called in my VCFs. Is there any way to persuade VCF Eval to do this?

At the moment the output is data for a mixed weighted ROC curve only, and then TPs and FPs for Indels and SNPs separately.

Thanks,

Felix

Sean Irvine

Feb 8, 2017, 4:31:31 PM

to Felix Jackson, RTG Users

Hi Felix,

Provided you are using a recent release, the vcfeval output directory should contain files named "snp_roc.tsv.gz" and "non_snp_roc.tsv.gz". These files contain the information needed to generate both ROC and precision-sensitivity curves for SNPs and indels separately.

For example, try

rtg rocplot -P vcfeval-output/snp_roc.tsv.gz vcfeval-output/non_snp_roc.tsv.gz

Sean.

Sean Irvine

Feb 8, 2017, 10:02:45 PM

to Felix Jackson, RTG Users

Hi Felix,

You can find the version using "rtg version".

In order to categorize variants as SNPs or indels, either the baseline representation or the calls representation could be used. These two representations can be quite different. For example, a SNP in one representation could correspond to an insertion followed by a deletion in the other. Real examples can be considerably more complicated, with multiple call events matching multiple baseline events in such a way that no smaller subset of the events can be matched on its own.
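To make the representation point concrete, here is a minimal Python sketch. It assumes a simplified variant model of (0-based position, ref allele, alt allele) tuples in which alleles may be empty; this is not VCF parsing and not how vcfeval works internally, and the function name and the toy reference are made up for illustration. It shows one SNP and an insertion-plus-deletion pair producing the same haplotype.

```python
def apply_variants(reference, variants):
    """Apply (pos, ref, alt) edits to a reference string; pos is 0-based.

    Simplified model (an assumption for illustration, not VCF semantics):
    alleles may be empty, and variants must be sorted by position and must
    not overlap.
    """
    out = []
    cursor = 0
    for pos, ref, alt in variants:
        # Check that the stated ref allele really occurs at this position.
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"ref allele {ref!r} does not match at {pos}")
        out.append(reference[cursor:pos])  # unchanged bases before the edit
        out.append(alt)                    # replacement bases
        cursor = pos + len(ref)            # skip over the replaced bases
    out.append(reference[cursor:])         # unchanged tail
    return "".join(out)


reference = "ACGT"  # toy reference sequence

# Representation 1: a single SNP, C -> T at position 1.
as_snp = [(1, "C", "T")]

# Representation 2: insert T at position 1, then delete the C at position 1.
as_indels = [(1, "", "T"), (1, "C", "")]

hap_snp = apply_variants(reference, as_snp)
hap_indels = apply_variants(reference, as_indels)
print(hap_snp, hap_indels, hap_snp == hap_indels)
```

Both representations describe the same sequence, yet one would be counted as a SNP and the other as two indels.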

In the "snp_roc.tsv.gz" and "non_snp_roc.tsv.gz" outputs of vcfeval we opted to categorize variants with respect to the calls (because the calls have the scores needed for producing the curves). This means we do have well-defined sets of true-positive and false-positive calls (i.e. the call either matched something in the baseline or it did not), and hence a well-defined precision = tp / (tp + fp). But there is no correspondingly well-defined count of false negatives (because the representation problem means we cannot decide how many fn's are SNPs versus indels, at least not without introducing a representation bias to the result), hence we cannot calculate a recall using recall = tp / (tp + fn). For this reason we don't include precision, recall, and F-measure in the split-out files. Since the "weighted_roc.tsv.gz" file covers all variants, the decision problem does not arise and precision, recall, and F-measure can be calculated.

(It would be possible to flip this around and categorize events with respect to the baseline. Then you again have well-defined true-positives, and now the false-negatives are well-defined, but the false-positives are no longer well-defined and the baseline variants do not have the score.)

To produce approximate curves from "snp_roc.tsv.gz" and "non_snp_roc.tsv.gz", rocplot uses total-baseline-events - true-positives as a proxy for the false-negatives.
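As a rough sketch of that approximation in Python: the function name, the row layout (score, cumulative true positives, cumulative false positives), and the numbers below are all assumptions made up for illustration, not the actual file format or rocplot's code. Only the two formulas come from the explanation above.

```python
def approximate_curve(rows, total_baseline):
    """Return (score, precision, approx_sensitivity) for each row.

    rows: iterable of (score, tp, fp) cumulative counts (hypothetical layout).
    total_baseline: total number of baseline events (supplied by the caller).
    """
    points = []
    for score, tp, fp in rows:
        # Precision is well defined when categorizing by the calls.
        precision = tp / (tp + fp) if (tp + fp) > 0 else None
        # Proxy for false negatives: baseline events not yet matched.
        fn_proxy = total_baseline - tp
        sensitivity = tp / (tp + fn_proxy) if total_baseline > 0 else None
        points.append((score, precision, sensitivity))
    return points


# Illustrative numbers only, not real vcfeval output.
rows = [(90, 40, 0), (50, 80, 20), (10, 90, 60)]
curve = approximate_curve(rows, total_baseline=100)
for score, precision, sensitivity in curve:
    print(score, precision, sensitivity)
```

Note that tp + fn_proxy is simply the total baseline count, so the approximate sensitivity reduces to tp / total_baseline.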

Sean.

On 9 February 2017 at 11:10, Felix Jackson <fojac...@gmail.com> wrote:

Hi Sean,

Thanks for your response. I'm not sure which release I'm using; I'll try to find out. At the moment the tool is outputting the two files you mention, "snp_roc.tsv.gz" and "non_snp_roc.tsv.gz", but within these files there is just TP and FP data, which is not sufficient to make ROC/PR curves. These files do not contain any data on sensitivity/precision/F-measure.

Felix

Sean Irvine

Feb 10, 2017, 3:29:07 PM

to Felix Jackson, RTG Users

Hi Felix,

By default vcfeval uses the GQ attribute as the scoring field. GQ makes more sense than QUAL for multisample VCFs as it is per sample rather than the record as a whole.

Scoring by any other numeric field is easily done using the -f (--vcf-score-field) command line option to vcfeval. Doing "-f QUAL" will use the QUAL score.

Some callers have a machine learning score that is better than GQ: with the RTG caller we have an AVR score, and with GATK you can use VQSLOD. I can't recall off the top of my head whether FreeBayes has such a score.

For some scores you might also need to change the sort order (using --sort-order). The default ordering assumes that bigger scores are better, which is correct for QUAL and GQ.
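To illustrate why the score field and sort order matter, here is a small Python sketch of how ROC points can be built by sweeping a threshold over scored calls. It is an illustration of the general idea only, not vcfeval's implementation; the function name, the higher_is_better parameter, and the example calls are hypothetical.

```python
def roc_points(calls, higher_is_better=True):
    """Build cumulative (threshold, tp, fp) points from scored calls.

    calls: list of (score, is_true_positive) pairs (hypothetical input).
    higher_is_better: rank the best calls first; set to False for scores
    where smaller values indicate more confident calls.
    """
    ranked = sorted(calls, key=lambda c: c[0], reverse=higher_is_better)
    points = []
    tp = fp = 0
    for i, (score, is_tp) in enumerate(ranked):
        if is_tp:
            tp += 1
        else:
            fp += 1
        # Emit one point per distinct score so tied calls share a point.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((score, tp, fp))
    return points


# Illustrative calls only: (score, matched the baseline?)
calls = [(30, True), (20, True), (20, False), (5, False)]
descending = roc_points(calls)
ascending = roc_points(calls, higher_is_better=False)
print(descending)
print(ascending)
```

With the wrong ordering the false positives are accumulated first, so the resulting curve looks much worse even though the calls are identical.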

Sean.

On 11 February 2017 at 08:50, Felix Jackson <fojac...@gmail.com> wrote:

Hi Sean,

Thanks for the clarification. I now understand why separate ROC curves cannot be generated for SNPs and indels. I now have another question, however: what field in the VCF file does vcfeval use to generate the data for ROC curves? I'm trying to use vcfeval with FreeBayes, but it won't produce the same ROC curve data (from iteratively adjusting a threshold) as it does for VCF files from other variant callers, such as Unified Genotyper and Platypus.

I assumed that vcfeval would use the QUAL score in VCFs, but is this the case?

Is there any way I can change what metric it uses to iteratively adjust a threshold and call variants accordingly?

Felix
