Hi,
Looks like there are a few different things going on here.
To produce a meaningful ROC, vcfeval needs a score on which to apply thresholding. By default it looks for a genotype quality score (GQ), but you can select any other scoring field on the command line with the "--vcf-score-field" option. FreeBayes does not produce a GQ by default, so you either need to enable it during calling (with FreeBayes, use "--genotype-qualities") or select some other score that exists in the FreeBayes output. In the absence of a viable score, only an end-point can be produced, which is the last line of the summary you posted.
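To illustrate why a score is needed, here is a minimal Python sketch of score thresholding. It is not vcfeval's actual implementation: the function name roc_points, the (score, is_true_positive) input layout, and the handling of missing scores are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

Call = Tuple[Optional[float], bool]  # (score or None, matched the baseline?)


def roc_points(calls: List[Call]) -> List[Tuple[Optional[float], int, int]]:
    """Return (threshold, true_positives, false_positives) per distinct score.

    Hypothetical helper for illustration only, not part of any RTG tool.
    """
    scored = [c for c in calls if c[0] is not None]
    if not scored:
        # No usable score: all calls collapse into a single end-point.
        tp = sum(1 for _, ok in calls if ok)
        return [(None, tp, len(calls) - tp)]

    # Calls lacking a score are skipped in this sketch (a simplification).
    scored.sort(key=lambda c: c[0], reverse=True)
    points = []
    tp = fp = 0
    for i, (score, ok) in enumerate(scored):
        if ok:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last call sharing this score.
        if i + 1 == len(scored) or scored[i + 1][0] != score:
            points.append((score, tp, fp))
    return points


if __name__ == "__main__":
    with_scores = [(50.0, True), (40.0, True), (40.0, False), (10.0, False)]
    print(roc_points(with_scores))
    without_scores = [(None, True), (None, False), (None, True)]
    print(roc_points(without_scores))
```

With scores present, each distinct threshold gives one point on the curve; with no scores at all, the only thing that can be reported is the final totals.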
By default vcfeval evaluates over the full range (union) of the baseline and calls. However, often you will want to restrict the evaluation to particular regions for a variety of reasons, such as for exome calling, to exclude decoy sequences etc., or because the baseline is only defined for certain regions of the genome. This can be done with the "--evaluation-regions" or "--bed-regions" option, depending on your purpose. In particular, the NIST baselines have well-defined high-confidence regions, and evaluation should be restricted to those regions by supplying the corresponding BED file from NIST. Further, if the calls are for exome data, you probably want to restrict to the intersection of the NIST high-confidence regions and the exome capture regions (use bedtools or similar to form these intersections). The user manual section for vcfeval explains in more detail the various scoring and region restriction options.
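For anyone unsure what "intersection" means for BED regions, the following Python sketch shows the idea on toy data. In practice you would use bedtools or similar, as noted above; the function name intersect and the coordinates are made up for illustration, and the sketch assumes each input set has no overlapping intervals within itself.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED
Regions = Dict[str, List[Interval]]  # chromosome name -> intervals


def intersect(a: Regions, b: Regions) -> Regions:
    """Return the regions covered by both a and b.

    Hypothetical helper; assumes intervals within each input are
    non-overlapping (already merged).
    """
    result: Regions = {}
    for chrom in a:
        if chrom not in b:
            continue
        xs, ys = sorted(a[chrom]), sorted(b[chrom])
        i = j = 0
        out: List[Interval] = []
        while i < len(xs) and j < len(ys):
            start = max(xs[i][0], ys[j][0])
            end = min(xs[i][1], ys[j][1])
            if start < end:
                out.append((start, end))
            # Advance whichever interval finishes first.
            if xs[i][1] < ys[j][1]:
                i += 1
            else:
                j += 1
        if out:
            result[chrom] = out
    return result


if __name__ == "__main__":
    # Made-up coordinates, not real NIST or capture-kit regions.
    high_conf = {"chr1": [(100, 500), (800, 1000)], "chr2": [(0, 50)]}
    exome = {"chr1": [(400, 900)], "chr3": [(0, 10)]}
    print(intersect(high_conf, exome))
```

The resulting regions, written out as a BED file, are what you would then supply to vcfeval to restrict the evaluation.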