Re: [rtg-users] RTG output on exomes

Len Trigg

Jul 5, 2023, 7:15:05 PM

to Peter Pruisscher, RTG Users

Hi Peter,

You can really only have reliable metrics when you evaluate on the regions that are in common between your capture regions and the NIST high confidence regions (method 3). And those metrics only tell you about the accuracy in those regions, so you can't necessarily assume that the metrics will be the same outside of those regions. For example, the regions that are in your capture regions but not in the NIST high confidence regions are probably harder to call accurately (otherwise they would have been included in the NIST high confidence regions in the first place), but there's no quantification of how much harder.
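To make "the regions in common" concrete, here is a minimal pure-Python sketch of a base-pair intersection of two BED region sets. It is only an illustration, not how bedtools or vcfeval work internally. The function names and the toy coordinates are made up for the example, and it assumes plain three-column BED input (0-based, half-open intervals).

```python
def merge(intervals):
    """Sort intervals and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def read_bed(lines):
    """Parse BED lines into {chrom: merged [(start, end), ...]}."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return {chrom: merge(ivs) for chrom, ivs in regions.items()}


def intersect(a, b):
    """Return the base pairs covered by both region sets, per chromosome."""
    out = {}
    for chrom in a.keys() & b.keys():
        xs, ys = a[chrom], b[chrom]
        i = j = 0
        hits = []
        while i < len(xs) and j < len(ys):
            start = max(xs[i][0], ys[j][0])
            end = min(xs[i][1], ys[j][1])
            if start < end:
                hits.append((start, end))
            # Advance whichever interval ends first.
            if xs[i][1] < ys[j][1]:
                i += 1
            else:
                j += 1
        if hits:
            out[chrom] = hits
    return out


# Toy data (hypothetical coordinates, not from any real bait set).
capture = read_bed(["chr1\t100\t200", "chr1\t300\t400", "chr2\t0\t50"])
confident = read_bed(["chr1\t150\t350", "chr3\t0\t10"])
common = intersect(capture, confident)
```

In the toy data, only the parts of chr1 covered by both files survive; chr2 and chr3 drop out because each appears in only one file. Metrics computed on such an intersection say nothing about the capture-only remainder.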

Cheers,

Len.

On Tue, 4 Jul 2023 at 08:44, Peter Pruisscher <peter.pr...@scilifelab.se> wrote:

Dear everyone,
I have a question about using exome samples with the rtg vcfeval tool, and I was hoping you could help me decide how to evaluate them. I have a sample for which the exome has been sequenced (using a Twist KAPA bait set). I have run vcfeval on this sample in three ways: 1) against the golden truth set of a NIST sample (whole-genome coordinates), 2) restricted to the Twist bait capture BED file (which should cover all regions present in the exome FASTQ files), and 3) restricted to the NIST golden truth set BED regions that fall within the Twist bait set BED file (bedtools intersect, keeping all NIST base pairs that overlap the exome BED file).
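For readers following along, the three runs might be set up roughly as in the sketch below. The exact commands used are not given in this thread, so everything here is an assumption: the file names are hypothetical, run 1 is shown without any region restriction, and the regions are passed through vcfeval's `--evaluation-regions` option. Check `rtg vcfeval --help` for your version before relying on it. The helper only builds the argument lists and does not run anything.

```python
def vcfeval_command(baseline, calls, template, output, regions=None):
    """Build an argument list for `rtg vcfeval` (hypothetical helper)."""
    cmd = ["rtg", "vcfeval",
           "-b", baseline,   # truth set VCF
           "-c", calls,      # called variants VCF
           "-t", template,   # reference in SDF format
           "-o", output]     # output directory
    if regions is not None:
        # Restrict the evaluation to regions in a BED file.
        cmd += ["--evaluation-regions", regions]
    return cmd


# Hypothetical file names; output directory names double as run labels.
runs = {
    "run1_whole_genome": None,
    "run2_capture": "capture.bed",
    "run3_intersection": "capture_and_confident.bed",
}
commands = {
    name: vcfeval_command("truth.vcf.gz", "calls.vcf.gz", "reference.sdf",
                          name, regions=bed)
    for name, bed in runs.items()
}
for cmd in commands.values():
    print(" ".join(cmd))
```

Each list could then be handed to `subprocess.run(cmd, check=True)` on a machine where RTG Tools is installed.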

The output of run 1 is not very good, but this is expected, as many variants are missing because they are not captured by the exome approach. For the second run, using the Twist BED file, precision and F-measure are around 0.9, and there are more false positives than expected (false positives are about 15% of the total number of true positive baseline). For the third run, with the most restricted BED file, I get very high numbers, showing that the NIST regions within the exome bait set have a very high precision.
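As a quick sanity check on those numbers, the standard definitions can be sketched as below. The counts are hypothetical, chosen only so that false positives are 15% of true positives, and the sketch assumes the true-positive count is about the same on the baseline and call side.

```python
def precision(tp, fp):
    """Fraction of called variants that match the truth set."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of truth-set variants that were called."""
    return tp / (tp + fn)


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical counts: 150 false positives per 1000 true positives.
p = precision(tp=1000, fp=150)
print(round(p, 3))  # 0.87
```

So a false-positive count of about 15% of the true positives corresponds to a precision of roughly 0.87, which is in line with the "around 0.9" reported for the second run.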

In my mind I should be using the Twist bait set BED file, as it represents the sequences that are actually present in the data. However, I see a high number of false positives when I do this. When I intersect the two files I see far fewer false positives, but I don't know whether I am misrepresenting the data that way. Could I ask for your thoughts on this?

Thank you for your time.
Best, Peter


Peter Pruisscher

Jul 7, 2023, 2:28:16 AM

to RTG Users, Len Trigg, Peter Pruisscher

Hi Len,

Thank you for the explanation, this is also how I will move forward. Have a good summer!

Best,

Peter
