A question about validation of imputation


Sajal Sthapit

Dec 19, 2022, 11:07:26 AM
to STITCH imputation
Dear Robbie,

I am using STITCH to impute a breeding population of 1400 perennial intermediate wheatgrass plants, for testing across 6 generations. I have higher-coverage (5x-30x) sequencing data (BAM files) for 47 plants. To test imputation accuracy I have been doing the following:

1) Downsample the high coverage data to a lower coverage of 0.2x or 0.5x and impute it together with the whole population as well as the high coverage samples. Then after imputation compare the concordance between the high coverage sample genotypes and the corresponding downsampled sample genotypes.

2) Remove high coverage samples (5 at a time) and run STITCH again. Then compare the concordance of the 5, 10, 15, and so on downsampled sample genotypes with the genotypes of the high coverage samples from step (1).
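The concordance comparison in steps (1) and (2) could be sketched roughly as follows. This is only a minimal illustration, not anything from STITCH itself: it assumes genotypes have already been extracted from the VCFs as 0/1/2 alternate-allele counts, with `None` for a missing call, and the function name `concordance` is made up for this example.

```python
from typing import Optional, Sequence, Tuple


def concordance(
    truth: Sequence[Optional[int]],
    imputed: Sequence[Optional[int]],
) -> Tuple[float, float]:
    """Compare two genotype vectors for one sample across sites.

    Genotypes are assumed to be alternate-allele counts (0, 1, 2),
    with None marking a missing call. Returns a pair:
    (fraction of compared sites that match, fraction of sites skipped
    because either call was missing).
    """
    if len(truth) != len(imputed):
        raise ValueError("genotype vectors must have the same length")

    compared = 0
    matches = 0
    for t, g in zip(truth, imputed):
        # Sites missing in either call set are excluded from the comparison.
        if t is None or g is None:
            continue
        compared += 1
        if t == g:
            matches += 1

    total = len(truth)
    missing_fraction = 1.0 - compared / total if total else 0.0
    conc = matches / compared if compared else float("nan")
    return conc, missing_fraction
```

Reporting the missing fraction next to the concordance keeps it visible how many sites were dropped before the comparison.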

I have been getting a concordance of ~98% after removing the 30-40% of calls that are missing. The question I have is:
  • Is it valid to assess imputation accuracy by comparing genotypes of downsampled samples called by STITCH with the genotypes of high coverage samples also called using STITCH? Or do the high coverage sample genotypes need to be called with a third-party genotype caller other than STITCH, e.g. bcftools mpileup, to be used for validation?
Thank you for your help and happy holidays.

Sajal Sthapit

Robbie Davies

Dec 20, 2022, 4:42:00 AM
to Sajal Sthapit, STITCH imputation
Hi Sajal,

To check, do you know the founding structure of the population, and do your 47 samples cover the founding specimens? Then you might be able to consider alternate imputation approaches. I'll assume not.

Your approach (2) sounds good: I would only recommend comparing genotypes for samples whose high coverage data were not included in the imputation.

Concordance of ~98% sounds good, though in general I would recommend r2. Given you have 47 samples, you could compute it per site, which is what I would recommend. I would also recommend using the dosage if you are able to.
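As a rough sketch of the per-site r2 idea: for each site, take the squared Pearson correlation across samples between the imputed dosage and the high coverage genotype. The code below assumes the data have already been read into plain lists (one list per site, one value per sample, no missing values); the names `site_r2` and `per_site_r2` are hypothetical and not part of STITCH or bcftools.

```python
from typing import List, Sequence


def site_r2(truth: Sequence[float], dosage: Sequence[float]) -> float:
    """Squared Pearson correlation at one site, across samples.

    truth:  high coverage genotypes coded as 0, 1, 2 (alternate-allele count)
    dosage: imputed dosages in the range 0 to 2 for the same samples
    Returns NaN when either vector has no variation (e.g. a monomorphic site).
    """
    n = len(truth)
    if n != len(dosage):
        raise ValueError("truth and dosage must have the same length")
    if n < 2:
        raise ValueError("need at least two samples")

    mean_t = sum(truth) / n
    mean_d = sum(dosage) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(truth, dosage))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_d = sum((d - mean_d) ** 2 for d in dosage)

    if var_t == 0 or var_d == 0:
        return float("nan")  # correlation is undefined without variation
    return (cov * cov) / (var_t * var_d)


def per_site_r2(
    truth_by_site: Sequence[Sequence[float]],
    dosage_by_site: Sequence[Sequence[float]],
) -> List[float]:
    """Apply site_r2 to each site; inputs are lists of per-site sample vectors."""
    if len(truth_by_site) != len(dosage_by_site):
        raise ValueError("site counts differ")
    return [site_r2(t, d) for t, d in zip(truth_by_site, dosage_by_site)]
```

Sites that come back as NaN (no variation among the validation samples) would need to be set aside before summarising, for example before averaging r2 within allele-frequency bins.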

I would use third-party genotyping software other than STITCH to generate the high coverage genotypes. Either GATK UnifiedGenotyper (yes, it's very old) or bcftools mpileup would be fine (these two perform very similarly; bcftools is probably easier).

Best,
Robbie
