Extraction of the ancestral haplotypes

90 views
Skip to first unread message

Saul Pierotti

unread,
Nov 8, 2022, 9:30:20 AM11/8/22
to STITCH imputation
Hi,

I was wondering whether it is possible to examine the reconstructed ancestral haplotypes after running STITCH. For example, if STITCH can produce a vcf file containing them. Did not find such information in the documentation.

I explain my use-case:
I have shallow sequencing data for a set of F2 crosses in medaka fish (~1800 individuals). Furthermore, I have high-coverage sequences for the F0 inbred lines used in the crosses.
I would like to run STITCH on the F2 cohort and see how well the reconstructed haplotypes match the F0 genotypes that I already have.

Thanks for your help!

Cheers, Saul

Robbie Davies

unread,
Nov 9, 2022, 4:43:25 AM11/9/22
to Saul Pierotti, STITCH imputation
Hi,

To answer your immediate question, I don't seem to have an easy flag that outputs this information, but you can get this from the temporary / intermediate files, using something like the following

 > load(file.path(outputdir, "RData", "EM.all.chr5.RData")) ## replace chr5 as appropriate

!> cbind(pos, t(eHapsCurrent_tc[, , 1])) 

     CHR POS REF ALT      1      2

 1  chr5   1   A   G 0.9999 0.0001

 2  chr5   2   A   G 0.0001 0.9999

 3  chr5   3   A   G 0.0001 0.0001

 4  chr5   4   A   G 0.9999 0.0001

 5  chr5   5   A   G 0.9999 0.0001

 6  chr5   6   A   G 0.9999 0.0001

 7  chr5   7   A   G 0.0001 0.9999

 8  chr5   8   A   G 0.9999 0.0001

 9  chr5   9   A   G 0.9999 0.0001

 10 chr5  10   A   G 0.9999 0.0001

pos is a file with one row per SNP, containing chr pos ref and alt, and eHapsCurrent_tc is a 3 dimensional object where row = haplotype (here K=2), column = SNP, and third dimension = value of s (set to 1, unless you set S > 1, which I don't think is very common). entries of eHapsCurrent_tc indicate dosage for the alternate allele for that haplotype

Note that if you impute in regions, there will be a discontinuity between regions, as e.g. inferred ancestral haps 1 and 2 might swap to 2 and 1 on a subsequent region 

More generally, if you have phased data from all the F0's, then STITCH supports using haplotype reference input, and you can use this exactly in a single iteration, see this for more information
You could also use QUILT in this setting, though if the number of haplotypes is small, STITCH is more exact

Best,
Robbie






--
You received this message because you are subscribed to the Google Groups "STITCH imputation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stitch-imputat...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/stitch-imputation/1436fba5-72f7-4850-a8bc-59d9b145aad4n%40googlegroups.com.

Anthony D. Long

unread,
Nov 10, 2022, 11:20:50 AM11/10/22
to STITCH imputation
I think this a very interesting problem that plenty of people are interested in (without knowing it).  Lots of people (for example) would like to genotype individual in a complex cross with parents known from low pass sequencing.  Stitch seems very good at imputing SNPs from low-pass data.  It seems like a simpler strategy would be to take those imputed genotypes and feed them to other programs more explicitly designed for the phasing.  In my mind one of the hurdles is that these other programs often do not like Stitch's {correct} view of the universe that genotypes should be represented as a real number between zero and two, whereas VCF (and the GATK) tend to put these hard-encoded genotypes as the first item in list of info in a VCF file.  Further many of the downstream applications are tied to this historical idea of genotypes coming from SNPchips (where genotypes are again hard-encoded).

We have had luck just sliding through a genome using windows of several hundred SNPs and estimating founder genotype probabilities using limSolve in R.   This is a good strategy when you have known inbred founders. 

Saul Pierotti

unread,
Mar 4, 2023, 9:44:53 AM3/4/23
to STITCH imputation
Following back from this message, I have another question that could be relevant. I have some high-coverage F1 sequencing data (30x) and ~1600 F2 sequencing files at low coverage (0.5-3x).

Do you think would it be possible/advisable to estimate the haplotypes in the F1 population, and then use them as reference haplotypes in the F2 imputation?
Or what about running all the samples together, with the F1s providing additional information for haplotype reconstruction? I am only interested in the F2 imputation at the end.
In this second scenario, do you think that the fact that the F1s would in fact need nGen=1 and the F2s nGen=2 (assuming 2 haplotypes in the F0s) would cause a problem if I set nGen=2 overall?

I did some additional work in the meanitime and I am ruling out the use of the F0 haplotypes/sequencing data for unrelated reasons.

Robert Davies

unread,
Apr 4, 2023, 5:04:52 AM4/4/23
to STITCH imputation
Sorry I missed this. If I ever don't respond in a timely fashion you can email me directly. I don't know what it is with my email but I seem to miss github and google group notifications 

tdlong yes I tried to do the right thing even if other software isn't normally expecting it. I do output the hard called genotypes as well. 

Saul re: your question, "nGen" scales the recombination rate. Imputation is not very sensitive to this parameter. It can probably be off by an order of magnitude without any issue. Otherwise, yes I think joint imputation of F1 and F2 individuals sounds reasonable and simple. If you wanted to get fancy, you could try genotyping and phasing the F1 individuals, then building a small reference panel from that, then inputting that to STITCH directly. My expectation is that imputing using just the F2's will be enough, though with just the F2's, there might be some phase switch errors that clever use of F1 and possibly F2 could solve (this is a lingering issue in STITCH, how best to heuristically find and fix phase switch errors in the ancestral haplotypes. STITCH is good but sometimes still makes mistakes)

Robbie

Reply all
Reply to author
Forward
Message has been deleted
0 new messages