Assistance with parameterisation


Josquin Tibbits

Nov 9, 2020, 3:34:27 AM
to STITCH imputation
Hi. We have been setting up to use STITCH to impute a set of whole-genome sequenced samples for a collection of forest tree samples. Despite the genome size being a modest 0.53 Gbp, it is super polymorphic, and the genome resources are still a bit limited, with the reference genome still under construction (by me as well). I am using a provisional version of the genome; it is now scaffolded to a chromosome scale with 11 chromosomes, but under the hood there are many local issues, and these affect the read alignments and thus the input data for imputation. That is just to note for consideration, or to allow more questions in that sphere.

SNP discovery has been done using the GATK pipeline and best practices on a carefully sampled subset of 138 tree samples from across the natural distributional range of the species. This seems to have worked fairly well, and so far I have just looked at chromosome 9 (39,381,439 bp), which had 2,775,942 biallelic SNPs discovered. I have not done much filtering beyond this, as I was hoping STITCH would help with that, as described in the 2016 paper.

The main analysis is across a set of about 4,600 samples, which have a coverage range between 0.01x and around 10x, with most between 1x and 6x. For a 1 Mb region the coverage quantiles are:

  5%     25%     50%     75%     95%
 0.25   0.976   1.641   2.171   3.539

The majority of the samples are from a breeding program, and there are lots of relationships across 4 generations. The breeding program was started in the 1980s from a collection from the native forests, with a founding set of 86 mothers and up to (but an unknown number of) 368 fathers (up to 6 selections from each mother). The sampling includes over 80% of selected parents in each generation except generation 1 (Gen 1 - 386, Gen 2 - 360, Gen 3 - 7), and then there are many sibs, half-sibs and offspring from these parents (~3,000 of the samples). In addition there are a few hundred samples which are relatives of the founding parents (sibs) and other trees from the native range. So in effect we have a sample which you could say was founded 3 generations ago from 400-500 founders. In the native range the Ne is extremely high, with LD breaking down on average within 5,000 bp, and often common SNPs next to each other are not in LD at all. Estimates put Ne at between 100,000 and 2-5 million!

Anyway, we have been slowly working out how to run STITCH, and it seems to be able to cope with regions of about 1-2 Mbp, which have on the order of 100-200K SNPs, but it certainly doesn't like a K of 400. My main issue is in setting reasonable parameters for K and nGen.
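
For concreteness, this is roughly the shape of the per-region call we have been testing (the file paths, region coordinates, K and nGen below are just placeholders, not settled choices):

library("STITCH")
STITCH(
  chr = "Chr09",                     # placeholder chromosome name
  regionStart = 1000000,             # we run 1-2 Mbp windows
  regionEnd = 2000000,
  buffer = 100000,
  bamlist = "bamlist.txt",           # placeholder: one BAM path per line
  posfile = "chr09_pos.txt",         # placeholder: chr, pos, ref, alt from the GATK calls
  outputdir = "stitch_chr09/",
  nCores = 8,
  K = 40,                            # the parameter we are unsure about
  nGen = 100                         # likewise
)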

Thanks, Josquin

Robbie Davies

Nov 9, 2020, 6:54:06 AM
to Josquin Tibbits, STITCH imputation
Hi Josquin,

Wow that's quite the fun research project! And quite the neat scientific background. 

Times like this I wish I had free time to build a different version of STITCH that I think would work better in these kinds of situations. But alas, I am overwhelmed with teaching and marking, etc. That said, I think there are ways you can tweak STITCH that are likely relevant to your situation.

K=40 (or even K=20) should hopefully still be OK. What you might need to tweak is as follows.

First, you might need to set nGen much higher than you would anticipate. At K=400 it might be true that everyone would have an ancestor ~3-5 generations ago, but at K=20 or K=40, the time until your sample set had 20 or 40 common ancestors might be a lot higher (like 10,000).

You should also take a look at expRate, minRate and maxRate, the recombination rate and its bounds (related to nGen). These are set to defaults appropriate for mice and humans (0.5 cM/Mb, with bounds at 0.1 and 100), but again, they might need to be different for your situation.

If you turn on plot_shuffle_haplotype_attempts, those plots are usually pretty useful for getting a sense of how recombination is affecting things (if you generate one of those plots, I'm happy to take a look and walk you through it). Going from randomly initiated ancestral haplotypes to good ancestral haplotypes requires some powerful heuristics, which often get better with tweaking of parameters like these.

Also, I've seen good results before from dropping shuffle_bin_radius a lot. The default is 5000 and, again, is set up for recombination hotspots like you would see in mice and humans, where recombination is clustered into a small number of hotspots, reflecting a population with only a few active PRDM9 alleles. Your trees likely do not have recombination hotspots, or there are so many of them as to make "hotspots" almost irrelevant. shuffle_bin_radius controls how much smoothing to perform here. With many, many more SNPs you can identify recombination events more easily than in mice/humans, so you don't need to smooth things as much (in fact, over-smoothing is a problem).
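
To put those in one place, here is a rough sketch of the relevant arguments as a named list you could splice into your existing STITCH() call; the values shown are just the current defaults or the rough numbers above, to be tuned, not recommendations:

tuning <- list(
  K = 40,                                  # or 20
  nGen = 10000,                            # generations back to ~K common ancestors; may need to be large
  expRate = 0.5,                           # default expected recombination rate, cM/Mb
  minRate = 0.1,                           # default lower bound on the rate
  maxRate = 100,                           # default upper bound on the rate
  shuffle_bin_radius = 5000,               # default smoothing radius; try dropping this a lot
  plot_shuffle_haplotype_attempts = TRUE   # turns on the diagnostic plots mentioned above
)
# do.call(STITCH, c(base_args, tuning))    # base_args: your usual chr / bamlist / posfile / outputdir etc.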

But yeah, happy to take a look at some plots with you, to try and tweak STITCH to work as well as possible here. Either email back to the list or just to my gmail. This kind of tweaking can be helpful. For example, on work with Frank Chan and colleagues, now on bioRxiv (https://www.biorxiv.org/content/10.1101/2020.05.25.113688v2), for butterflies with very high Ne and recombination rate, we saw substantial improvements in accuracy by changing the default parameters: in that case to shuffle_bin_radius = 100, maxDifferenceBetweenReads = 1e3 (to reduce the impact of each read, as mapping artefacts are high, like with your assembly), shuffleHaplotypeIterations <- c(5, 10, 15, 20, 30, 40) (more of these as useful), niterations = 60 (to accommodate more shuffleHaplotypeIterations), K = 10 (appropriate there), and nGen = 100000 (wild population).
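
In argument form, that butterfly setup was roughly the following (a starting point to adapt, not tree-specific recommendations):

butterfly_settings <- list(
  shuffle_bin_radius = 100,                               # much less smoothing
  maxDifferenceBetweenReads = 1e3,                        # cap the influence of any one read
  shuffleHaplotypeIterations = c(5, 10, 15, 20, 30, 40),  # more shuffle attempts
  niterations = 60,                                       # room for the extra shuffle iterations
  K = 10,                                                 # appropriate there; yours will differ
  nGen = 100000                                           # wild, large-Ne population
)
# do.call(STITCH, c(base_args, butterfly_settings))       # base_args as in your current call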

Best wishes,
Robbie




Josquin Tibbits

Nov 9, 2020, 9:23:05 PM
to STITCH imputation
Hi Robbie,

Thanks so much for your rapid reply. Really appreciated. And, not that it is worth much at all, but you are welcome to use my endorsement if it helps get funding to write a new version more tailored to these situations. That would be super amazing and, I am sure, extremely useful to many people, but probably not so useful in human genetics!

I will start exploring the analysis with K=40 (and 20) and start to test these other parameters. I have had a look at the bioRxiv paper, and this is very useful. I might pester you (or your colleague) for some method details on combining the final VCFs, as that looks to be very neat. For expRate, minRate and maxRate, I think these will be OK as set: the best estimated genetic map length is 1147 cM and the genome is 530 Mbp, equating to roughly 2.2 cM/Mbp (about 0.46 Mbp/cM), which is the same order of magnitude as the human/mouse defaults and well within the default bounds. We don't really have much information on hotspot patterns, so I think I will just have to play with shuffle_bin_radius.

My plan is to get something running reliably while I finish generating all the input data files. I then need to set up some downsampled data and a set of known "TRUTH" loci, and from these I can start the optimisation process. So, I will probably go quiet for a while and then seek your assistance/guidance once I have something worthwhile to ask for help on.
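
For what it's worth, the accuracy check I have in mind is something simple like the following (a rough sketch; the file names are placeholders, and it assumes I have already extracted imputed dosages and truth genotypes into matched SNP-by-sample matrices):

imputed <- as.matrix(read.table("chr9_region_imputed_dosages.txt"))   # placeholder file
truth   <- as.matrix(read.table("chr9_region_truth_genotypes.txt"))   # placeholder file, genotypes coded 0/1/2
stopifnot(all(dim(imputed) == dim(truth)))

# per-SNP squared correlation between imputed dosage and truth genotype
r2_per_snp <- sapply(seq_len(nrow(truth)), function(i) {
  ok <- !is.na(truth[i, ]) & !is.na(imputed[i, ])
  if (sum(ok) < 2 || var(truth[i, ok]) == 0) return(NA_real_)
  cor(imputed[i, ok], truth[i, ok])^2
})
summary(r2_per_snp)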

Warmest regards,  Josquin 
