Stitch 1.37 Long Runtimes


scott small

Dec 31, 2017, 11:45:24 AM
to STITCH imputation
Hi Dr Davies,

Thanks for an awesome and well documented program.

I am currently experiencing long run times, upwards of 8 days for 40 Mb contigs with around 3 million SNPs for 10 individuals. I am running in parallel using up to 10 cores with the command:

cat chrlist | parallel -P 8 -N 1 './STITCH.R --chr={} --bamlist={}.bamlist --posfile={}.pos --genfile={}.gen --outputdir=./ --K=4 --nGen=100 --nCores=7'  # 8 contigs present in the file

To reduce runtime should I try:

1) subsampling the BAM files to lower coverage (avg coverage is 60x)?

2) removing SNP positions that have no missing data (approx. 98% of sites)?

What effect does K and nGen have on runtime?

thank you,
scott

Robbie Davies

Jan 1, 2018, 12:45:14 PM
to scott small, STITCH imputation
Happy New Year Scott,

Thanks!

3 million SNPs is quite a few for one process, so the easiest way in general to reduce run times may be to parallelize, i.e. use regionStart, regionEnd and buffer, or split "{}.pos" into smaller files and then merge the resulting VCFs together.
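
A rough sketch of the region-based approach, under some assumptions: the helper name make_region_cmds, the contig name contig1, the 5 Mb window and the 250 kb buffer are all made up for illustration and are not recommendations. The other flags are copied from the command in the original question, with nCores set to 1 on the assumption that the parallelism now comes from running one process per window. The script only prints commands; it does not run STITCH.

```shell
#!/usr/bin/env bash
# Print one STITCH command per non-overlapping window of a contig.
# Window and buffer sizes are illustrative assumptions only.
make_region_cmds() {
    local chr=$1 chr_len=$2 window=$3 buffer=$4
    local start=1 end
    while [ "$start" -le "$chr_len" ]; do
        end=$(( start + window - 1 ))
        # Clip the last window to the end of the contig.
        if [ "$end" -gt "$chr_len" ]; then end=$chr_len; fi
        echo "./STITCH.R --chr=${chr} --bamlist=${chr}.bamlist --posfile=${chr}.pos --genfile=${chr}.gen --outputdir=./ --K=4 --nGen=100 --nCores=1 --regionStart=${start} --regionEnd=${end} --buffer=${buffer}"
        start=$(( end + 1 ))
    done
}

# Hypothetical 40 Mb contig split into 5 Mb windows: prints 8 commands.
make_region_cmds contig1 40000000 5000000 250000
```

Since GNU parallel executes each input line as a command when no command is given, the printed lines could be piped into something like "parallel -P 8". The per-window VCFs would then need to be concatenated in positional order.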

re: 1, coverage
60X coverage is rather high for STITCH. Are any of the samples lower coverage? STITCH is most efficient when most of the samples are low coverage, e.g. <2X. By default, STITCH downsamples to 50X coverage (set with --downsampleToCov) to minimize the risk of floating point errors. You could try setting this lower, or manually downsample the BAM files even further, to decrease runtimes.
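
For the manual route, one option is samtools view -s, which takes a value of the form SEED.FRACTION. The sketch below only computes that value and prints a command; the helper name subsample_arg, the 10X target, the seed 42 and the file names are assumptions for illustration. The simpler alternative is to pass a lower value straight to STITCH, e.g. --downsampleToCov=10.

```shell
#!/usr/bin/env bash
# Build the SEED.FRACTION argument for `samtools view -s`.
# Returns non-zero if the sample is already at or below the target coverage.
subsample_arg() {
    local current_cov=$1 target_cov=$2 seed=${3:-42}
    LC_ALL=C awk -v c="$current_cov" -v t="$target_cov" -v s="$seed" 'BEGIN {
        f = t / c                      # fraction of reads to keep
        if (f >= 0.9995) exit 1        # nothing to subsample
        # Keep three decimals of the fraction and prefix the integer seed.
        printf "%d.%s\n", s, substr(sprintf("%.3f", f), 3)
    }'
}

# Hypothetical 60X sample taken down to roughly 10X (prints the command only).
echo "samtools view -b -s $(subsample_arg 60 10) -o sample.10x.bam sample.bam"
```

The subsampled BAM files would need to be indexed and listed in a new bamlist before rerunning STITCH.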

re: 2, missing data 
Missing data should be fine, as long as there are at least some reads spanning each SNP across the samples. Are you just trying to impute missing genotypes in those 2% of sites? Then you could definitely remove non-missing private singletons, and in general further thin your dataset without much loss of accuracy (you may want to keep sites within a certain distance e.g. 500bp of the 2% of SNPs with missing data to help correctly infer which reads go with which haplotypes).
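
A minimal sketch of the distance-based thinning, covering only the "keep sites near the SNPs with missing data" part and not the singleton filter. It assumes a tab-separated position file with the position in column 2, and a separate one-column file of target positions (the SNPs with missing data) that you would have to produce yourself. Both files are assumed to be sorted by position and to cover a single contig; the function name thin_posfile is made up.

```shell
#!/usr/bin/env bash
# Keep only posfile rows lying within `dist` bp of at least one target position.
# Assumes both inputs are sorted ascending and the targets file is non-empty.
thin_posfile() {
    local targets=$1 posfile=$2 dist=${3:-500}
    awk -v d="$dist" '
        BEGIN { i = 1 }
        NR == FNR { t[++n] = $1; next }   # first file: load target positions
        {
            # Skip targets more than d bp to the left of this site; because
            # sites are sorted, those targets can never match a later site.
            while (i <= n && t[i] < $2 - d) i++
            # The first remaining target is the closest candidate on the right.
            if (i <= n && t[i] <= $2 + d) print
        }
    ' "$targets" "$posfile"
}
```

Usage would look like: thin_posfile contig1.missing.txt contig1.pos 500 > contig1.thinned.pos, where the file names are placeholders.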

nGen has no effect on runtime, whereas runtime is quadratic in K (it starts to get really slow above K=40).

Finally, 10 individuals may be too few. You probably need more like one hundred, or ideally several hundred, to see good results from STITCH.

Best wishes,
Robbie

