On 27 May 2021, at 20:48, Brian Simison <wbs...@gmail.com> wrote:
Hello, we are analyzing 96 resequenced genomes of a non-model organism, so we do not yet have a "known-sites" list. In the past, we have used GATK4 to generate one by iterating through the GATK BQSR tools, which can take months to complete. The elPrep pipeline asks for a "known-sites" list for its BQSR analyses. Do you have a recommendation for generating a de novo "known-sites" list for non-model taxa for the elPrep BQSR analyses?
I created a flow diagram to help describe the steps I have taken to create a de novo known-sites list, but the approach is prohibitively slow. It requires a minimum of three cycles through the GATK loop. The initial round generates a preliminary known-sites table (VCF format) without any BQSR recalibration. I use the GATK4 tool “AnalyzeCovariates” to visualize and compare the BQSR progress towards “stationarity”.
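For readers without the diagram, the loop described above can be sketched roughly as follows. This is only an illustration of the general bootstrap idea, not the exact workflow in the flow diagram: the function name, the file names, and the default of three rounds are assumptions, and the variant-filtering step that would normally sit between calling and recalibration is left out. The script only builds and prints the GATK4 command lines; it does not run them.

```python
def bootstrap_bqsr_commands(ref, bam, rounds=3):
    """Build GATK4 command lines (argv lists) for a bootstrap BQSR loop.

    All output file names are hypothetical placeholders.
    """
    cmds = []

    # Round 0: call variants on the unrecalibrated BAM to seed known-sites.
    known = "known_0.vcf.gz"
    cmds.append(["gatk", "HaplotypeCaller", "-R", ref, "-I", bam, "-O", known])

    prev_table = None
    for i in range(1, rounds + 1):
        table = f"recal_{i}.table"
        recal_bam = f"recal_{i}.bam"

        # Model base-quality errors, masking the previous round's calls.
        cmds.append(["gatk", "BaseRecalibrator", "-R", ref, "-I", bam,
                     "--known-sites", known, "-O", table])

        # Apply the recalibration table to the original BAM.
        cmds.append(["gatk", "ApplyBQSR", "-R", ref, "-I", bam,
                     "--bqsr-recal-file", table, "-O", recal_bam])

        # Re-call variants on the recalibrated BAM for the next round.
        known = f"known_{i}.vcf.gz"
        cmds.append(["gatk", "HaplotypeCaller", "-R", ref, "-I", recal_bam,
                     "-O", known])

        # From the second round on, compare successive tables for convergence.
        if prev_table is not None:
            cmds.append(["gatk", "AnalyzeCovariates", "-before", prev_table,
                         "-after", table, "-plots", f"round_{i}.pdf"])
        prev_table = table

    return cmds


if __name__ == "__main__":
    for cmd in bootstrap_bqsr_commands("ref.fasta", "sample.bam"):
        print(" ".join(cmd))
```

Each round contains one full HaplotypeCaller pass, which is why the cost of the loop is dominated by variant calling.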
Unfortunately, haplotype calling is extraordinarily slow. Even elprep sfm haplotypecaller takes weeks for just one of our 96 genomes (~2.5Mb/individual), which translates to more than a year to get through all of our samples. This essentially eliminates our ability to do any legitimate base quality recalibration.
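To make the estimate above concrete, here is a back-of-the-envelope calculation. The per-sample runtime is an assumption: "weeks" is not a measured figure, so one week is used as a lower bound, and the `concurrent_jobs` parameter is a hypothetical knob for how many samples could run side by side.

```python
import math


def total_weeks(n_samples, weeks_per_sample, concurrent_jobs=1):
    """Wall-clock weeks for one haplotype-calling pass over all samples."""
    batches = math.ceil(n_samples / concurrent_jobs)
    return batches * weeks_per_sample


if __name__ == "__main__":
    # Sequential runs, assuming a (hypothetical) lower bound of 1 week per genome.
    weeks = total_weeks(96, 1)
    print(f"{weeks} weeks, about {weeks / 52:.1f} years")
```

This covers a single calling pass only; every additional round of the known-sites loop multiplies the total.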
If anybody has an alternate approach, I would be grateful to learn about it.