known-sites


Brian Simison

May 27, 2021, 2:48:45 PM
to elprep
Hello,
We are analyzing 96 resequenced genomes of a non-model organism, so we do not have a "known-sites" list yet. In the past, we have used GATK4 to generate one by iteratively running through the GATK BQSR tools, which can take months to complete. The elPrep pipeline asks for a "known-sites" list for its BQSR analyses. Do you have a recommendation for generating a de novo "known-sites" list for non-model taxa for the elPrep BQSR analyses?

Charlotte Herzeel (imec)

May 31, 2021, 7:46:49 AM
to Brian Simison, elprep
Hi,



This is not something we have tried before and I honestly cannot tell you how to do it.

That said, the elPrep BQSR option is based on the algorithm from GATK and produces the same results. So I assume you can use elPrep BQSR to generate known sites the same way you used GATK BQSR before. Could you explain how you did this with GATK? Would you need elPrep to add specific parameters for the BQSR option?

Thanks.

Kind regards,
Charlotte



Brian Simison

May 31, 2021, 4:22:16 PM
to elprep

I created a flow diagram to help describe the steps I have taken to create a de novo known-sites list, but the approach is prohibitively slow. It requires a minimum of three cycles through the GATK loop. The initial round generates a preliminary known-sites table (VCF format) without any BQSR recalibration. I use the GATK4 tool “AnalyzeCovariates” to visualize and compare the BQSR progress towards “stationarity”.
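
For readers without the diagram, here is a rough sketch of what such a loop can look like, written as a Python dry run that only builds the GATK4 command lines. It follows the generic bootstrapping idea (call variants, keep confident calls as known sites, recalibrate the original BAM, call again) and is not a transcription of the attached diagram. The single-sample layout, the file names, the `QD < 2.0` filter, and the helper name `bootstrap_commands` are all assumptions for illustration.

```python
import shlex
from typing import List, Optional


def bootstrap_commands(ref: str, bam: str, rounds: int = 3) -> List[List[str]]:
    """Build (but do not run) GATK4 commands for `rounds` bootstrapping cycles."""
    cmds: List[List[str]] = []
    current_bam = bam                   # round 0 calls on the unrecalibrated BAM
    prev_table: Optional[str] = None    # recalibration table from the previous round
    for i in range(rounds):
        raw = f"round{i}.raw.vcf.gz"
        flagged = f"round{i}.flagged.vcf.gz"
        known = f"round{i}.known-sites.vcf.gz"
        table = f"round{i}.recal.table"
        recal_bam = f"round{i}.recal.bam"
        # 1. Call variants on the current BAM.
        cmds.append(["gatk", "HaplotypeCaller", "-R", ref,
                     "-I", current_bam, "-O", raw])
        # 2. Flag low-confidence calls (threshold is an assumption), then drop them.
        cmds.append(["gatk", "VariantFiltration", "-R", ref, "-V", raw,
                     "--filter-expression", "QD < 2.0",
                     "--filter-name", "lowQD", "-O", flagged])
        cmds.append(["gatk", "SelectVariants", "-V", flagged,
                     "--exclude-filtered", "-O", known])
        # 3. Recalibrate the ORIGINAL BAM against this round's known sites.
        cmds.append(["gatk", "BaseRecalibrator", "-R", ref, "-I", bam,
                     "--known-sites", known, "-O", table])
        cmds.append(["gatk", "ApplyBQSR", "-R", ref, "-I", bam,
                     "--bqsr-recal-file", table, "-O", recal_bam])
        # 4. From the second round on, plot this table against the previous one.
        if prev_table is not None:
            cmds.append(["gatk", "AnalyzeCovariates", "-before", prev_table,
                         "-after", table, "-plots", f"round{i}.covariates.pdf"])
        current_bam, prev_table = recal_bam, table
    return cmds


if __name__ == "__main__":
    for cmd in bootstrap_commands("ref.fa", "sample.bam"):
        print(shlex.join(cmd))
```

Each round contains one full HaplotypeCaller pass, which is where nearly all of the wall-clock time goes.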

Unfortunately, haplotype calling is extraordinarily slow. Even elprep sfm haplotypecaller takes weeks for one of our 96 genomes (~2.5Mb/individual), which translates to more than a year to get through our samples. This essentially eliminates our ability to do any legitimate base quality recalibration.
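
To make the "more than a year" estimate concrete, here is a back-of-the-envelope calculation. The post above only says "weeks" per genome, so the figure of 2 weeks is an assumed placeholder, and the samples are assumed to be processed one after another.

```python
GENOMES = 96           # from the post
WEEKS_PER_GENOME = 2   # assumption: the post only says "weeks"
MIN_ROUNDS = 3         # minimum number of cycles through the loop, from the post


def total_years(genomes: int, weeks_per_genome: float, rounds: int = 1) -> float:
    """Sequential wall-clock time in years for `rounds` haplotype-calling passes."""
    return genomes * weeks_per_genome * rounds / 52


one_pass = total_years(GENOMES, WEEKS_PER_GENOME)
all_rounds = total_years(GENOMES, WEEKS_PER_GENOME, MIN_ROUNDS)
print(f"one pass: {one_pass:.1f} years, {MIN_ROUNDS} rounds: {all_rounds:.1f} years")
```

Even at one week per genome, a single sequential pass over 96 samples is 96 weeks, so the estimate holds for any reading of "weeks".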

If anybody has an alternate approach, I would be grateful to learn about it.


GATK_k-sites_pipeline_Ggroups-elprep.png
