Brad;
Thanks for the thoughts. I updated the pipeline documentation to address
these questions, and it should cover most of what you asked:
https://bcbio-nextgen.readthedocs.org/en/latest/contents/pipelines.html
More direct responses are below:
> I am trying to get an understanding what the "best" configuration for bcbio
> germline variant calling would be. I previously used GATK joint variant
> calling methods, but in moving into bcbio methods is joint calling still
> preferred over ensemble? Is it possible to do both? Last I read
> freebayes-joint was not fully operational so I'm unclear what is working.
Joint and ensemble calling are two different things. In general, I wouldn't
recommend ensemble unless you have a small number of samples and a specific
need for extra sensitivity. The complexity is not worth the small improvement.
Joint calling is what you need to call larger batches (50 or more samples)
together. I'd set variantcaller to gatk-haplotype and jointcaller to
gatk-haplotype-joint unless you don't have access to GATK. That scales better
to large sample sizes than the custom freebayes-joint implementation.
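As a rough sketch, the relevant part of a sample configuration would look
something like the fragment below. The sample name, batch name, genome build,
input files and BED path are placeholders I made up for illustration, so adjust
them to your project and check the option names against the current
documentation:

```yaml
# Minimal sketch of one sample entry; repeat for each sample in the batch.
# All names and paths below are hypothetical placeholders.
details:
  - description: sample1
    files: [sample1_R1.fastq.gz, sample1_R2.fastq.gz]
    analysis: variant2
    genome_build: GRCh37
    metadata:
      batch: cohort1                        # samples sharing a batch are joint called together
    algorithm:
      aligner: bwa
      variant_regions: capture_targets.bed  # your capture or exome target regions
      variantcaller: gatk-haplotype         # per-sample calling
      jointcaller: gatk-haplotype-joint     # joint genotyping across the batch
```

Each sample is called on its own and then genotyped together with the others
in its batch, which is what lets this scale to large cohorts.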
> If I were doing the ensemble approach it seems redundant(?) to use both
> GATK-UnifiedGenotyper and GATK-HaplotypeCaller, regardless if they use
> differing call methods. Would it be worth it to add in something like
> VarDict instead? I have not seen much benchmarking for that caller with
> germline only data.
I would use GATK HaplotypeCaller with joint calling and not worry about
ensemble methods.
> I usually have a large-ish data set (30 to several hundred samples) but
> actual capture targets are ~3Mb or Exome capture. I know I can not utilize
> VQSR for the smaller capture, and sometimes it works for the Exome studies
> depending on data quality. Do I need to tell bcbio_nextgen what filtering
> methods to apply or does it handle that for me? Or does it not apply post
> call filters at all? Thank you.
bcbio tries VQSR when you have 50 or more samples and falls back to GATK's
recommended hard filters if VQSR fails:
https://www.broadinstitute.org/gatk/guide/article?id=2806
So it tries to do the right thing without needing any input from you. VQSR and
the hard filters both perform well in our validations:
http://bcb.io/2014/05/12/wgs-trio-variant-evaluation/
Hope this helps,
Brad