bcbio nextgen Germline pipeline

121 views
Skip to first unread message

Brad Wubbenhorst

unread,
Oct 1, 2015, 3:27:17 PM10/1/15
to biovalidation
I am trying to get an understanding what the "best" configuration for bcbio germline variant calling would be. I previously used GATK joint variant calling methods, but in moving into bcbio methods is joint calling still preferred over over ensemble? Is it possible to do both? Last I read freebayes-joint was not fully operational so Im unclear what is working.

If I were doing the ensemble approach it seems redundant(?) to use both GATK-UnifiedGenotyper and GATK-HaplotypeCaller, regardless if they use differing call methods. Would it be worth it to add in something like VarDict instead? I have not seen much benchmarking for that caller with germline only data.

I usually have a large-ish data set (30 to several hundred samples) but actual capture targets are ~3Mb or Exome capture. I know I can not utilize VQSR for the smaller capture, and sometimes it works for the Exome studies depending on data quality. Do need to tell bcbio_nextgen what filtering methods to apply or does it handle that for me? Or does it not apply post call filters at all? Thank you.

-bwubb

Brad Chapman

unread,
Oct 5, 2015, 5:51:54 AM10/5/15
to Brad Wubbenhorst, biovalidation

Brad;
Thanks for the thoughts. I updated the pipeline documentation to help with
these questions, and it should tackle most of the questions you had:

https://bcbio-nextgen.readthedocs.org/en/latest/contents/pipelines.html

but more direct responses below:

> I am trying to get an understanding what the "best" configuration for bcbio
> germline variant calling would be. I previously used GATK joint variant
> calling methods, but in moving into bcbio methods is joint calling still
> preferred over over ensemble? Is it possible to do both? Last I read
> freebayes-joint was not fully operational so Im unclear what is working.

Joint and ensemble calling are two different things. In general, I wouldn't
recommend ensemble unless you have a small number of samples and a specific
need for extra sensitivity. The complexity is not worth the small improvement.

Joint calling is needed for larger batch sizes (50 or more samples) to call
concurrently. I'd use gatk-haplotype-joint with gatk-haplotype for variant
caller unless you don't have access to GATK methods. That scales better than
the custom implemented freebayes-joint method for large sample sizes.

> If I were doing the ensemble approach it seems redundant(?) to use both
> GATK-UnifiedGenotyper and GATK-HaplotypeCaller, regardless if they use
> differing call methods. Would it be worth it to add in something like
> VarDict instead? I have not seen much benchmarking for that caller with
> germline only data.

I would use GATK HaplotypeCaller with joint calling and not worry about
ensemble methods.

> I usually have a large-ish data set (30 to several hundred samples) but
> actual capture targets are ~3Mb or Exome capture. I know I can not utilize
> VQSR for the smaller capture, and sometimes it works for the Exome studies
> depending on data quality. Do need to tell bcbio_nextgen what filtering
> methods to apply or does it handle that for me? Or does it not apply post
> call filters at all? Thank you.

bcbio tries VQSR when you have 50 or more samples and will fall back to GATK's
recommended hard filters if it fails:

https://www.broadinstitute.org/gatk/guide/article?id=2806

So it tries to do the right thing without needing any input from you. VQSR and
the hard filters both perform well in our validations:

http://bcb.io/2014/05/12/wgs-trio-variant-evaluation/

Hope this helps,
Brad

Brad Wubbenhorst

unread,
Oct 5, 2015, 2:50:32 PM10/5/15
to biovalidation
Very big help, Thank You.

Brad Wubbenhorst

unread,
Oct 5, 2015, 3:14:19 PM10/5/15
to biovalidation
From the updated documentation:

"Joint calling – This calls samples independently, then combines them together into a single callset by integrating the individual calls."

I am confused by this. I thought joint calling was the exact opposite. Normal GATK Best Practices (correct me if Ive misunderstood something) would have me provide a list of bams that the caller walks over simultaneously, providing a genotype for every sample; letting the data of all samples inform the caller. 

Or am I reading too much into the "calls samples independently..."? The description sounds like bcbio is just merging independent vcf files into one. If that is true then maybe I should be doing a pooled call, but I do not see what the difference between the methods would be.

Brad Chapman

unread,
Oct 5, 2015, 3:47:32 PM10/5/15
to Brad Wubbenhorst, biovalidation

Brad;
Sorry for the confusion. The nomenclature in this space is not settled and
this post has definitions of the terminology we try to use in bcbio:

http://bcb.io/2014/10/07/joint-calling/

So I think you understand right but that sentence in the docs needs
improvement and leads you to, as you say, read too much into it. What it's
trying to say is that it uses information from all calls but doesn't require
simultaneous access to all BAMs like pooled calling.

If you're able to better describe it in the docs I would be happy to merge a
pull request. Having others read it really helps identify areas where the
writing is not clear. Hope this helps,
Brad
> --
> You received this message because you are subscribed to the Google Groups "biovalidation" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to biovalidatio...@googlegroups.com.
> To post to this group, send email to bioval...@googlegroups.com.
> Visit this group at http://groups.google.com/group/biovalidation.
> For more options, visit https://groups.google.com/d/optout.

Brad Wubbenhorst

unread,
Oct 15, 2015, 12:12:23 PM10/15/15
to biovalidation
Thank you for clearing that up for me. I was hoping I could get some clarification on the yaml structure for a joint-variant calling analysis.

The template yaml is structured as:

details:
 
- analysis: variant2
    lane
:
    description
:
    files
:
    genome_build
: GRCh37
    metadata
:
      batch
: Batch1
    algorithm
:
      aligner
: bwa
      mark_duplicates
: true
      recalibrate
: gatk
      realign
: gatk
      variantcaller
: gatk-haplotype
      jointcaller
: gatk-haplotype-joint
      variant_regions
: /path/to/your.bed



Where as the readthedocs shows as an example:

- description: Sample1
  algorithm
:
    variantcaller
: gatk-haplotype
    jointcaller
: gatk-haplotype-joint
  metadata
:
    batch
: Batch1
- description: Sample2
  algorithm
:
    variantcaller
: gatk-haplotype
    jointcaller
: gatk-haplotype-joint
  metadata
:
    batch
: Batch1




So should I be repeating from the list memebers '-analysis: variant2' (If Im even understanding yaml syntax correctly) or does that just need to be specified once? If the latter, how might I need to alter hyphenation and indentation? Thank you.


Brad Chapman

unread,
Oct 15, 2015, 12:43:35 PM10/15/15
to Brad Wubbenhorst, biovalidation

Brad;
You do need the entire details section repeated for each sample. The example
is trying to demonstrate the relevant sections of the configuration in a small
way, rather than show a full example.

If you find the YAML syntax tricky, your best bet would be to use the template
and add your batch and other parameters into a CSV file and use bcbio's
automated configuration to write the final output file:

https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#automated-sample-configuration

Hope this helps,
Brad

Brad Wubbenhorst

unread,
Oct 15, 2015, 1:47:18 PM10/15/15
to biovalidation
Thank you again.

I have been avoiding your automated configuration because it does not fit well with my personal quirks, project architecture, and data. Really, Ive just written my own because often I have multiple fastq pairs per sample.

What are the limitations for what can be included in a batch? Can I put a fastq file pair for some samples and a pre-aligned bam for others in the same joint calling batch or would it be recommended to do a 'two part' execution where I align every sample first and then write the config file using the bams?

I also have Exome data from two separate sources, which appear to have used different exome capture libraries (eg. Agilent and Nimblegen). Could I combine these samples under the same batch or would it be advised to either run them separately or create some sort of targets.bed from the overlap of the two designs? Just trying to figure out my options.

Brad Chapman

unread,
Oct 18, 2015, 9:28:43 PM10/18/15
to Brad Wubbenhorst, biovalidation

Brad;

> What are the limitations for what can be included in a batch? Can I put a
> fastq file pair for some samples and a pre-aligned bam for others in the
> same joint calling batch or would it be recommended to do a 'two part'
> execution where I align every sample first and then write the config file
> using the bams?

Yes, this is fine. bcbio will do the alignment of those that need it and feed
everything together during variant calling for batching. As long as the
pre-aligned inputs are correctly setup with read groups and the same aligner
as the un-aligned that's all no problem.

> I also have Exome data from two separate sources, which appear to have used
> different exome capture libraries (eg. Agilent and Nimblegen). Could I
> combine these samples under the same batch or would it be advised to either
> run them separately or create some sort of targets.bed from the overlap of
> the two designs? Just trying to figure out my options.

Personally I'd run these separately and avoid mixing different captures. But
it's also fine to either do the union or overlap of the two designs if you
prefer that approach. Either way is fine with bcbio but I really haven't done
this so don't have a good intuition about which way performs better. Sorry to
not have more experience with this but happy to hear how it works for you,
Brad
Reply all
Reply to author
Forward
0 new messages