FreeBayes for experimental evolution

877 views
Skip to first unread message

Michael Manhart

unread,
Jan 24, 2013, 7:08:21 PM1/24/13
to free...@googlegroups.com
Hi,

Since I'm quite inexperienced at this point, I'm wondering if FreeBayes is an appropriate tool to find variants in the data I have: 76-bp reads at ~100x coverage in a novel strain of haploid yeast evolved in the lab over 2-3 weeks.  Due to the short time and environmental stress, we expect variants to exist at rather low-frequency but to be largely driven by selection.  I'm asking because most applications I've read about for these variant discovery tools sound very different from mine (mainly diploid, especially human, data from natural populations that have diverged a very long time from the reference, and hence have a lot of neutral variation).

I've been reading over the FreeBayes documentation (as well as the arXiv paper), but I'm confused about the Bayesian model part of this, which I don't yet understand.  But first, is this more complex analysis necessary just to obtain a list of variant sites and estimated qualities, or is it only relevant to estimating the allele frequencies?  To first order I'm mainly interested in a list of these variants (where they occur, how many reads, an estimate of their quality) and not in more sophisticated things like allele frequencies, so I'm happy to do something as similar as possible.

Thanks in advance for any guidance!

Michael

Erik Garrison

unread,
Jan 28, 2013, 6:46:35 PM1/28/13
to free...@googlegroups.com
Hi Michael,

In response to your query (which was by no means the first), I've just implemented a method to use freebayes as a pooled frequency-based caller.  I believe this will bring it in line with your experimental needs.

Here is how you can use freebayes as a frequency-based caller:

    freebayes -f ref.fa --pooled-continuous alignments.bam | vcfkeepinfo - AO RO TYPE
    ....
    chr20_bit       196     .       AT      A       0       .       AO=17;RO=4547;TYPE=del
    chr20_bit       203     .       GA      G       0       .       AO=10;RO=4767;TYPE=del
    chr20_bit       212     .       C       G       0       .       AO=9;RO=5022;TYPE=snp
    chr20_bit       220     .       C       A       0       .       AO=8;RO=5037;TYPE=snp
    chr20_bit       223     .       C       CA      0       .       AO=2;RO=5040;TYPE=ins
    chr20_bit       228     .       T       A       0       .       AO=8;RO=5074;TYPE=snp
    chr20_bit       233     .       T       C       0       .       AO=6;RO=5148;TYPE=snp

Now, you have counts for the reference allele (RO) and each alternate allele (AO) which pass input filters (by default -F 0.2 -C 2).  Make sure to set these lower (e.g. -F 0 -C 1 is completely open) or you will not detect anything below 20% frequency and/or at least 2 observations in the same sample.  That noted, the major way to improve the specificity of the called alleles in this case is to adjust input filters.

You can now use the AO and RO results to establish per-allele, per-site observation frequencies, which should approximate the frequency in your pool.  To make this even easier, I'm considering adding another optional field when --pooled-continuous is specified, EAF (estimated allele frequency).  Do not use the AF field in the output, as this will correspond to the AF under the model.

In some cases, you may find that opening up the filters will slow runtime dramatically by increasing the computational complexity of the analysis.  To avoid this increase, add --report-genotype-likelihood-max.

Additionally, you can use the --max-complex-gap setting to call variable haplotypes defined between alleles not separated by more than a given number of base pairs.  In noisy data, this can help to improve specificity, as your context means you won't be able to rely on the Bayesian model to exclude artifacts.

Notes:
    - vcfkeepinfo is in vcflib/ in the root of the freebayes source tree.)
    - The old "pooled" behavior is retained as --pooled-discrete, and can be mixed with --pooled-continuous.)

Hope this helps!

Best,
Erik

Michael Manhart

unread,
Feb 1, 2013, 6:42:44 PM2/1/13
to free...@googlegroups.com, erik.g...@bc.edu

Hi Erik,

Thank you very much for the response!  This seems to provide the type of information I want, however I do have another question.  Will this produce all variants that pass input filters -- which only filter based on things like coverage, base/mapping quality, etc., i.e., no Bayesian analysis is used?  I think this is my main concern, because I believe the Bayesian model used here (and in other variant callers) is not applicable to my data, since it requires a neutral model as a prior.  So from what I can tell, I just want to filter variants based on simple criteria like those in the FreeBayes input filters, and not anything that comes from a Bayesian analysis.

Thanks for your help with this!

Michael

Michael Manhart

unread,
Feb 5, 2013, 4:48:41 PM2/5/13
to free...@googlegroups.com, erik.g...@bc.edu

Hi Erik,

I think this all looks great for my purposes, but I have a question about vcfkeepinfo -- the source file is in vcflib as you mention, but it looks like it doesn't get compiled/linked by the makefile, so there's no executable.  Do I need to do this myself?

Michael

Erik Garrison

unread,
Feb 5, 2013, 5:09:42 PM2/5/13
to free...@googlegroups.com
Hi Michael,

I think that the version of vcflib in freebayes may be too old to have vcfkeepinfo.  I would just clone vcflib directly from https://github.com/ekg/vcflib.  Please let me know if this works.

Erik

--
You received this message because you are subscribed to the Google Groups "freebayes" group.
To unsubscribe from this group and stop receiving emails from it, send an email to freebayes+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Erik Garrison

unread,
Feb 5, 2013, 6:34:48 PM2/5/13
to free...@googlegroups.com
Hi Michael,

You can turn everything off using -kwVa (--no-population-priors --hwe-priors-off --binomial-obs-priors-off --allele-balance-priors-off).  I would start by including --pooled-discrete --hwe-priors-off in the command line I sent, then add these other options to see the effect.

I think you are right to be concerned that the models don't match your target data, but the prior models in freebayes are tunable to a large amount and I suggest you experiment to see what works best.

In general, your best results will come from using the maximum amount of information available--- both in terms of available data and in terms of available prior models.  (This idea actually has a name: http://en.wikipedia.org/wiki/Principle_of_maximum_entropy).

Erik

Michael Manhart

unread,
Feb 8, 2013, 1:58:47 PM2/8/13
to free...@googlegroups.com, erik.g...@bc.edu

Hi Erik,

The new vcflib worked fine -- thanks for the help!  Just so you know, the makefile in the vcflib that came with freebayes produces this for me:

make: *** No rule to make target `vcfbreakmulti.cpp', needed by `vcfecho'.  Stop.

Perhaps there's something out of date like you suggest.

So if I understand correctly, --pooled-continuous will cause all variants passing filters to be outputted, although this doesn't affect genotype/AFS computation (which doesn't matter anyway if I'm not using VCF fields related to this).  So if I do want to use genotype information, you're suggesting I can try changing the priors to see how sensitive the results are?  I heard from someone else that at my level of coverage (~80x in most places), the priors won't have a strong effect.  It would be interesting to see how true this is.

Also, I'm guessing my data is considered "pooled," since it consists of reads from an entire yeast population mixed together.  (The "individual" vs. "pooled" distinction is mainly relevant to human data, where you might have sequences all from one person's cells vs. sequences from many people's cells pooled together?)  In this case, presumably I should enable --pooled-discrete to reflect this (at least if I want to use genotype information), but this makes me a bit confused about the meaning of four input filters.  Three (--min-alternate-fraction, --min-alternate-count, --min-alternate-qsum) of these say they are limits on individual data, while another (--min-alternate-total) is a limit on the total population data.  Do the first three mean anything for pooled data, when all your reads from the whole population are mixed together?

Thanks,
Michael
Reply all
Reply to author
Forward
0 new messages