FreeBayes Strand bias

1,587 views
Skip to first unread message

Arnaud GUILLE

unread,
May 28, 2013, 3:16:42 AM5/28/13
to free...@googlegroups.com
I would like to compute Strand bias  with a Fisher test to have the same statistic as other variants caller, but unfortunately i don't find the alternate and reference count on each strand in the vcf file.

Thanks in advance

Erik Garrison

unread,
May 28, 2013, 6:44:46 AM5/28/13
to free...@googlegroups.com
These are recorded internally, so outputting them is no problem.  I suppose it wouldn't hurt to add them again--- they can certainly replace the "average mismatches and indels per read" metrics.  i doubt anyone is using those.


On Tue, May 28, 2013 at 3:16 AM, Arnaud GUILLE <arno....@gmail.com> wrote:
I would like to compute Strand bias  with a Fisher test to have the same statistic as other variants caller, but unfortunately i don't find the alternate and reference count on each strand in the vcf file.

Thanks in advance

--
You received this message because you are subscribed to the Google Groups "freebayes" group.
To unsubscribe from this group and stop receiving emails from it, send an email to freebayes+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Michael Manhart

unread,
Aug 8, 2013, 1:23:34 AM8/8/13
to free...@googlegroups.com, erik.g...@bc.edu

Hi,

I am also interested in strand bias, so I'm wondering what is the status of this --- currently I can't see any info fields for strand counts in my output VCF files.  Was this added in a more recent commit?  (I just checked the last several commit messages and couldn't find a description of this.)

Actually, what I really want is to just filter variants in which all the alternate alleles come from the same strand (or, say, a fraction of them above 90% do).  I've been getting a bunch of these in my data.  I can do this myself if the information is output to the VCFs.

Many thanks,
Michael

Micha Bayer

unread,
Aug 8, 2013, 4:21:39 AM8/8/13
to free...@googlegroups.com, erik.g...@bc.edu
Hi,

I would really appreciate having this too. I have been doing a bit of research into the various categories of false positive SNPs and from some preliminary results I have it looks like systematic sequencing error in Illumina data accounts for the largest portion of false positives (~45%). In many cases, these manifest themselves as variant locations where +/- all of the variants are either in the forward reads only, or in the reverse reads only.

These two papers shed some light on this:

http://www.biomedcentral.com/1471-2105/12/451/

http://nar.oxfordjournals.org/content/39/13/e90.full

So this is a significant problem, and it would be great if freebayes could provide filtering for this. For now, I have a fairly ropey workaround that I have implemented myself, which is based on the fact that the base qualities of the alternate alleles at these systematic error locations are consistently lower than those of the reference alleles, usually by about 3-5 units.

Apparently there is more information to be taken into account though -- see the first paper. The authors of that paper have also developed a tool (syscall) that annotates SNPs with regards to whether it thinks they are systematic errors, but it needs SAM files as input, and I am not sure how well it scales. It would be nice if freebayes itself could provide this functionality too -- it's a great tool in every other respect!

Cheers

Micha

============================================
Dr Micha M Bayer
Bioinformatics Specialist
Information and Computational Sciences Group
James Hutton Institute
Invergowrie
Dundee
DD2 5DA
Scotland, UK
T. +44 (0)1382 568844
http://www.hutton.ac.uk/staff/micha-bayer
============================================
> it, send an email to freebayes+...@googlegroups.com <javascript:> .
> For more options, visit
> https://groups.google.com/groups/opt_out
> <https://groups.google.com/groups/opt_out> .
>
>
>
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "freebayes" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to freebayes+...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>


________________________________________________________

This email is from the James Hutton Institute, however the views
expressed by the sender are not necessarily the views of the James Hutton
Institute and its subsidiaries. This email and any attachments are confidential and
are intended solely for the use of the recipient(s) to whom they are addressed.
If you are not the intended recipient, you should not read, copy, disclose or rely on
any information contained in this email, and we would ask you to contact the
sender immediately and delete the email from your system. Although the James
Hutton Institute has taken reasonable precautions to ensure no viruses are present
in this email, neither the Institute nor the sender accepts any responsibility for any
viruses, and it is your responsibility to scan the email and any attachments.

The James Hutton Institute is a Scottish charitable company limited by guarantee.
Registered in Scotland No. SC374831
Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA.
Charity No. SC041796

Erik Garrison

unread,
Aug 8, 2013, 7:43:48 AM8/8/13
to free...@googlegroups.com
Hi,

I'm happy to output counts, apologies that I haven't had a chance to
re-implement this. I had moved away from them because the counts
themselves are not of direct use in filtering, as they have to be
converted to a sampling probability estimate. The counts increase
bloat in the output.

In the interim, please be aware that there are already per-allele
annotations of strand bias in the form of a phred-scaled estimate of
the probability that the sample of +/- strand for the alternate (SAP)
and reference (SRP) would be as extreme as we see under a binomial
distribution. There are several other similar metrics, EPP (end
placement--- that is, what fraction of the time the allele is placed
in the tail of the read, which is also a strong correlate with error),
and RPP (read placement probability, if the reads tend to be mapped to
the left or right of the locus) which you may find useful. These are
all correlated with systematic sequencing errors of various classes.

More importantly, in its default operation, freebayes uses all of
these counts to estimate the mappability and "sequenceability" of the
alleles and locus which it is analyzing. Extremely high bias suggests
problems with observation, and should decrease confidence in the
results. Other methods, such as VQSR and syscall, take these into
account, but only post-hoc. I think freebayes and snpSVM are the only
methods now which use them directly in calling and genotyping.
Unfortunately, this update hasn't been pushed on the arxiv preprint,
so it's only clear if you read the program help text (see "mappability
priors").

In effect, you should see that sites with very high strand bias have
relatively low reported quality (QUAL). If this isn't the case, then
I'm curious.

Michael and Micha: Do the alleles with high strand bias have
comparatively lower QUAL than alleles that are not systematic
sequencing errors? Do you observe a change when setting
--allele-balance-priors-off? Do you see that the alleles that have
high strand bias are also annotated with high SAP (e.g. > 60)?

Please let me know!

Best,
Erik

Erik Garrison

unread,
Aug 8, 2013, 1:16:49 PM8/8/13
to free...@googlegroups.com
For those who are interested in the details of the integration between
the strand bias information and the Bayesian model in freebayes,
please refer to this function:
https://github.com/ekg/freebayes/blob/master/src/Genotype.cpp#L1469

The method runs gradient descent in the space of possible genotypings
when attempting to locate a maximum a posteriori genotyping across all
samples. The effect of integrating bias estimates into this approach
is to drive the called genotypes away from states which involve quite
biased observations. Again, this can be tested by specifying -wVa and
observing changes in results and treatment of known systematic
sequencing errors.

The posterior probability of the model in freebayes is expressed as
P(G,S|D), or the probability of a given genotyping (G) for all samples
and that the locus and alleles are correctly characterized (S) or
"sequenceable." As such, we can find the most-probable genotyping
using something like argmax(P(G)P(S)P(D|G)). P(G) is the prior
probability of the particular genotyping under assumptions neutral
drift, as characterized by the genotype combination sampling
probability (a combinatoric term) and the allele frequency sampling
probability (characterized by the Ewens Sampling Formula). P(D|G) are
the data likelihoods, provided by the observation quality information
from the sequencer and read mapper. P(S) is the term of interest, and
is estimated heuristically using the binomial sampling probabilities
of the counts of +/- strand, read position over the allele, allele
position over the reads, and allele balance in heterozygotes. This
term also includes a discrete sampling probability under HWE, which is
often strongly violated in the case of systematic sequencing errors---
although this may not be completely correct theoretically, as it
should be accounted for in the other terms, I've used it because it
appears to improve performance.

Does this help to clarify freebayes' use of metrics associated with
systematic sequencing error?

Best,
Erik

Michael Manhart

unread,
Aug 8, 2013, 2:04:13 PM8/8/13
to free...@googlegroups.com, erik.g...@bc.edu
Thanks as always, Erik, for your fast response.

I do get low QUAL in my VCFs for the variants that appear to have strand bias.  However, right now I'm not using the Bayesian model, but rather doing heuristic filtering based solely on coverage, base/mapping qualities, and read counts/frequency:

        --pooled-continuous \
        --pooled-discrete \
        -kwVa \
        --min-coverage 30 \
        --min-mapping-quality 20 \
        --min-base-quality 10 \
        --min-alternate-fraction 0.05 \
        --min-alternate-count 2 \

So what I'd like is to also filter variants appearing entirely on one strand --- I've seen something like this in other SNP callers.  Adding it into FreeBayes itself would of course be awesome, but if the per-strand read count information can be included in the VCF file, I can also do it myself.

Michael

Erik Garrison

unread,
Aug 8, 2013, 2:32:55 PM8/8/13
to free...@googlegroups.com
Hi Michael,

Did you have problems using the QUAL value for filtering and
annotating potential variants? Do you mean that the QUAL is low
before specifying -kwVa? I'm just curious why you're using heuristic
filters when you have a metric available that's integrating all the
information which you'd like to use. You might be working in a space
that's new to me, and I'd like to understand what performance problems
people are running up against when running freebayes.

Experience has shown that in most cases it's best to use even
low-quality and low-mapping-quality observations, provided the quality
is decently calibrated. (Note also that mapping quality is directly
incorporated into the data likelihoods now, via the "effective base
depth" heuristic first developed in BCM's SNPTools.)

It may be a few days before I can put the change in. However, I'm
*more* than happy to accept patches to add such features, and I'll
help to add parameters etc. to optionally enable them. The changes
can be made entirely within ResultData.cpp (where counts are used when
generating SAP, EPP, RPP, and friends) and by adding the new field
names to the VCF header which is written via AlleleParser.cpp.

It also is a good time for others to mention metrics which they'd like
in the output that aren't currently there!

Erik

Michael Manhart

unread,
Aug 8, 2013, 6:05:41 PM8/8/13
to free...@googlegroups.com, erik.g...@bc.edu

Hi Erik,

My data comes from an experimentally-evolved yeast population, with separate samples corresponding to populations under different environmental conditions and at different time points.  I've been using heuristic filters because the available Bayesian inference methods to detect variants (as implemented here and elsewhere) don't seem right for this data, since the inference assumes an underlying prior allele distribution (neutral Wright-Fisher) which is definitely inappropriate for this experiment.  The overall time of the experiment is rather short (100s of generations) and selection is strong under some of the conditions; so we expect little divergence except at a small number of loci, which will either have low-frequency mutants (if neutral or only weakly selected) or higher-frequency ones (if strongly selected).

What I meant is that when I use -kwVa, the QUAL for the strand bias locations is low.  But with -kwVa the QUAL field is quite low for many of my variants that I strongly suspect are legitimate, based on them appearing over multiple time points (and some external information I have about the phenotypic effects).

If you think the Bayesian model with priors can still be leveraged in my situation, I'm happy to be hear your thoughts.  It would definitely be more rigorous than a heuristic approach, but I want to make sure that it's well-motivated before I use it.

Michael

Erik Garrison

unread,
Aug 8, 2013, 7:19:18 PM8/8/13
to free...@googlegroups.com
Hi Michael,

I understand. I think that specifying --pooled-discrete and
--allele-balance-priors-off (-Ja) should be sufficient to match your
context. --pooled-continuous just outputs information about
everything that passes your inputs filters, so you're right to retain
that. I would tend to keep thresholds lower at first, and raise them
if you see a really clear noise floor below a given cutoff. The ESF
component of the priors makes no Wright-Fisher assumption, but it's
important in that it links expectations about the population mutation
rate (--theta) to your prior expectation of seeing a variant. The
Wright-Frisher like component of the priors is turned off by
--pooled-discrete; it applies only to genotype frequency
probabilities. Note that you'd do this in a lot of cases, such as
families and pooled sequencing experiments, while leaving the rest of
the priors intact.

You should see much lower QUAL for very biased alleles if the
--binomial-obs-priors are turned on. I think it's likely you already
see this because the things which are systematic errors of this type
tend to also have weak signal in the reported base quality. If you've
used BAQ (probably not necessary), the effect will be stronger.

I would experiment with a number of options. It sounds like an
interesting if difficult space in which to do variant detection. Let
me us know what you find to work the best.

As for the count annotations, a message will go out over this list
when I've committed the change to add the count information to the VCF
output. Should be a few days.

Best,
Erik

micha...@hutton.ac.uk

unread,
Aug 9, 2013, 11:41:44 AM8/9/13
to free...@googlegroups.com, erik.g...@bc.edu
Hi Erik et al,

I have made some quick plots of the SNP quality scores versus SAP and SRP, and also against my own derived measure of base qualities to check out your suggestion -- see below.

The plots on the left are plotting the SNP quality score (QUAL) against SAP and SRP respectively. The plot on the right is plotting QUAL against the percent difference between the mean ref allele base qualities and the mean alt allele base qualities, computed as ((meanQualRefAllele - meanQualAltAllele)/meanQualRefAllele)*100.

The Illumina systematic sequencing errors are associated with consistently lower base qualities in the alternate allele, regardless of whether they are caused by GGC motifs (in which case they do have strand bias) or by inverted repeats (in which case there is no strand bias). I don't really understand how SAP and SRP are computed -- perhaps you could explain. Either way, it seems like there isn't a particularly strong trend visible in the plots of these two variables -- is that because only some Illumina error (see above) is associated with strand bias?

The %diff parameter however gives you a fairly clear signal. The red plot symbols are from the SNPs that have been thrown out anything with a % diff greater than 10%. They all have low quality scores. The SNPs that have passed this filter are in blue and there is a much bigger spread of QUAL values here, but mostly higher. To me this is encouraging as it suggests that QUAL reflects systematic error quite nicely, but the problem in this particular dataset is the huge range of coverage (it's RNAseq) and the associated huge range of QUAL values. I can't easily use QUAL alone to throw out SNPs because some of the lower coverage SNPs have low QUAL (*because* of the lower coverage) but are bona fide and not systematic error.

Tricky business this. Perhaps we can all mull this over some more and come up with some ideas.

cheers

Micha

micha...@hutton.ac.uk

unread,
Aug 9, 2013, 11:46:20 AM8/9/13
to free...@googlegroups.com, erik.g...@bc.edu
Something bad happened to my picture... :-(

Here is another attempt.
plots.png

Erik Garrison

unread,
Aug 9, 2013, 12:29:43 PM8/9/13
to free...@googlegroups.com

Thanks for the analysis of this!  What parameters did you use?  Were they the ones you listed previously?

Michael Manhart

unread,
Aug 10, 2013, 11:01:53 PM8/10/13
to free...@googlegroups.com, erik.g...@bc.edu
Hi Erik,

Thanks for your input.  Out of curiosity, which thresholds would you suggest relaxing?  I actually thought the ones I'm using were already pretty low.  Average coverage is around or above 100 in all my samples, so 30 seemed like a fairly low cutoff.  The alternate allele read count and frequency also seem pretty low already.  The MAPQ threshold looks high, although the aligner only outputs a small number of possibilities for this number (0, 1, 23, or something high like 40).  So there is no difference between setting it at 10 or 20, for instance.  I feel like MAPQ = 0 reads should be eliminated because those are usually reads that mapped to many reference locations (and my reference isn't perfect, so I'm most afraid of alignment artifacts), but perhaps I could include MAPQ = 1.

Michael

Micha Bayer

unread,
Aug 13, 2013, 4:44:46 AM8/13/13
to free...@googlegroups.com
Hi Erik,

these are the parameters I used:

--ploidy 2
--no-population-priors
--min-alternate-count 1
--min-alternate-total 3
--min-alternate-fraction 0.9

I switch off the priors because I work on crop plants and the assumptions about allele frequency don't apply in a mix of cultivars.

Cheers

Micha

============================================
Dr Micha M Bayer
Bioinformatics Specialist
Information and Computational Sciences Group
James Hutton Institute
Invergowrie
Dundee
DD2 5DA
Scotland, UK
T. +44 (0)1382 568844
http://www.hutton.ac.uk/staff/micha-bayer
============================================

> -----Original Message-----
> From: free...@googlegroups.com [mailto:free...@googlegroups.com] On
> Behalf Of Erik Garrison
> Sent: 09 August 2013 17:30
> To: free...@googlegroups.com
> Subject: Re: FreeBayes Strand bias
>
> <http://www.biomedcentral.com/1471-2105/12/451/>
> >
> > http://nar.oxfordjournals.org/content/39/13/e90.full
> <tel:%2B44%20%280%291382%20568844>
> > http://www.hutton.ac.uk/staff/micha-bayer
> <https://groups.google.com/groups/opt_out> .
> https://groups.google.com/groups/opt_out
> <https://groups.google.com/groups/opt_out> .
> >
> >
>
>
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "freebayes" group.
> To unsubscribe from this group and stop receiving emails from it,
> send an email to freebayes+...@googlegroups.com
> <mailto:freebayes%2Bunsu...@googlegroups.com> .
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "freebayes" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to freebayes+...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>


Michael Manhart

unread,
Sep 11, 2013, 4:09:37 PM9/11/13
to free...@googlegroups.com
Hi Erik,

Sorry to bug you, but I was wondering about the status of outputting strand bias information in FreeBayes.  If you're still working on it or you just haven't gotten to it yet, no problem, my apologies, but I just wanted to see if I missed an update somewhere.

Michael

Erik Garrison

unread,
Sep 13, 2013, 6:09:52 PM9/13/13
to free...@googlegroups.com
Hi Michael,

I've added these parameters back to the freebayes output.

What do you think about including the counts for read balance across allele (RPP) and allele cycle position (EPP)?

Please bug away, sometimes I need a reminder :)

Erik


--

Micha Bayer

unread,
Sep 18, 2013, 4:41:07 AM9/18/13
to free...@googlegroups.com
Hi Erik,

I am not sure what allele cycle position is, but I would definitely welcome having the RPP output. This should really help with filtering at least some of the SNPs that are caused by systematic sequencing error in Illumina data.

Thanks also for adding the strand observation counts -- they too will be really useful!

Cheers

Micha

============================================
Dr Micha M Bayer
Bioinformatics Specialist
Information and Computational Sciences Group
The James Hutton Institute
Invergowrie
Dundee
DD2 5DA
Scotland, UK
T. +44 (0)1382 568844
http://www.hutton.ac.uk/staff/micha-bayer
============================================


> -----Original Message-----
> From: free...@googlegroups.com [mailto:free...@googlegroups.com] On
> Behalf Of Erik Garrison
> Sent: 13 September 2013 23:10
> To: free...@googlegroups.com
> Subject: Re: FreeBayes Strand bias
>
> <mailto:freebayes%2Bunsu...@googlegroups.com> .
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "freebayes" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to freebayes+...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.


Reply all
Reply to author
Forward
0 new messages