Hi,
I'm happy to output counts. Apologies that I haven't had a chance to
re-implement this. I had moved away from them because the counts
themselves are not of direct use in filtering, as they have to be
converted to a sampling probability estimate. They also bloat the
output.
In the interim, please be aware that there are already per-allele
annotations of strand bias, in the form of a phred-scaled estimate of
the probability that the split of +/- strand observations for the
alternate (SAP) and reference (SRP) alleles would be as extreme as we
see under a binomial distribution. There are other similar metrics
which you may find useful: EPP (end placement probability, that is,
what fraction of the time the allele is placed in the tail of the
read, which is also a strong correlate of error) and RPP (read
placement probability, that is, whether the reads tend to be mapped to
the left or right of the locus). These are
all correlated with systematic sequencing errors of various classes.
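To make "phred-scaled probability under a binomial distribution"
concrete, here is a minimal Python sketch. It is an illustration only,
assuming an exact two-sided binomial test with an expected forward
fraction of 0.5; freebayes itself may compute its estimate differently
(for example via a bound rather than an exact tail), and the function
names are made up for this example.

```python
import math

def binomial_pmf(k, n, p=0.5):
    # Log-space evaluation keeps this stable at high read depth.
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def two_sided_strand_p(forward, reverse, p=0.5):
    """Probability of a strand split no more likely than the observed one."""
    n = forward + reverse
    if n == 0:
        return 1.0  # no observations, so no evidence of bias
    observed = binomial_pmf(forward, n, p)
    # Sum every outcome that is at most as likely as what we saw
    # (small relative tolerance so symmetric outcomes count as ties).
    total = sum(pmf for pmf in (binomial_pmf(k, n, p) for k in range(n + 1))
                if pmf <= observed * (1 + 1e-7))
    return min(1.0, total)

def phred(prob):
    # Floor the probability to avoid log10(0) on extreme inputs.
    return -10.0 * math.log10(max(prob, 1e-300))

def strand_bias_phred(forward, reverse):
    # Hypothetical helper: higher values mean a more extreme strand split.
    return phred(two_sided_strand_p(forward, reverse))

if __name__ == "__main__":
    print(strand_bias_phred(10, 10))  # balanced strands, close to 0
    print(strand_bias_phred(20, 0))   # all reads on one strand, about 57.2
```

With this definition, 20 observations all on one strand score about
57, while a 10/10 split scores about 0.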
More importantly, in its default operation, freebayes uses all of
these counts to estimate the mappability and "sequenceability" of the
alleles and locus which it is analyzing. Extremely high bias suggests
problems with observation, and should decrease confidence in the
results. Other methods, such as VQSR and syscall, take these into
account, but only post-hoc. I think freebayes and snpSVM are the only
methods now which use them directly in calling and genotyping.
Unfortunately, this update hasn't been pushed to the arXiv preprint,
so it's only clear if you read the program help text (see "mappability
priors").
In effect, you should see that sites with very high strand bias have
relatively low reported quality (QUAL). If this isn't the case, then
I'm curious.
Michael and Micha: Do the alleles with high strand bias have
comparatively lower QUAL than alleles that are not systematic
sequencing errors? Do you observe a change when setting
--allele-balance-priors-off? Do you see that the alleles that have
high strand bias are also annotated with high SAP (e.g. > 60)?
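If it helps with checking this, a rough Python sketch along these
lines could split records by SAP and compare their QUAL. It assumes an
uncompressed, tab-delimited VCF with SAP in the INFO column (one value
per alternate allele), and it classifies a multi-allelic record by its
largest SAP. The function names and the threshold of 60 are just for
illustration.

```python
import statistics

def parse_info(info_field):
    # Turn "DP=20;SAP=3.1,70.2;FLAG" into a dict; bare flags map to True.
    out = {}
    for item in info_field.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out

def split_qual_by_sap(vcf_lines, sap_threshold=60.0):
    """Return (QUALs of records with max SAP > threshold, QUALs of the rest)."""
    high, low = [], []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8 or fields[5] == ".":
            continue  # malformed record or missing QUAL
        qual = float(fields[5])
        sap = parse_info(fields[7]).get("SAP")
        if sap is None or sap is True:
            continue  # record carries no SAP values
        try:
            max_sap = max(float(v) for v in sap.split(",") if v != ".")
        except ValueError:
            continue  # no usable SAP value on this record
        (high if max_sap > sap_threshold else low).append(qual)
    return high, low

if __name__ == "__main__":
    # Made-up records, purely to show the call; substitute lines from a real VCF.
    example = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "1\t100\t.\tA\tG\t12.5\t.\tDP=30;SAP=75.3",
        "1\t200\t.\tC\tT\t900.0\t.\tDP=40;SAP=3.2",
    ]
    high, low = split_qual_by_sap(example)
    print("median QUAL, high SAP:", statistics.median(high))
    print("median QUAL, low SAP:", statistics.median(low))
```

Running it on the output with and without --allele-balance-priors-off
would show whether the QUAL of the high-SAP records shifts.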
Please let me know!
Best,
Erik