Version 0.20
============
The main change in this version is the addition of the HMM search
method to the downloadable binary, which now has "beta" status.
While enabling this functionality, HMM routines have been closely
inspected and rewritten, resulting in somewhat better stability (many
bugs fixed, but some probably added), and much improved speed of
operation.
Downloadable genomewide binaries (referred to below as
"genomewide") have a known bug: sometimes the same result line is
printed twice. This will be resolved in the next release.
Other changes:
1. A minor bug was found in the calculation of the minimal and maximal
possible scores for the co-occurrence matrix in the HMM-based search
method. As a result, found TFBS scores were in some cases
under-estimated, on the order of 10e-2 to 10e-3 of the normalized
score. This bug is now fixed. In our test case, 4 more genes were
found in a set of 1262 genes after the fix, so its effect is
negligible.
2. Genomewide binaries were updated to include a copyright notice and
citing information. With respect to functionality, genomewide binaries
are maintained in parallel with the main COTRASIF binaries, and include
all the fixes.
3. The HMM search method now supports only integer PFMs, if the user
decides to supply one in addition to the sequences. Fractional PFMs
provide no means to find the number of sequences they were built
from, and thus cannot be used by the HMM method. An attempt to use
a fractional supplementary PFM for the HMM search will result either
in an error or in undefined results.
4. The PWM method treats integer and fractional matrices differently:
- integer matrices will have pseudocount correction applied, as per
the publication
- fractional matrices (for the same reason as in point 3 above) will
be used without pseudocount correction, i.e.
w(n,i) = log2( freq / p(n) ), where freq is taken directly from the
user-submitted fractional PFM.
For comparison, for integer PFMs without pseudocounts freq = f(n,i) / N,
where f(n,i) is the count of nucleotide n at position i and N is the
number of sequences the PFM was built from.
5. A bug was found (and fixed) in HMM's automatic cut-off estimator
(namely, in the p-value calculation function). As a consequence,
estimated cut-offs were lower than one would expect based on the
published formula.
6. For the HMM method, as implemented in genomewide binaries, it is
now possible to specify the cut-off manually. This functionality will
be added to the web version soon.
7. HMM's automatic cut-off estimation can no longer go below 0.75;
this limit was set to prevent runaway searches when the submitted
sequences are not really conserved (random). Also, if any of the
submitted sequences has a similarity score below 0.75 to the model
built from all the submitted sequences, the automatic cut-off
estimator is not engaged at all, and cut-off=0.75 is used. If you
need lower cut-off values (e.g. for sequences over 20 nucleotides
long), you can always specify the cut-off manually.
8. HMM's function to find the minimal and maximal possible values of
a 3D Kx4x4 co-occurrence matrix was enhanced to calculate all possible
matrix states (earlier, a heuristic function was used, which in rare
cases resulted in automatic cut-offs larger than 1.0). As the new
function is CPU-intensive, it will be used only for sequences of 12
nucleotides or fewer (an arbitrary number, which might be increased
later). Longer sequences will use the older heuristic minimax
estimator.
9. HMM's main scoring function was optimized, giving a performance
boost of over 50% (for sequences longer than 12 nucleotides). For
shorter sequences, the performance gain is smaller due to point 8 above.
10. The genomewide PWM implementation is now ~40% faster, and scans the
3.5GiB human repeat-masked genome in 10 minutes (was: 17 minutes).
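The rules in points 4 and 7 above can be illustrated with a short Python sketch. This is not the COTRASIF code: the function names, the dictionary representation of a PFM, the uniform background p(n) = 0.25 and the use of minus infinity for zero frequencies are all assumptions made for illustration. The pseudocount correction from the publication is deliberately left out, so the integer branch shows only the "without pseudocounts" comparison case.

```python
import math

# Hypothetical uniform background p(n); the real values are not given here.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Lower limit for the automatic cut-off (point 7).
MIN_CUTOFF = 0.75


def pwm_weights(pfm, fractional, background=BACKGROUND):
    """Return w(n,i) = log2(freq / p(n)) without pseudocount correction.

    pfm maps each nucleotide to a list of per-position values.
    fractional=True  : freq is taken directly from the matrix.
    fractional=False : freq = f(n,i) / N, with N the column sum.
    """
    weights = {n: [] for n in pfm}
    for i in range(len(pfm["A"])):
        n_seqs = sum(pfm[n][i] for n in pfm)  # N, used for integer PFMs only
        for n in pfm:
            freq = pfm[n][i] if fractional else pfm[n][i] / n_seqs
            # Assumption: with no pseudocounts, a zero frequency has no
            # finite log-odds, so it is represented as minus infinity.
            w = math.log2(freq / background[n]) if freq > 0 else -math.inf
            weights[n].append(w)
    return weights


def effective_cutoff(estimated, training_scores, floor=MIN_CUTOFF):
    """Apply the 0.75 floor to an automatically estimated cut-off.

    estimated       : value proposed by the automatic estimator.
    training_scores : similarity of each submitted sequence to the model.
    """
    if any(score < floor for score in training_scores):
        # The estimator is not engaged at all; the floor is used.
        return floor
    return max(estimated, floor)


integer_pfm = {"A": [4, 0], "C": [0, 2], "G": [0, 2], "T": [0, 0]}
fractional_pfm = {"A": [1.0, 0.0], "C": [0.0, 0.5], "G": [0.0, 0.5], "T": [0.0, 0.0]}

w_int = pwm_weights(integer_pfm, fractional=False)
w_frac = pwm_weights(fractional_pfm, fractional=True)
```

In this example both matrices describe the same frequencies, so the two calls give identical weights. The difference in the release is only that integer matrices additionally receive the pseudocount correction, which fractional matrices cannot, because N is unknown for them.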
Intentions for version 0.21
===========================
Version 0.21 will be a maintenance release.
- update promoters database to E!54
- update web-interface to allow full-genome search (using genomewide
binaries and Ensembl fasta files)
- display results file size: on status page and in the email
- provide hyperlinked HTML and (gzip/zip/7z ?) archived versions of result files
- database "Statistics" page, with promoter counts and current version
- stricter HMM input checking
- better internal testing (coverage for more test cases), increase stability
- fix: genomewide duplicate result lines
- extend Help page regarding JASPAR vs TRANSFAC differences
- web-interface: allow manual cut-off specification for HMM