version 0.20: HMM rewrite

COTRASIF

May 22, 2009, 7:54:32 PM
to cotr...@googlegroups.com
I'm skipping version 0.19 because much more work has been done since
0.18 than between earlier versions.

Version 0.20
==========

The main change in this version is the addition of the HMM search
method to the downloadable binary, which now has "beta" status.
While enabling this functionality, HMM routines have been closely
inspected and rewritten, resulting in somewhat better stability (many
bugs fixed, but some probably added), and much improved speed of
operation.

Downloadable genomewide binaries (further on referred to as
"genomewide") have a known bug: sometimes the same result line is
printed twice. This will be resolved in the next release.

Other changes:

1. A minor bug was found in the calculation of the minimal and maximal
possible scores for the co-occurrence matrix in the HMM-based search
method. As a result, the scores of found TFBSs were in some cases
under-estimated, on the order of 10e-2 to 10e-3 of the normalized
score. This bug is now fixed. In our test case, 4 more genes were
found in a set of 1262 genes after the bug fix, so its effect is
quite negligible.

2. Genomewide binaries were updated to include a copyright notice and
citation information. With respect to functionality, genomewide
binaries are maintained in parallel with the main COTRASIF binaries
and include all the fixes.

3. The HMM search method now supports only integer PFMs, if the user
decides to supply one in addition to the sequences. Fractional PFMs
provide no means of finding the number of sequences they were built
from, and thus cannot be used by the HMM method. An attempt to use a
fractional supplementary PFM for the HMM search will result either in
an error or in undefined results.

4. The PWM method treats integer and fractional matrices differently:
- integer matrices will have pseudocount correction applied, as per
the publication
- fractional matrices (for the same reason as in point 3 above) will
be used without pseudocount correction, i.e.

w(n,i) = log2( freq/p(n) ), where freq is taken directly from the
user-submitted fractional PFM and p(n) is the background probability
of nucleotide n.

For comparison, for integer PFMs without pseudocounts
freq = f(n,i) / N, where f(n,i) is the count of nucleotide n at
position i and N is the number of sequences the PFM was built from.
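
The two uncorrected cases in point 4 can be sketched in Python as
follows. This is only an illustration of the formula quoted above, not
COTRASIF code: the function names, the dict-of-lists PFM layout and the
uniform background p(n) = 0.25 are assumptions, and the pseudocount
correction is left out because its formula is given in the publication,
not in these notes.

```python
import math

NUCLEOTIDES = "ACGT"

def is_integer_pfm(pfm):
    """True if every entry is a whole number, i.e. counts rather than frequencies."""
    return all(float(v).is_integer() for n in NUCLEOTIDES for v in pfm[n])

def pwm_weights(pfm, background=None):
    """w(n,i) = log2(freq / p(n)), with no pseudocount correction.

    `pfm` maps each nucleotide to a list of per-position values
    (hypothetical layout). `background` defaults to a uniform 0.25,
    which is an assumption of this sketch.
    """
    background = background or {n: 0.25 for n in NUCLEOTIDES}
    integer = is_integer_pfm(pfm)
    weights = {n: [] for n in NUCLEOTIDES}
    for i in range(len(pfm["A"])):
        # N (the column sum) is only recoverable from integer counts;
        # fractional values are used directly as frequencies.
        total = sum(pfm[n][i] for n in NUCLEOTIDES) if integer else 1.0
        for n in NUCLEOTIDES:
            freq = pfm[n][i] / total
            # log2(0) is undefined; without pseudocounts a zero
            # frequency maps to minus infinity.
            w = math.log2(freq / background[n]) if freq > 0 else float("-inf")
            weights[n].append(w)
    return weights
```

With these definitions, an integer PFM and the fractional PFM derived
from it by dividing each column by N give the same weights, which is
the comparison made in point 4.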

5. A bug was found (and fixed) in HMM's automatic cut-off estimator
(namely, in the p-value calculation function). As a consequence of the
bug, estimated cut-offs were lower than one would expect based on the
published formula.

6. For the HMM method, as implemented in genomewide binaries, it is
now possible to specify the cut-off manually. This functionality will
be added to the web version soon.

7. HMM's automatic cut-off estimation can no longer go below 0.75;
this limit has been set to prevent run-away searches when the
submitted sequences are not really conserved (i.e. are close to
random). Also, if any of the submitted sequences has a similarity
score of less than 0.75 to the model built from all the submitted
sequences, the automatic cut-off estimator is not engaged at all, and
cut-off=0.75 is used. If you happen to need lower cut-off values (e.g.
for sequences over 20 nucleotides long), you can always use manual
cut-off specification.
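
The decision rule in point 7 can be written down roughly as below. This
is a sketch of the rule as described, not the actual implementation:
the function name, its parameters and the idea that the estimator's
output is passed in as a plain number are all assumptions.

```python
MIN_AUTO_CUTOFF = 0.75  # floor quoted in point 7

def choose_cutoff(similarities, estimated_cutoff, manual_cutoff=None):
    """Pick the HMM cut-off following the rule described in point 7.

    similarities     -- similarity score of each submitted sequence to
                        the model built from all of them (hypothetical input)
    estimated_cutoff -- value produced by the automatic estimator
    manual_cutoff    -- user-specified value; may be lower than 0.75
    """
    # A manual cut-off bypasses the automatic estimator entirely.
    if manual_cutoff is not None:
        return manual_cutoff
    # Poorly conserved input: the estimator is not engaged at all.
    if any(s < MIN_AUTO_CUTOFF for s in similarities):
        return MIN_AUTO_CUTOFF
    # Otherwise use the estimate, but never go below the floor.
    return max(estimated_cutoff, MIN_AUTO_CUTOFF)
```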

8. HMM's function to find the minimal and maximal possible values of a
3D Kx4x4 co-occurrence matrix was enhanced to calculate all possible
matrix states (earlier, a heuristic function was used, which in rare
cases resulted in automatic cut-offs larger than 1.0). As the new
function is CPU-intensive, it will be used only for sequences of 12
nucleotides or fewer (an arbitrary number, which might be increased
later). Longer sequences will use the older heuristic minimax
estimator.
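
To illustrate why enumerating all states can give different bounds
than a shortcut, here is a toy sketch. It rests on assumptions that
these notes do not confirm: that slice k of the Kx4x4 matrix scores the
nucleotide pair at positions (k, k+1), that a sequence's score is the
sum over slices, and that the shortcut simply adds up per-slice
extremes. The real heuristic in COTRASIF may work differently; all
function names are hypothetical.

```python
from itertools import product

EXHAUSTIVE_MAX_LENGTH = 12  # threshold quoted in point 8

def exhaustive_bounds(matrix):
    """Min and max score over every sequence of length K+1 (4**(K+1) states)."""
    k = len(matrix)
    lo, hi = float("inf"), float("-inf")
    for seq in product(range(4), repeat=k + 1):
        # Adjacent slices share a nucleotide, so they cannot be chosen independently.
        score = sum(matrix[j][seq[j]][seq[j + 1]] for j in range(k))
        lo, hi = min(lo, score), max(hi, score)
    return lo, hi

def slicewise_bounds(matrix):
    """Cheap stand-in heuristic: sum each slice's own extreme values.

    This ignores the shared nucleotide between slices, so the range it
    reports can be wider than anything a real sequence can reach.
    """
    lo = sum(min(min(row) for row in sl) for sl in matrix)
    hi = sum(max(max(row) for row in sl) for sl in matrix)
    return lo, hi

def score_bounds(matrix):
    """Use exhaustive enumeration only for short sequences, as in point 8."""
    if len(matrix) + 1 <= EXHAUSTIVE_MAX_LENGTH:
        return exhaustive_bounds(matrix)
    return slicewise_bounds(matrix)
```

For a length-12 sequence the exhaustive version visits 4^12 (about 16.8
million) states, which is consistent with the new function being
described as CPU-intensive.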

9. HMM's main scoring function was optimized, giving a performance
boost of more than 50% for sequences longer than 12 nucleotides. For
shorter sequences, the performance gain is smaller due to point 8
above.

10. The genomewide PWM implementation is now ~40% faster, and scans
the 3.5 GiB repeat-masked human genome in 10 minutes (was: 17 minutes).
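
As a quick check of the figures in point 10, the arithmetic below uses
only the numbers quoted there (3.5 GiB, 17 minutes, 10 minutes); the
variable names are illustrative. It shows that "~40% faster" matches
the reduction in run time, while the throughput gain is larger.

```python
GENOME_GIB = 3.5
OLD_MINUTES, NEW_MINUTES = 17, 10

# Fraction of the old run time that was saved (about 0.41).
time_saved = (OLD_MINUTES - NEW_MINUTES) / OLD_MINUTES

# Relative increase in throughput (about 0.70).
throughput_gain = OLD_MINUTES / NEW_MINUTES - 1

# New scan rate in MiB per second (about 5.97).
mib_per_second = GENOME_GIB * 1024 / (NEW_MINUTES * 60)

print(f"time saved: {time_saved:.0%}")
print(f"throughput gain: {throughput_gain:.0%}")
print(f"scan rate: {mib_per_second:.2f} MiB/s")
```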

Intentions for version 0.21
====================

Version 0.21 will be a maintenance release.

- update promoters database to Ensembl release 54 (E!54)
- update web-interface to allow full-genome search (using genomewide
binaries and Ensembl fasta files)
- display results file size: on status page and in the email
- provide hyperlinked HTML and (gzip/zip/7z ?) archived versions of result files
- database "Statistics" page, with promoter counts and current version
- stricter HMM input checking
- better internal testing (coverage for more test cases), increase stability
- fix: genomewide duplicate result lines
- extend Help page regarding JASPAR vs TRANSFAC differences
- web-interface: allow manual cut-off specification for HMM

Bogdan

May 22, 2009, 8:00:21 PM
to cotrasif
Important note: immediately after updating COTRASIF to the new
version, I've found that PWM task submission sometimes leads to a
blank screen, and task submission fails with no error messages. The
reason is not yet clear. This problem will be resolved as soon as
possible.

Bogdan

Jun 22, 2009, 5:50:04 AM
to cotrasif
The WSOD (White Screen Of Death) issue (a.k.a. 'blank screen issue')
is now fixed.

The cause was the removal of RPCs (remote procedure calls) from the
'interface' server to the 'worker' server, after we got a single
server for both interface and worker.

A blank screen appeared when some of the task parameters had errors,
e.g. the cut-off was lower than 0.4 (which isn't allowed for the web
version), or the email address was invalid. Both PWM and HMM were
affected.