Cython extensions

Mikhail Korobov

unread,

Mar 27, 2013, 9:47:51 PM3/27/13

to nltk...@googlegroups.com

Hi,

NLTK could become faster with a bit of Cython code; I want to add a Cython extension with small helper functions in order to make nltk.hmm (and probably nltk.probability) faster.

NLTK currently doesn't require C/C++ compiler to install, and I don't want to change this. Also, Cython extensions could be slow or non-working under PyPy. So we need to make the Cython extensions optional.

The common way to handle this is to try building extensions in setup.py and if this fails then install the package without them (example: https://bitbucket.org/ned/coveragepy/src/c6743d87f327081b2fc9cf6ea50f1df924fb03da/setup.py?at=default).

An another way is to create a *separate* pip-installable package (e.g. "nltk_speedups" - better names are welcome) with necessary Cython extensions.

Then, in both cases, basically use the following pattern:

def foo():
# ... pure-Python version

try:
from somewhere import foo as _foo

_foo_ original = foo

foo = _foo

except ImportError:
pass

(this can be automated like this: https://gist.github.com/kmike/5259525 ).

Disadvantage of optional Cython extensions is that we have 2 versions of the same code. There is "pure" mode in Cython (see http://docs.cython.org/src/tutorial/pure.html ) which allows to maintain only a single version of the code, but it doesn't provide all Cython features, and it is not maintained as well as the rest of Cython.

The advantage of a separate package with extensions is that we avoid all possible installation issues (main NLTK works as before and extensions are totally optional), and that the main NLTK repo is not cluttered with generated C code. Disadvantages are that aligning nltk and nltk-speedups versions could become an issue, and that duplicate code is spread into 2 packages instead of one. I think this approach is viable if an extensions package consists of small helper functions (and not the entire algorithms). It is better to keep extensions small anyway (in order to decrease code duplication and simplify maintenance) - but it may be not always possible to extract "hot" parts.

What do you think? Should we introduce Cython speedups to NLTK, and if we should, then how?

Steven Bird

unread,

Mar 27, 2013, 11:02:53 PM3/27/13

to nltk-dev

Mikhail,

I'm very keen on this idea.

I assume the "hot parts" of the code will usually be the innermost loops. What are the obstacles to extracting these? Perhaps it would be any non-trivial data structures? Or is it something else? It might help to look at a couple of cases in more detail, to see whether the lightweight Cython extensions approach is viable.

-Steven Bird

--
You received this message because you are subscribed to the Google Groups "nltk-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nltk-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Mikhail Korobov

unread,

Mar 29, 2013, 7:37:42 AM3/29/13

to nltk...@googlegroups.com

You're right, looking at a couple of cases should help; I'm trying to do this here: https://github.com/kmike/nltk/tree/hmm-experiments

It is quite clear now that most speed gains are from vectorizing code (by using numpy array operations instead of Python loops). So far I get to the point hmm.demo_pos_bw (using 3 labeled, 20 unlabeled sentences, 2 iterations) runs in 5s with a simple Cython extension (which is about 35 times faster than previous 180s) and in 6.5s without it. This is still quite slow in my opinion, and there is room for improvement. Cython extension is very simple: it contain two ten-line general-purpose functions (only one of them is public).

I hope to cut an another second by vectorizing remaining loops, but then things would become toughter: ConditionalProbDist & friends are inpractical for high-speed code (because it is impossible to vectorize them, not because they are slow per se); I'm converting them to matrices now, but back and forth conversion may become a next bottleneck.

четверг, 28 марта 2013 г., 9:02:53 UTC+6 пользователь Steven Bird написал:

Gustavo Zeferino

unread,

Mar 29, 2013, 7:47:06 AM3/29/13

to nltk...@googlegroups.com

I implemented the hmm code using numpy and without using "ConditionalProbDist & friends" and log-space (everything is in log-space). The execution time drop half. One thing that I did different was the way a vector os logs are summed. I use the formula

$\log _b \sum\limits_{i=0}^N a_i = \log_b a_0 + \log_b \left( 1+\sum\limits_{i=1}^N \frac{a_i}{a_0} \right) = \log _b a_0 + \log_b \left( 1+\sum\limits_{i=1}^N b^{\left( \log_b a_i - \log _b a_0 \right)} \right)$

Gustavo Zeferino

2013/3/29 Mikhail Korobov <kmi...@gmail.com>

Mikhail Korobov

unread,

Mar 29, 2013, 10:08:04 AM3/29/13

to nltk...@googlegroups.com

Thanks for sharing.

Currently nltk.tag.hmm uses the same formula; numpy has logaddexp2 for vectorized calculations of this formula (but logaddexp2 can't sum more that 2 logs - in fact, that is what "general-purpose" Cython function for).

пятница, 29 марта 2013 г., 17:47:06 UTC+6 пользователь Gustavo написал:

Reply all

Reply to author

Forward