Hi,
NLTK could become faster with a bit of Cython code; I want to add a Cython extension with small helper functions in order to make nltk.hmm (and probably nltk.probability) faster.
NLTK currently doesn't require a C/C++ compiler to install, and I don't want to change this. Also, Cython extensions could be slow or not work at all under PyPy. So we need to make the Cython extensions optional.
Another way is to create a *separate* pip-installable package (e.g. "nltk_speedups" - better names are welcome) with the necessary Cython extensions.
Then, in both cases, basically use the following pattern:
    def foo():
        # ... pure-Python version
        pass

    try:
        # use the compiled version when it is available
        from somewhere import foo as _foo
        _foo_original = foo  # keep the pure-Python version reachable
        foo = _foo
    except ImportError:
        pass
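To make this concrete, here is a minimal runnable sketch of the pattern. The module path nltk_speedups._helpers and the helper log_add are made up for illustration; neither exists today, so on a normal install the import fails and the pure-Python version is used.

```python
import math


def log_add(a, b):
    """Pure-Python fallback: log(exp(a) + exp(b)), computed stably."""
    if a == float("-inf"):
        return b
    if b == float("-inf"):
        return a
    hi, lo = (a, b) if a >= b else (b, a)
    return hi + math.log1p(math.exp(lo - hi))


try:
    # Hypothetical compiled helper; this package/module is an assumption.
    from nltk_speedups._helpers import log_add as _fast_log_add
except ImportError:
    USING_SPEEDUPS = False
else:
    _log_add_original = log_add  # keep the pure-Python version for tests
    log_add = _fast_log_add
    USING_SPEEDUPS = True
```

Keeping the original under a private name means the test suite could run against both versions and compare results.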
The disadvantage of optional Cython extensions is that we have 2 versions of the same code. There is a "pure" mode in Cython (see http://docs.cython.org/src/tutorial/pure.html ) which allows maintaining only a single version of the code, but it doesn't provide all Cython features, and it is not maintained as well as the rest of Cython.
The advantages of a separate package with extensions are that we avoid all possible installation issues (main NLTK works as before and extensions are totally optional), and that the main NLTK repo is not cluttered with generated C code. The disadvantages are that keeping the nltk and nltk-speedups versions aligned could become an issue, and that the duplicated code is spread across 2 packages instead of one. I think this approach is viable if the extensions package consists of small helper functions (and not entire algorithms). It is better to keep extensions small anyway (to reduce code duplication and simplify maintenance), but it may not always be possible to extract the "hot" parts.
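One possible way to handle the version alignment issue is to accept the extensions package only when its version matches what the installed NLTK expects, and silently fall back to pure Python otherwise. The sketch below is only an idea: the nltk_speedups name, its __version__ attribute, and the REQUIRED_SPEEDUPS constant are all assumptions, not existing code.

```python
def parse_version(text):
    """Turn '1.2.3' into (1, 2, 3); only plain dotted integers are handled."""
    return tuple(int(part) for part in text.split("."))


# Hypothetical: the major.minor series of the extensions package
# that this NLTK release was tested against.
REQUIRED_SPEEDUPS = (0, 1)


def load_speedups(importer=__import__):
    """Return the extensions module, or None if it is missing or mismatched."""
    try:
        module = importer("nltk_speedups")  # hypothetical package name
    except ImportError:
        return None
    version = getattr(module, "__version__", None)
    if version is None:
        return None
    try:
        major_minor = parse_version(version)[:2]
    except ValueError:
        # Unparseable version string (e.g. '0.1.dev0'): play it safe.
        return None
    return module if major_minor == REQUIRED_SPEEDUPS else None
```

The importer argument is there only so the check can be tested without the real package installed.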
What do you think? Should we introduce Cython speedups to NLTK, and if we should, then how?