I just played around a bit with PyPy, and NLTK almost works out of the box! At least with PyPy 1.6, which is the latest release and corresponds to Python 2.7.1. Read on: I suggest a tiny change in .../nltk/sourcedstring.py so that NLTK can be imported in PyPy without problems.
First download the default JIT compiler version (from pypy.org) and follow the installation instructions: just move the pypy-1.6 directory to /somewhere/ and make a symlink in /usr/local/bin (or something similar) pointing to /somewhere/pypy-1.6/bin/pypy. Now you should be able to start PyPy:
$ pypy
Python 2.7.1 (dcae7aed462b, Aug 17 2011, 09:46:15)
[PyPy 1.6.0 with GCC 4.0.1] on darwin
Type "help", "copyright", "credits" or "license" for more information.
And now for something completely different: ``PyPy is an exciting technology
that lets you to write fast, portable, multi-platform interpreters with less
effort''
>>>>
Then you need to install yaml. Download it (from pyyaml.org) and run "pypy setup.py install --prefix=/somewhere/pypy-1.6". After this you can test importing yaml (after restarting pypy):
$ pypy
(...)
>>>> import yaml
>>>>
It is important to keep the compiled PyPy libraries in a different place from the CPython ones, since the CPython .pyc format is different from PyPy's. Otherwise you cannot play with both PyPy and CPython.
This means that you have to delete all .pyc files from the NLTK directory before switching to PyPy (and again before switching back):
$ cd .../nltk; find . -name '*.pyc' -delete
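The same cleanup can also be done from Python, which sidesteps shell quoting problems with odd file names. This is only a sketch: the helper name remove_pyc is a placeholder and not something NLTK provides.

```python
import os

def remove_pyc(root):
    """Delete every .pyc file below root and return the removed paths."""
    removed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".pyc"):
                path = os.path.join(dirpath, name)
                os.remove(path)
                removed.append(path)
    return removed
```

Call remove_pyc on the NLTK directory each time you switch interpreter; the returned list shows what was deleted.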
But now if you try to import NLTK, it fails:
>>>> import nltk
Traceback (most recent call last):
(...)
File "/Users/peter/Projekt/NLTK/nltk/nltk/corpus/reader/util.py", line 26, in <module>
from nltk.sourcedstring import SourcedStringStream
File "/Users/peter/Projekt/NLTK/nltk/nltk/sourcedstring.py", line 282, in <module>
class SourcedString(basestring):
TypeError: type 'basestring' is not an acceptable base class
Perhaps this is a bug in PyPy, I don't know. But you can get around it by changing line 282 in .../nltk/sourcedstring.py into:
class SourcedString(object):
With that change, everything just works! So, does this change anything about sourced strings? Well, in principle no, since the classes SourcedString, SimpleSourcedString and CompoundSourcedString are all abstract classes. But I noticed that under PyPy SourcedString.split() does not return a list of SourcedStrings, but just plain strings. I guess this is because the re module behaves differently in PyPy and CPython. But at least the change doesn't affect how sourced strings behave in CPython.
So, my suggestion is that we make this change in sourcedstring.py, and then NLTK works in PyPy! Well, there are things that do not work, but at least NLTK can be imported and the main core parts work, I think. And it doesn't hurt CPython.
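If one wanted to keep basestring as the base class wherever the interpreter accepts it, a fallback along these lines might be an alternative. It is only a sketch and untested against the real sourcedstring.py; usable_base is a hypothetical helper, not part of NLTK.

```python
def usable_base(candidate, fallback=object):
    """Return candidate if it can be subclassed, otherwise fallback."""
    try:
        # Building a throwaway subclass raises the same TypeError
        # that a class statement with this base would raise.
        type("_Probe", (candidate,), {})
    except TypeError:
        return fallback
    return candidate

# Hypothetical use in sourcedstring.py (Python 2, where basestring exists):
#     class SourcedString(usable_base(basestring)):
#         ...
```

The drawback is that the class then gets a different base depending on the interpreter, so simply using object everywhere is the simpler option.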
/Peter
_______________________________________________________________________
peter ljunglöf
department of computer science and engineering
university of gothenburg and chalmers university of technology
> It sounds cool! Do all NLTK unit tests pass under PyPy?
Nope. I haven't tried them all, and some fail for me. But at least these tests pass:
featgram, featstruct, grammar, grammartestsuites, internals, japanese, logic, metrics, misc, parse, simple, stem, tokenize, toolbox, tree, treetransforms, util
Some of them I can't test, since I don't have the required programs installed. Some of them fail, but I don't know if it is because of problems in PyPy or because the doctest is obsolete. But at least probability and tag have problems with PyPy, since the default implementation of Numpy in PyPy lacks some things (such as the class float32).
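To see what a given numpy build lacks, one can simply probe for the names. A rough sketch: missing_attributes is a made-up helper, and the names checked at the end are just examples.

```python
import importlib

def missing_attributes(module_name, names):
    """Return the names that the module does not provide.

    Returns None if the module cannot be imported at all.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return [name for name in names if not hasattr(module, name)]

# Example: check a few numpy names under the current interpreter.
print(missing_attributes("numpy", ["float32", "float64", "array"]))
```

Running this under both CPython and PyPy and comparing the output gives a quick list of what the failing doctests might be tripping over.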
/Peter