Problems with NLTK package for Ubuntu


Robin Munn

Feb 8, 2010, 3:55:16 PM
to nltk...@googlegroups.com
How important and/or widely-used is the NLTK support for Mallet?

I ask because the nltk.jar file in nltk-2.0b8.tar.gz (used *only* in
NLTK's Mallet support) is causing me serious problems in trying to get
an NLTK package accepted into Ubuntu. The problem is the following:

1) To get new packages into Ubuntu, you upload the source, not the
binaries. (This allows the uploaded source to be checked for backdoors
and other nasty security-breaking holes; if binaries were accepted,
the package could not be considered secure.) This means that instead
of uploading the nltk.jar file from NLTK's tarball
(nltk-2.0b8.tar.gz), the Ubuntu build system needs to re-create it
from the included javasrc/org/nltk/mallet/*.java files.
2) The javasrc/org/nltk/mallet/*.java files, in order to compile, need
Mallet to be present on the system.
3) Mallet has never been packaged for Debian or Ubuntu, so the
automated build system can't automatically install it before building
nltk.jar.

This adds up to the NLTK package having an FTBFS (Fails To Build From
Source) error that I can't overcome. Any package that Fails To Build
From Source is automatically rejected from Ubuntu, period.

The only two solutions I can see right now are:

1) Get Mallet packaged for Ubuntu within the next week or so. I don't
know if I'll manage to pull this off given how hectic my schedule is
currently.
2) Remove Mallet support from the Ubuntu package as a temporary
stopgap measure, with instructions on how to restore it (e.g.,
"download nltk.jar from this URL and put it in this location on your
filesystem, and suddenly your Mallet support will work").

Thoughts?

--
Robin Munn
Robin...@gmail.com
GPG key 0x4543D577

Robin Munn

Feb 8, 2010, 4:24:37 PM
to nltk...@googlegroups.com
On Mon, Feb 8, 2010 at 2:55 PM, Robin Munn <robin...@gmail.com> wrote:
> The only two solutions I can see right now are:
>
> 1) Get Mallet packaged for Ubuntu within the next week or so. I don't
> know if I'll manage to pull this off given how hectic my schedule is
> currently.
> 2) Remove Mallet support from the Ubuntu package as a temporary
> stopgap measure, with instructions on how to restore it (e.g.,
> "download nltk.jar from this URL and put it in this location on your
> filesystem, and suddenly your Mallet support will work").

On investigating http://mallet.cs.umass.edu/, I discovered that the
version of Mallet that NLTK was built around (Mallet 0.4) is no longer
being actively maintained, and indeed the mallet-0.4.tar.gz file dates
from four years ago, in January 2006. The current version of Mallet,
Mallet 2.0, would not work with NLTK's Mallet support code due to
different class names.

My inclination right now is to go with option #2 in order to get NLTK
into Lucid. That means removing the code from nltk/classify/mallet.py
and replacing it with code that raises a NotImplementedError (or
perhaps issues a DeprecationWarning) saying something like "This
version of NLTK was built without support for Mallet."
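
A minimal sketch of what such a replacement stub could look like. The
function name mallet_unavailable is hypothetical and chosen only for
illustration; it is not taken from the existing NLTK module, and the
real stub would need to cover whatever names callers actually use.

```python
# Hypothetical stand-in for nltk/classify/mallet.py in a build that
# ships without Mallet. Names here are illustrative assumptions.

_MESSAGE = "This version of NLTK was built without support for Mallet."


def mallet_unavailable(*args, **kwargs):
    """Placeholder for any removed Mallet entry point; always raises."""
    raise NotImplementedError(_MESSAGE)
```

Raising an exception (rather than only emitting a warning) makes the
failure obvious at the call site instead of letting it pass silently.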

Once a Mallet package has been built for Debian and/or Ubuntu, it will
be possible to include Mallet support in the NLTK .deb package again.
However, the current version of Mallet is 2.0, and that's probably the
version that would get included in Debian and/or Ubuntu, rather than
the four-year-old (and obsolete) Mallet 0.4 that NLTK currently
relies on. Which means that NLTK's Mallet support needs an update
anyway, and I don't think we'd lose too much by removing Mallet
support from the Ubuntu/Debian versions of NLTK for now.

Steven Bird

Feb 8, 2010, 5:58:01 PM
to nltk-dev
Hi Robin,

Sorry to hear about these difficulties. Yes, please go ahead with
your option #2.

One approach would be to follow nltk/tag/__init__.py, which does a
conditional import of hmm.py if numpy is installed. This way, any
code that imports the tag package but doesn't happen to use hmm
functionality will still run.
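
For readers unfamiliar with the pattern, here is a rough sketch of a
conditional import guarded by an optional dependency. It is not the
actual contents of nltk/tag/__init__.py; the HAVE_NUMPY flag, the
available_taggers helper, and the tagger names are assumptions made up
for illustration.

```python
# Sketch of an optional-dependency import, as it might appear in a
# package's __init__.py. All names below are illustrative.
try:
    import numpy  # optional dependency
except ImportError:
    numpy = None

HAVE_NUMPY = numpy is not None


def available_taggers():
    """List the features usable in this environment."""
    names = ["default", "regexp"]
    if HAVE_NUMPY:
        # Only advertise the numpy-backed feature when numpy imported.
        names.append("hmm")
    return names
```

Code that never touches the numpy-backed feature can import the
package and run whether or not numpy is present.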

However, please do whatever you think best -- it's important to get
this into Lucid!

Thanks,
-Steven (with limited email access in Papua New Guinea)

Robin Munn

Feb 9, 2010, 4:42:16 PM
to nltk...@googlegroups.com
Actually, in testing NLTK's functionality with the Mallet interface
removed, I discovered that it made absolutely no difference -- the
Mallet code was never imported in nltk/classify/__init__.py in the
first place! Likewise, nltk/tag/__init__.py doesn't import anything
from crf.py (the only other place I found in the NLTK code that uses
the Mallet interface). So unless a user specifically tries to do
"import nltk.classify.mallet" or "import nltk.tag.crf", they're never
going to notice that the Mallet interface isn't there.
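
One way to check this kind of thing is to import the package alone and
then look in sys.modules for the submodule. The sketch below uses a
throwaway package (demo_pkg, with submodules loaded and orphan, all
hypothetical names) rather than NLTK itself, so it runs anywhere.

```python
import importlib
import os
import sys
import tempfile


def submodule_loaded_by_package(package, submodule):
    """Import `package`, then report whether `package.submodule` is loaded."""
    importlib.import_module(package)
    return (package + "." + submodule) in sys.modules


# Build a throwaway package whose __init__.py imports only one submodule.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.mkdir(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("from . import loaded\n")
for name in ("loaded", "orphan"):
    with open(os.path.join(pkg_dir, name + ".py"), "w") as f:
        f.write("VALUE = 1\n")
sys.path.insert(0, root)
importlib.invalidate_caches()

print(submodule_loaded_by_package("demo_pkg", "loaded"))  # True
print(submodule_loaded_by_package("demo_pkg", "orphan"))  # False
```

The check is only meaningful in a fresh interpreter: if anything else
in the process has already imported the submodule, it will show up in
sys.modules regardless of what the package's __init__.py does.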

I imagine that this is an oversight rather than a deliberate design
decision, but it means that most people won't even notice any
difference between the Ubuntu and Mac/Windows packages of NLTK. And it
also means that most people using Mac OS X or Windows won't be able to
access Mallet either, so you may want to consider fixing this in the
next NLTK beta release. :-)
