Issue 604 in nltk: GermaNet Integration in NLTK


nl...@googlecode.com

Oct 29, 2010, 3:18:52 AM
to nltk-...@googlegroups.com
Status: New
Owner: ----
Labels: Type-Enhancement Priority-Medium

New issue 604 by johannes.maucher: GermaNet Integration in NLTK
http://code.google.com/p/nltk/issues/detail?id=604

We have started a student project to integrate GermaNet into NLTK. The
interface will provide access functions to GermaNet similar to the access
functions available for WordNet. However, there will be two main differences:
1.) GermaNet does not provide as much information as WordNet. Thus the API
will provide only a subset of the WordNet functions.
2.) GermaNet cannot be distributed with NLTK, since each user has to
apply for a license (free for research and education).

In addition we plan to integrate further German-specific functionality into
NLTK, e.g. lemmatisation.

nl...@googlecode.com

May 16, 2011, 4:49:02 PM
to nltk-...@googlegroups.com

Comment #1 on issue 604 by schreon....@googlemail.com: GermaNet Integration
in NLTK
http://code.google.com/p/nltk/issues/detail?id=604

The results of the student project mentioned above have been bundled into one
zip file (see attachment).

The introduction is currently under development. You can view it on this
public google doc: http://goo.gl/cABPy

The zip file contains:

- the current version of the GermaNet Corpus Reader
- the current version of the German Wortschatz Lemmatizer
- technical documentation created via EpyDoc
- the data of Projekt Deutscher Wortschatz that the lemmatizer relies on.
ATTENTION: the data may not be used for commercial purposes. Please visit
http://wortschatz.uni-leipzig.de/ for further information
- doctests

Please note:
- GermaNet Version 5.3 must be obtained manually (NOT the current version
6.0)
- We do not have permission to distribute the GermaNet XML files
ourselves. But it is possible to obtain a free license for educational and
research purposes.
- Please see http://www.sfs.uni-tuebingen.de/GermaNet/ for further
information on licensing

Attachments:
germanltk_bundle.zip 7.5 MB

nl...@googlecode.com

Feb 11, 2012, 10:27:18 AM
to nltk-...@googlegroups.com

Comment #3 on issue 604 by johannes...@gmail.com: GermaNet Integration in
NLTK
http://code.google.com/p/nltk/issues/detail?id=604

In the indicated document (http://goo.gl/cABPy /
https://docs.google.com/document/d/1rdn0hOnJNcOBWEZgipdDfSyjJdnv_sinuAUSDSpiQns/edit?hl=en)
I'd like to see the instructions to install the German Wortschatz
Lemmatizer.
In the chapter German Wortschatz Lemmatizer -> Installation -> Some manual
steps the files to copy are not mentioned.

nl...@googlecode.com

Feb 11, 2012, 4:43:17 PM
to nltk-...@googlegroups.com

Comment #4 on issue 604 by philippn...@gmail.com: GermaNet Integration in
NLTK
http://code.google.com/p/nltk/issues/detail?id=604

hello,

I just tried it myself. The files that need to be copied are the ones
inside the 'stem' directory of the zip file
(GermanWortschatzDBBuilder.py and GermanWortschatzLemmatizer.py).
The destination is <nltk-install-directory>/stem.

Then add 'from GermanWortschatzLemmatizer import *' to the end of the
import list in __init__.py and start up python.

Import the lemmatizer like so:
'from nltk.stem import GermanWortschatzLemmatizer as gwl'

At first launch the tk file chooser GUI should show up. Pick the base
text file that came with the zip.

You may get the error "Could not
open ../baseforms_by_projekt_deutscher_wortschatz.txt". The error message is
actually incorrect. The real problem is that the
folder '../nltk_data/stemmers/GermanWortschatzLemmatizer' doesn't exist and
therefore has to be created manually (this problem may not occur on
Windows).

Sorry for the inconvenience
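
The manual folder creation can also be scripted. This is only a minimal sketch in Python 3 syntax, assuming the build script looks for nltk_data in the user's home directory (a later comment in this thread says it does). The function name ensure_lemmatizer_dir is hypothetical and not part of the bundle.

```python
import os

def ensure_lemmatizer_dir(nltk_data_root=None):
    """Create <nltk_data>/stemmers/GermanWortschatzLemmatizer if it is missing."""
    if nltk_data_root is None:
        # Assumption: the build script expects nltk_data in the home directory.
        nltk_data_root = os.path.join(os.path.expanduser("~"), "nltk_data")
    target = os.path.join(nltk_data_root, "stemmers", "GermanWortschatzLemmatizer")
    # exist_ok=True makes the call safe to repeat (Python 3.2+).
    os.makedirs(target, exist_ok=True)
    return target
```

Calling ensure_lemmatizer_dir() before the first import should give the build script a folder to write into; pass a different root if your nltk_data lives elsewhere.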

nl...@googlecode.com

Feb 12, 2012, 11:04:47 AM
to nltk-...@googlegroups.com

Comment #5 on issue 604 by johannes...@gmail.com: GermaNet Integration in
NLTK
http://code.google.com/p/nltk/issues/detail?id=604

Hi, thanks for the information, it was already useful.

The __init__.py you were referring to is also located in
<nltk-install-directory>/stem

As you said, tk is needed for the next step; python also points this out
with the message
ImportError: No module named _tkinter, please install the python-tk package

The solution is indicated on
http://stackoverflow.com/questions/4783810/install-tkinter-for-python
and helped me on linux with
sudo apt-get install python-tk

Then launching python and
from nltk.stem.snowball import GermanStemmer
opens the mentioned file dialog box.

What is the solution to the "Could not
open ../baseforms_by_projekt_deutscher_wortschatz.txt"?
It exits python.
[I'm running this on Linux and I had created the directory
/usr/local/lib/python2.7/dist-packages/nltk_data/stemmers/GermanWortschatzLemmatizer]

Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.stem.snowball import GermanStemmer
Could not find baseform db ... Running Build Script
user input:
/home/johannes/Downloads/germanltk/baseforms_by_projekt_deutscher_wortschatz.txt
Could not open
/home/johannes/Downloads/germanltk/baseforms_by_projekt_deutscher_wortschatz.txt


Thanks a lot for your help.

nl...@googlecode.com

Feb 13, 2012, 5:18:08 AM
to nltk-...@googlegroups.com

Comment #6 on issue 604 by philippn...@gmail.com: GermaNet Integration in
NLTK
http://code.google.com/p/nltk/issues/detail?id=604

hi,

> The __init__.py you were referring to is also located in
> <nltk-install-directory>/stem

correct.

> What is the solution to the "Could not
> open ../baseforms_by_projekt_deutscher_wortschatz.txt"?

> [I'm running this on Linux and I had created the directory
> /usr/local/lib/python2.7/dist-packages/nltk_data/stemmers/GermanWortschatzLemmatizer]

As I said, the error message is wrong. The script fails to create the
'baseform.db' file but reports that it cannot open the *.txt. However, the
script assumes that your nltk_data directory is located in your home directory.

What you can do is create the directory
/home/johannes/nltk_data/stemmers/GermanWortschatzLemmatizer

and run the build script. After it finishes, copy the 'baseform.db' from there
to the correct location you mentioned under /usr/local/.

If it works you can then delete the directory you created in your home dir.
I hope it works now. The build script is really sloppy.

bye
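
The copy step could look roughly like this. It is a sketch in Python 3 syntax, not code from the bundle: the function name copy_baseform_db is hypothetical, and the two root directories are whatever applies on your machine.

```python
import os
import shutil

def copy_baseform_db(src_root, dst_root):
    """Copy baseform.db from one nltk_data root to the same place under another."""
    rel = os.path.join("stemmers", "GermanWortschatzLemmatizer")
    src = os.path.join(src_root, rel, "baseform.db")
    dst_dir = os.path.join(dst_root, rel)
    os.makedirs(dst_dir, exist_ok=True)  # create the target folder if needed
    # copy2 also preserves timestamps and returns the path of the new file.
    return shutil.copy2(src, dst_dir)
```

With the paths from this thread that would be copy_baseform_db("/home/johannes/nltk_data", "/usr/local/lib/python2.7/dist-packages/nltk_data"), which probably needs root rights for the destination.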

nl...@googlecode.com

Feb 17, 2012, 6:38:27 AM
to nltk-...@googlegroups.com

Comment #7 on issue 604 by johannes...@gmail.com: GermaNet Integration in
NLTK
http://code.google.com/p/nltk/issues/detail?id=604

Hi and thanks,

I'm almost done. I created the directory as you suggested
(/home/johannes/nltk_data/stemmers/GermanWortschatzLemmatizer) and had to run
sudo chmod 777 /home/johannes/nltk_data/ -R
so that the script could create the baseform.db

I copied 'baseform.db' to the folder
/usr/local/lib/python2.7/dist-packages/nltk_data/stemmers/GermanWortschatzLemmatizer/
and then also
/usr/local/lib/python2.7/dist-packages/nltk/stem/
and tried to chmod it, but it cannot find the file 'baseform.db'.

Can you again give me a hint where to put it?

nl...@googlecode.com

Feb 17, 2012, 7:09:35 AM
to nltk-...@googlegroups.com

Comment #8 on issue 604 by philippn...@gmail.com: GermaNet Integration in
NLTK
http://code.google.com/p/nltk/issues/detail?id=604

hello,

you might try the following. Open python:

>>> import nltk.data
>>> nltk.data.find("")

Then you should see the first folder where the nltk package assumes
nltk_data to be.
If the file resides in the corresponding directory, the following should not
throw a LookupError:

>>> import nltk.data, os
>>> nltk.data.find(os.path.join("stemmers","GermanWortschatzLemmatizer", "baseform.db"))

hope this helps
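
If nltk itself cannot be imported at that moment, the same check can be done in plain Python. The sketch below only mimics the idea of walking a list of candidate nltk_data folders; it is not NLTK's implementation, and find_in_search_path is a hypothetical helper name. The two example directories are assumptions; with NLTK importable, the real list is available as nltk.data.path.

```python
import os

def find_in_search_path(relative_path, search_dirs):
    """Return the first existing match of relative_path under search_dirs, else None."""
    for base in search_dirs:
        candidate = os.path.join(base, relative_path)
        if os.path.exists(candidate):
            return candidate
    return None

resource = os.path.join("stemmers", "GermanWortschatzLemmatizer", "baseform.db")
# Example directories only; replace them with the folders on your system.
dirs = [os.path.join(os.path.expanduser("~"), "nltk_data"), "/usr/share/nltk_data"]
print(find_in_search_path(resource, dirs))
```

A result of None means the file is in none of the listed folders, which is the situation in which the lemmatizer starts its build script.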

nl...@googlecode.com

Feb 18, 2012, 7:23:04 AM
to nltk-...@googlegroups.com

Comment #9 on issue 604 by johannes...@gmail.com: GermaNet Integration in
NLTK
http://code.google.com/p/nltk/issues/detail?id=604

Hi,

>>> import nltk.data


Could not find baseform db ... Running Build Script

It doesn't let me import nltk.data even :-(

nl...@googlecode.com

Feb 18, 2012, 9:10:26 AM
to nltk-...@googlegroups.com

Comment #10 on issue 604 by philippn...@gmail.com: GermaNet Integration in
NLTK
http://code.google.com/p/nltk/issues/detail?id=604

you might try to comment out the import you added to the __init__.py in
stem. ಠ_ಠ

nl...@googlecode.com

Feb 18, 2012, 11:23:12 AM
to nltk-...@googlegroups.com

Comment #11 on issue 604 by johannes...@gmail.com: GermaNet Integration in
NLTK
http://code.google.com/p/nltk/issues/detail?id=604

Hi,
I removed the line from <nltk-install-directory>/stem/__init__.py
and then got the message as you predicted:

LookupError:
**********************************************************************
Resource '' not found. Please use the NLTK Downloader to obtain
the resource: >>> nltk.download().
Searched in:
- '/home/johannes/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'


So I used the first directory /home/johannes/nltk_data
I created it, along with the subfolders
/home/johannes/nltk_data/stemmers/GermanWortschatzLemmatizer/
and moved baseform.db there.

In the file /usr/local/lib/python2.7/dist-packages/nltk/stem/__init__.py, I
kept commented out:
# from GermanWortschatzLemmatizer import *

Launched python 2.7


from nltk.stem import GermanWortschatzLemmatizer as gwl

Then I continued with the instructions on http://goo.gl/cABPy

And finally it works :-)


Thanks for your continuous help!


nl...@googlecode.com

Feb 19, 2012, 12:18:21 PM
to nltk-...@googlegroups.com

Comment #12 on issue 604 by johannes...@gmail.com: GermaNet Integration in
NLTK
http://code.google.com/p/nltk/issues/detail?id=604

Me again … I'm having a problem with Unicode.

>>> token1 = u'DANN'
>>> token1
u'DANN'
>>> gwl.lemmatize(token1)
'DANN'

>>> token2 = u'FR\xc4ULEIN'
>>> token2
u'FR\xc4ULEIN'
>>> gwl.lemmatize(token2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>

File "/usr/local/lib/python2.7/dist-packages/nltk/stem/GermanWortschatzLemmatizer.py",
line 66, in lemmatize
derived = (derived.decode(_INPUT_ENCODING)).encode('utf-8')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc4' in
position 2: ordinal not in range(128)

>>> token3 = "FRÄULEIN"
>>> token3
'FR\xc3\x84ULEIN'
>>> gwl.lemmatize("FRÄULEIN")
'FR\xc3\x84ULEIN'


Can you tell me how to convert
u'FR\xc4ULEIN' into
"FRÄULEIN" instead of into
'FR\xc3\x84ULEIN'
?




nl...@googlecode.com

Feb 19, 2012, 1:45:39 PM
to nltk-...@googlegroups.com

Comment #13 on issue 604 by philippn...@gmail.com: GermaNet Integration in
NLTK
http://code.google.com/p/nltk/issues/detail?id=604

The script tries to guess the INPUT and OUTPUT encoding when you import the
lemmatizer. Since you explicitly assign a unicode string to your token, that
guess might be wrong. If you want to work with explicit encoding settings,
you should call the setInputEncoding(encoding) and
setOutputEncoding(encoding) methods the lemmatizer offers before you start
to work.

Aside from that, I'd suggest reading GermanWortschatzLemmatizer.py,
because that's what I just did to figure out the problem. It is less than
100 lines of code and pretty much self-explanatory.
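
On the conversion question itself: u'FR\xc4ULEIN' and 'FR\xc3\x84ULEIN' are the same word in two forms, a text string and its UTF-8 encoded bytes. A short sketch in Python 3 syntax (the thread uses Python 2, where the text type is written u'...' and a plain '...' holds bytes). It does not call the lemmatizer at all.

```python
text = "FR\xc4ULEIN"          # text string; \xc4 is the code point of Ä
utf8 = text.encode("utf-8")   # bytes; Ä becomes the two bytes \xc3\x84

print(ascii(text))            # 'FR\xc4ULEIN'
print(utf8)                   # b'FR\xc3\x84ULEIN'
print(len(text), len(utf8))   # 8 9

# Decoding with the matching codec restores the original text.
assert utf8.decode("utf-8") == text
```

So there is nothing further to convert; whether Ä or an escape sequence is displayed depends on how the value is printed.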

nl...@googlecode.com

Feb 21, 2012, 9:41:40 AM
to nltk-...@googlegroups.com

Comment #14 on issue 604 by johannes...@gmail.com: GermaNet Integration in
NLTK
http://code.google.com/p/nltk/issues/detail?id=604

So, my issue was about the python 2.7 default encoding, which is 'ascii'
instead of 'utf-8' (which I need).
So I did
sudo vi /etc/python2.7/sitecustomize.py
and added
sys.setdefaultencoding('utf-8')
at the bottom
(http://stackoverflow.com/questions/7105441/how-to-set-default-encoding-in-pythonsetdefaultencoding-function-does-not-exist
and http://farmdev.com/talks/unicode/)

It works now!
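
Changing the default encoding affects every module in the interpreter. An alternative, offered here only as a sketch, is to convert each token explicitly before handing it on. The helper name to_utf8_bytes is hypothetical, the code uses Python 3 syntax, and it has not been tried against the lemmatizer from the bundle.

```python
def to_utf8_bytes(token, input_encoding="utf-8"):
    """Return token as UTF-8 encoded bytes, whether it arrives as text or bytes."""
    if isinstance(token, bytes):
        # Bytes input: decode it with the encoding it was written in first.
        token = token.decode(input_encoding)
    return token.encode("utf-8")

print(to_utf8_bytes("FR\xc4ULEIN"))              # b'FR\xc3\x84ULEIN'
print(to_utf8_bytes(b"FR\xc4ULEIN", "latin-1"))  # b'FR\xc3\x84ULEIN'
```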

nl...@googlecode.com

Feb 27, 2012, 6:01:32 AM
to nltk-...@googlegroups.com

Comment #15 on issue 604 by m.schlo...@gmail.com: GermaNet Integration in
NLTK
http://code.google.com/p/nltk/issues/detail?id=604

I tried to set up the German stemmer as well, but I got some weird errors.
When I import the GermanWortschatzLemmatizer for the first time, the file
dialog opens and I point it at the txt file. But the file the database builder
produces is called 'baseform.db.db' instead of 'baseform.db'. If I change
the name to the correct one I am able to import the lemmatizer, but
whenever I try to lemmatize a word I get the following error:

File "/Users/.../Library/Python/2.7/lib/python/site-packages/nltk/stem/GermanWortschatzLemmatizer.py",
line 68, in lemmatize
base = db.get(derived,derived)
AttributeError: get

I am working on a mac 10.7 with python 2.7.1
Does anyone have a clue what I am doing wrong?

nl...@googlecode.com

Feb 27, 2012, 9:49:34 AM
to nltk-...@googlegroups.com

Comment #16 on issue 604 by philippn...@gmail.com: GermaNet Integration in
NLTK
http://code.google.com/p/nltk/issues/detail?id=604

I'm not sure why this is happening and I can't test since I don't use a
mac. So the following is only guesswork. (I have no clue about the weird
naming issue)

Try this code in the python interpreter

>>> import anydbm
>>> db = anydbm.open('test', 'c')
>>> dir(db)

Then you can see the attributes of an anydbm object. See if it
contains "get".
You can also play around a little, like so:

>>> db['test1'] = 'hello'
>>> db.get('test1','')

If the same error occurs again, consider installing a python runtime from
python.org

http://python.org/download/
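
For anyone repeating this check on Python 3: anydbm was renamed to dbm there. The sketch below does the same test and also lists the files the backend actually created. Some dbm backends append their own extension to the name they are given, which is one possible explanation for 'baseform.db.db', though nothing in this thread confirms it. The temporary directory and the name "test" are only for illustration.

```python
import dbm
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "test")

db = dbm.open(path, "c")       # "c": create the database if it does not exist
db[b"test1"] = b"hello"
value = db.get(b"test1", b"")  # same kind of lookup as the failing db.get call
db.close()

print(value)                        # b'hello'
print(dbm.whichdb(path))            # backend module name; varies by platform
print(sorted(os.listdir(workdir)))  # files on disk; some backends add extensions
```

If the listing shows a name other than plain "test", the backend on that machine adds a suffix of its own.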

nl...@googlecode.com

Feb 27, 2012, 11:25:29 AM
to nltk-...@googlegroups.com

Comment #17 on issue 604 by m.schlo...@gmail.com: GermaNet Integration in
NLTK
http://code.google.com/p/nltk/issues/detail?id=604

Thank you for this hint, but everything with anydbm seems to be ok.

"get" is there and when I use it, as you suggested in your second code
example, everything is working fine. Any other guesses?

nl...@googlecode.com

May 21, 2014, 7:38:19 AM
to nltk-...@googlegroups.com

Comment #18 on issue 604 by m.schafz...@gmail.com: GermaNet Integration in
NLTK
http://code.google.com/p/nltk/issues/detail?id=604

On Windows, change "APPDATA" to "USERPROFILE" for the default nltk_data folder
in GermanWortschatzDBBuilder.py:

APPDATA = os.environ['USERPROFILE']
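
A platform-independent variant might avoid the environment variable names altogether. This is only a sketch: default_nltk_data_dir is a hypothetical helper, and whether GermanWortschatzDBBuilder.py can use it without further changes has not been checked.

```python
import os

def default_nltk_data_dir():
    """Return <home>/nltk_data using the platform's notion of the home directory."""
    # expanduser("~") normally resolves via USERPROFILE on Windows and HOME on POSIX.
    return os.path.join(os.path.expanduser("~"), "nltk_data")

print(default_nltk_data_dir())
```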
