Collecting nouns from Wikipedia page - encoding errors?

Brendan Griffen

Jun 14, 2013, 10:26:30 AM
to pattern-f...@googlegroups.com
Hi,

I have the following code:

import os, sys; sys.path.insert(0, os.path.join("..", ".."))

from pattern.web import Wikipedia
from pattern.web import Google, plaintext
from pattern.web import SEARCH,URL
from pattern.en  import parse
engine = Wikipedia(language="en")
article = engine.search("alice in wonderland", cached=True, timeout=30)

for s in article.sections:
    print s.title.upper()
    para = s.content
    for word, tag in tag(para,tokenize=True,encoding = 'unicode'):
        if tag == "NN":
            print word,tag


But when I run this, I get the following error:

TypeError: 'unicode' object is not callable

The output from my ipython is:

In [60]: word
Out[60]: u'.'

In [61]: tag
Out[61]: u'.'

The stdout is:

ALICE'S ADVENTURES IN WONDERLAND
novel NN
girl NN
rabbit NN
hole NN
fantasy NN
world NN
tale NN
logic NN
story NN
popularity NN
nonsense NN
genre NN
narrative NN
course NN
structure NN
imagery NN
culture NN
literature NN
fantasy NN
genre NN

BACKGROUND


Changing the loop to just for word, tag in tag(para) doesn't help either. Do you know what setting I have to use? I am using OS X 10.7.5, ipython 0.12.1 and Xcode 2.6.5. I also regularly get an error saying that it can't parse u\u2014, which I believe is a dash (U+2014, the em dash). Is this a common error? The examples provided in the source code don't work for many Wikipedia pages because of this encoding issue. Thanks.

Regards,
Brendan

Francis Nimick

Jun 17, 2013, 11:29:24 AM
to pattern-f...@googlegroups.com
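I'm pretty sure the problem is with this line:
    for word, tag in tag(para,tokenize=True,encoding = 'unicode'):

The problem is that you're unpacking the output of the tag(...) function into a loop variable that is also named tag. After the first section, the name tag refers to a unicode string (the last part-of-speech tag) instead of pattern's tag function, so the next time through the outer loop the call tag(...) tries to call a unicode object. That is why the first section prints fine and the second one fails. Change your loop variable to a different name and it should work.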
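
For illustration, here is a minimal sketch of the rename. It deliberately does not use pattern: the tag function below is a hypothetical toy stand-in that only mimics the (word, part-of-speech) pairs the original loop unpacks, and the sections list and the names pos and shadow are made up for the example. The first loop shows the fix, the second reproduces the failure on a copy of the function so that tag itself stays intact.

```python
def tag(text):
    """Toy tagger (hypothetical stand-in, not pattern.en.tag):
    marks a few known words as NN and everything else as XX."""
    known_nouns = set(["novel", "girl", "rabbit", "hole"])
    return [(word, "NN" if word in known_nouns else "XX") for word in text.split()]

# Stand-in for the content of two article sections.
sections = ["a novel about a girl", "a rabbit in a hole"]

# Fixed version: the loop variable is 'pos', so the name 'tag'
# still refers to the function when the second section is processed.
nouns = []
for content in sections:
    for word, pos in tag(content):
        if pos == "NN":
            nouns.append(word)

print(nouns)  # ['novel', 'girl', 'rabbit', 'hole']

# Reproducing the failure: 'shadow' starts out as the function, but the
# inner loop rebinds it to a string, so the call for the second section fails.
shadow = tag
failed = False
try:
    for content in sections:
        for word, shadow in shadow(content):
            pass
except TypeError:
    failed = True

print(failed)  # True
```

With the real library the same rename should apply, i.e. something like for word, pos in tag(para, ...) with tag imported from pattern.en. If tag has already been overwritten in an ipython session, it presumably needs to be imported again before re-running the loop.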

Francis