Collecting nouns from a Wikipedia page - encoding errors?


Brendan Griffen

Jun 14, 2013, 10:26:30 AM

to pattern-f...@googlegroups.com

Hi,

I have the following code:

    import os, sys; sys.path.insert(0, os.path.join("..", ".."))

    from pattern.web import Wikipedia
    from pattern.web import Google, plaintext
    from pattern.web import SEARCH, URL
    from pattern.en import parse, tag

    engine = Wikipedia(language="en")
    article = engine.search("alice in wonderland", cached=True, timeout=30)

    for s in article.sections:
        print s.title.upper()
        para = s.content
        for word, tag in tag(para, tokenize=True, encoding='unicode'):
            if tag == "NN":
                print word, tag

But when I run this, I get the following error:

TypeError: 'unicode' object is not callable

The output from my IPython session is:

In [60]: word

Out[60]: u'.'

In [61]: tag

Out[61]: u'.'

The stdout is:

ALICE'S ADVENTURES IN WONDERLAND
novel NN
girl NN
rabbit NN
hole NN
fantasy NN
world NN
tale NN
logic NN
story NN
popularity NN
nonsense NN
genre NN
narrative NN
course NN
structure NN
imagery NN
culture NN
literature NN
fantasy NN
genre NN
BACKGROUND

Changing the loop to just "for word, tag in tag(para)" doesn't help either. Do you know what setting I have to use? I am using OS X 10.7.5, IPython 0.12.1 and Xcode 2.6.5. I also regularly get an error saying it can't parse \u2014, which I believe is a dash (the em dash character). Is this a common error? The examples provided in the source code don't work for many Wikipedia pages because of this encoding issue. Thanks.

Regards,

Brendan

Francis Nimick

Jun 17, 2013, 11:29:24 AM

to pattern-f...@googlegroups.com

I'm pretty sure the problem is with this line:

for word, tag in tag(para,tokenize=True,encoding = 'unicode'):

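The problem is that the loop unpacks each (word, tag) pair into a variable named tag, which is the same name as the tag(...) function you are calling. After the first section has been processed, tag no longer refers to the function imported from pattern.en; it holds the last part-of-speech string (u'.' in your IPython output). On the next pass through the outer loop, tag(para, ...) tries to call that unicode object, which raises the TypeError. That is also why the first section prints fine and the failure only shows up at BACKGROUND. Rename the loop variable to something else (pos, for example) and it should work.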
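Here is a minimal sketch of the same shadowing problem that runs without pattern installed. Note that fake_tag is a hypothetical stand-in I made up for illustration (it labels every word "NN"); it is not pattern.en's tagger, and the function names nouns_shadowed and nouns_fixed are mine as well. The only point is to show how reusing the name tag as a loop variable breaks the second call.

```python
def fake_tag(text):
    """Hypothetical stand-in for a tagger: returns (word, pos) pairs.

    Every word is labelled "NN" purely for illustration.
    """
    return [(w, "NN") for w in text.split()]


def nouns_shadowed(sections):
    tag = fake_tag  # the name `tag` starts out bound to the function
    found = []
    for para in sections:
        # The unpacking below rebinds `tag` to a string, so the call
        # tag(para) fails when the outer loop reaches the second section.
        for word, tag in tag(para):
            found.append(word)
    return found


def nouns_fixed(sections):
    tag = fake_tag
    found = []
    for para in sections:
        # A different loop variable (`pos`) leaves `tag` bound to the function.
        for word, pos in tag(para):
            if pos == "NN":
                found.append(word)
    return found
```

With a single section nouns_shadowed appears to work, which matches what you saw: the first section printed its nouns and the error only appeared on the next one. With two or more sections it raises a TypeError, while nouns_fixed handles any number of sections.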