Re: [nltk-users] fcfg parsing weirdness

peter ljunglöf

Jun 5, 2012, 9:23:29 AM
to nltk-...@googlegroups.com, nltk...@googlegroups.com
Aha, I think you've stumbled upon a bug! So I cc the dev list too.

Apparently the default logic parser (nltk.sem.logic.LogicParser) requires that the argument is a str object, not unicode:

$ grep -n 'str)' logic.py
87: assert isinstance(name, str), "%s is not a string" % name
271: assert isinstance(type_string, str)
1748: assert isinstance(expr, str), "%s is not a string" % expr
1759: assert isinstance(expr, str), "%s is not a string" % expr
1770: assert isinstance(expr, str), "%s is not a string" % expr

And since you used .decode("UTF-8"), you get a unicode string. Unfortunately there's not much you can do until the bug is fixed, except not converting to unicode...
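
A minimal sketch of why the decoded string trips the check. This is not the NLTK source; `passes_str_check` is a hypothetical stand-in for the asserts listed above. Under Python 2, reading the file yields a byte `str` and `.decode("UTF-8")` turns it into `unicode`, so only the undecoded value is an instance of `str`:

```python
# Hypothetical stand-in for the isinstance(..., str) asserts in logic.py.
def passes_str_check(value):
    return isinstance(value, str)


raw = b"N[SEM=<first>] -> 'xian'"  # what open(...).read() returns under Python 2
decoded = raw.decode("UTF-8")      # what the failing call handed to the parser

# Python 2: raw is str (passes), decoded is unicode (fails).
# Python 3: the roles flip, because str is the text type there.
print(passes_str_check(raw), passes_str_check(decoded))
```

Either way, exactly one of the two values satisfies the check, which is why the same grammar behaves differently depending on how it was read.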

best,
Peter

On 31 May 2012 at 02:46, Mat Bettinson wrote:

> Here's a strange one.
>
> cfg = nltk.data.load('file:zhongwen_zhi.fcfg','fcfg',verbose=True,cache=False)
>
> This works okay but the following equivalent:
>
> grammarrules = open("zhongwen_zhi.fcfg").read().decode("UTF-8")
> cfg = nltk.parse_fcfg(grammarrules)
>
> ... produces "AssertionError: first is not a string"
>
> on this grammar line:
>
> N[SEM=<first>] -> 'xian'
>
> Which makes me think calling parse_fcfg(string) is not equivalent to nltk.data.load('file:blah','fcfg').
>
> It should be, shouldn't it?
>
> --
> Regards,
>
> Mat Bettinson
>
>
>
> --
> You received this message because you are subscribed to the Google Groups "nltk-users" group.
> To post to this group, send email to nltk-...@googlegroups.com.
> To unsubscribe from this group, send email to nltk-users+...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/nltk-users?hl=en.

peter ljunglöf

Jun 5, 2012, 9:27:19 AM
to nltk...@googlegroups.com
I've added this as issue #267:

https://github.com/nltk/nltk/issues/267

One simple solution would be to replace all these "isinstance(..., str)" checks with "isinstance(..., basestring)". Is that enough, or are there more things that need to be changed?
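
A rough sketch of what that replacement could look like, assuming the check is factored into a helper. `string_types` and `check_is_string` are hypothetical names, not part of NLTK, and the `NameError` fallback is only there so the snippet also runs where `basestring` does not exist:

```python
try:
    string_types = (basestring,)  # Python 2: covers both str and unicode
except NameError:
    string_types = (str,)         # Python 3: basestring was removed


def check_is_string(expr):
    """Accept any string type instead of insisting on a byte str."""
    assert isinstance(expr, string_types), "%s is not a string" % (expr,)
    return expr


# A decoded (unicode) value is now accepted as well.
check_is_string(b"first".decode("UTF-8"))
```

This only relaxes the input check; it says nothing about what happens later when the accepted value is printed.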

/Peter

Mikhail Korobov

Jun 5, 2012, 10:29:20 AM
to nltk...@googlegroups.com
More things need to be changed. For example, the first assert is about self.name of Variable, and variable.name is used in __str__ and __repr__. These methods must return byte strings under Python 2.x, so unicode in self.name can lead to an exception when a Variable is printed or cast to str (e.g. implicitly while building another object's repr).
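
To make the failure mode concrete: when a Python 2 __str__ returns unicode, the interpreter encodes it with the default ASCII codec behind the scenes. The snippet below performs that encoding step explicitly (the helper name `to_ascii_bytes` is made up for illustration), so it behaves the same on any Python version:

```python
def to_ascii_bytes(text):
    """Do explicitly what Python 2 does implicitly to a unicode __str__ result."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None  # this is where str(obj) would blow up under Python 2


plain = to_ascii_bytes(u"first")          # pure ASCII survives
accented = to_ascii_bytes(u"xi\u0101n")   # a non-ASCII name does not
print(plain, accented)
```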

This problem is not specific to nltk.sem. E.g. nltk.Text is unable to use unicode for the same reason: it needs byte strings in __repr__ and __str__.

The issue is that the encoding __repr__ should return is unknown. We may say 'well, __str__ should always return utf8' and that'll be fine (I think we should do this, but I'm not quite sure, as it may break print under Windows). But __repr__ is different: it should be possible to get a meaningful repr of an object in the console, but console encodings are such a mess. We may use sys.stdout.encoding in __repr__, but this may break __str__ of container types (which use the %r format specifier), and it would break outputting to files.
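
One reason relying on the stream's encoding is fragile: the attribute can be missing or None (Python 2 reports None for sys.stdout when output is piped or redirected). A small sketch, with `stream_encoding` and `FakeStream` as purely illustrative names:

```python
import io


def stream_encoding(stream, default="ascii"):
    """Return the stream's declared encoding, or `default` if it has none."""
    return getattr(stream, "encoding", None) or default


class FakeStream(object):
    """Illustrative stand-in for a console or pipe with a given encoding."""

    def __init__(self, encoding):
        self.encoding = encoding


print(stream_encoding(FakeStream("cp1252")))  # a console that declares one
print(stream_encoding(FakeStream(None)))      # a pipe that declares none
print(stream_encoding(io.BytesIO()))          # an object with no attribute at all
```

Whatever default is chosen, the same repr would then differ between a terminal and a file, which is the inconsistency described above.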

Another possibility is to give up and return non-ascii symbols escaped (e.g. self.__unicode__().encode('unicode-escape')) under Python 2.x in __repr__ (and maybe __str__) so that they won't blow up anything. This would be no worse than the current behavior, because currently non-ascii symbols in __repr__ work only by accident.
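
A minimal sketch of the escaping idea, assuming the name is always held as text (unicode). `EscapedVariable` is a hypothetical class for illustration, not NLTK's Variable, and the extra decode is only needed where encode() returns bytes rather than str:

```python
class EscapedVariable(object):
    """Illustrative class whose repr is always plain ASCII."""

    def __init__(self, name):
        self.name = name  # assumed to be text (unicode), not bytes

    def __repr__(self):
        escaped = self.name.encode("unicode-escape")
        if not isinstance(escaped, str):
            # Python 3: encode() gave bytes, so turn them back into str.
            escaped = escaped.decode("ascii")
        return "EscapedVariable('%s')" % escaped


print(repr(EscapedVariable(u"xi\u0101n")))
```

The output contains only ASCII characters, so it is safe in any console or file, at the cost of showing escape sequences instead of the original symbols.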


On Tuesday, June 5, 2012 at 19:27:19 UTC+6, peter ljunglöf wrote: