parse function with tokenize = False

Enrique Manjavacas

Jan 13, 2015, 10:53:01 AM
to pattern-f...@googlegroups.com
Hey Tom,

I noticed somewhat strange behaviour in the way "parse" works.
It is supposed to take a string, but when passed "tokenize=False" it accepts a list as its argument.

So basically,
>>> parse("varkentje")
u'varkentje/NN/B-NP/O'

>>> parse(["varkentje"])
TypeError: expected string or buffer

>>> parse(["varkentje"], tokenize=False)
u'varkentje/NN/B-NP/O'

I was wondering if this asymmetry in the input parameters is the desired behaviour.
Thanks again for pattern!


Tom De Smedt

Jan 13, 2015, 12:06:45 PM
to pattern-f...@googlegroups.com
Hi Enrique,

With tokenize=False, the input is expected to have been tokenized in advance with a custom tokenizer, so it should be a list of sentences, where each sentence is a list of words. This is what the built-in tokenize() function would also return.

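from pattern.nl import parse, tokenize

print(parse("Zeer vreemd!"))

def my_tokenizer(s):
    """ Returns a list of sentences, where each sentence is a list of words (tokens).
    """
    # Put a space before punctuation so that each mark becomes its own token.
    s = s.replace(",", " ,")
    s = s.replace(".", " .")
    s = s.replace("!", " !")
    # Split into sentences at periods (the period itself is dropped).
    s = s.split(".")
    # split() without arguments discards empty tokens; empty sentences are skipped.
    s = [sentence.split() for sentence in s if sentence.strip()]
    return s

print(parse(my_tokenizer("Zeer vreemd!"), tokenize=False))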

Perhaps it would have been better if tokenize() returned a string with newlines, but changing it now would break backwards compatibility.
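
As a variation on the tokenizer above, here is a minimal sketch that uses a regular expression and keeps sentence-final punctuation as tokens. The name sketch_tokenizer is made up for illustration and is not part of Pattern; it only produces the list-of-lists shape described above, and the parse() call is left commented out because it assumes Pattern is installed.

```python
import re

def sketch_tokenizer(text):
    """Return a list of sentences, each a list of tokens.

    Hypothetical helper for illustration; not part of Pattern.
    """
    # A token is either a run of word characters or a single punctuation mark.
    tokens = re.findall(r"\w+|[^\w\s]", text, re.UNICODE)
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        # Close the sentence at ".", "!" or "?" and keep the mark as a token.
        if token in (".", "!", "?"):
            sentences.append(current)
            current = []
    # Keep trailing words that were not followed by end punctuation.
    if current:
        sentences.append(current)
    return sentences

sentences = sketch_tokenizer("Zeer vreemd! Heel raar.")
print(sentences)

# Assumed usage, following the description above (requires Pattern):
# from pattern.nl import parse
# print(parse(sentences, tokenize=False))
```

Unlike splitting on ".", this keeps the period in the output and never yields an empty sentence, at the cost of treating every ".", "!" or "?" as a sentence boundary (so abbreviations would be split incorrectly).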

Best,
Tom