from pattern.nl import parse, tokenize
# Parse a Dutch sentence with the default tokenizer.
# NOTE: the original used the Python-2-only `print` statement; the
# parenthesized single-argument form below behaves identically on
# Python 2 and is valid Python 3.
print(parse("Zeer vreemd!"))
def my_tokenizer(s):
    """Return a list of sentences, each sentence a list of word tokens.

    Punctuation marks (",", ".", "!") are detached from the preceding
    word by inserting a space, then the text is split into sentences on
    "." and each sentence into tokens on single spaces.
    """
    # Pad each punctuation mark so it becomes its own token.
    for mark in (",", ".", "!"):
        s = s.replace(mark, " " + mark)
    # Sentences are delimited by "."; tokens by single spaces.
    return [chunk.strip().split(" ") for chunk in s.split(".")]
# Feed pre-tokenized input to the parser; tokenize=False tells parse()
# to skip its own tokenization step.
# NOTE: converted from the Python-2-only `print` statement to the
# parenthesized form, which works identically on Python 2 and 3.
print(parse(my_tokenizer("Zeer vreemd!"), tokenize=False))