Add German abbreviation with a blank ("z. B.") to PunktSentenceTokenizer

Philip Gillißen

Oct 27, 2021, 8:22:17 AM
to nltk-users
Dear all,

I'm currently trying to optimize the NLTK PunktSentenceTokenizer in my project. I'm applying the sentence tokenizer to German texts, and it splits sentences incorrectly at German abbreviations like "ca.", "dt.", and "z.B.". I can handle those cases via customized abbreviation types [1].

What I can't get to work is the abbreviation "z. B." with a blank between the two parts. According to the Duden, spelling it with a blank is correct.
Unfortunately, adding it via sentence_tokenizer._params.abbrev_types.update("z. B") does not work, apparently because it is not recognized as a single token.
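
One detail of that call is worth checking independently of NLTK: Python's set.update treats a string argument as an iterable of characters, so a bare string is added character by character rather than as one entry. A minimal sketch, assuming a plain set as a stand-in for the tokenizer's abbreviation set (the local name abbrev_types below is illustrative and is not the tokenizer attribute itself):

```python
# Plain set standing in for the tokenizer's abbreviation set (illustrative only).
abbrev_types = {"ca", "dt"}

# Passing a bare string: set.update iterates over it character by character.
as_string = set(abbrev_types)
as_string.update("z. B")

# Wrapping the string in a list adds it as a single entry.
as_list = set(abbrev_types)
as_list.update(["z. b"])

print(sorted(as_string))
print(sorted(as_list))
```

Even when the entry is added as a single string, that is probably not sufficient on its own: as far as I understand, Punkt splits the text on whitespace before looking up abbreviations, so "z." and "B." would still be seen as two separate tokens.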

I'm considering adding it as a collocation, but I'm not sure whether that's the right way to go.
Can anybody advise me on how to add this abbreviation as an exception for the sentence tokenizer?
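
One possible workaround, sketched here under assumptions rather than as the recommended NLTK approach, is to normalize the text before tokenizing so that "z. B." becomes the whitespace-free "z.B.", which the customized abbreviation types already cover. The helper join_spaced_abbreviations and the mapping SPACED_ABBREVIATIONS are hypothetical names introduced for illustration; they are not part of NLTK.

```python
import re

# Hypothetical mapping from spaced abbreviations to their whitespace-free form.
SPACED_ABBREVIATIONS = {
    "z. B.": "z.B.",
}


def join_spaced_abbreviations(text: str) -> str:
    """Collapse the blank inside known two-part abbreviations (hypothetical helper)."""
    for spaced, joined in SPACED_ABBREVIATIONS.items():
        first, second = spaced.split(" ")
        # Allow any run of whitespace (including a non-breaking space) between the parts.
        pattern = r"\b" + re.escape(first) + r"\s+" + re.escape(second)
        text = re.sub(pattern, joined, text)
    return text


sample = "Viele Sprachen, z. B. Deutsch, nutzen Abkürzungen. Das ist ein Test."
normalized = join_spaced_abbreviations(sample)
print(normalized)
```

The normalized text could then be passed to the sentence tokenizer as before. The trade-off is that the sentences returned no longer preserve the original spelling or character offsets, which may or may not matter for the project.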

Thanks in advance,
Philip
