Add German abbreviation with a blank ("z. B.") to PunktSentenceTokenizer

Philip Gillißen

Oct 27, 2021, 8:22:17 AM

to nltk-users

Dear all,

I'm currently trying to optimize the NLTK PunktSentenceTokenizer in my project. I'm applying the sentence tokenizer to German texts, and it splits sentences incorrectly at German abbreviations like "ca.", "dt.", and "z.B.". Those cases I can handle via customized abbreviation types [1].

What I can't get to work is the abbreviation "z. B." with a blank between the two parts. According to the Duden, spelling it with a blank is correct.

Unfortunately, adding it via sentence_tokenizer._params.abbrev_types.update("z. B") does not work, as it does not seem to be recognized as a single token.
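One detail that may be contributing here, independent of Punkt itself: abbrev_types is a Python set, and set.update() iterates over its argument, so passing a bare string adds each character separately instead of the whole string. A minimal sketch, using a plain set as a stand-in for sentence_tokenizer._params.abbrev_types (the variable names are only illustrative):

```python
# Plain set standing in for sentence_tokenizer._params.abbrev_types.
abbrev_types = set()

# A bare string is iterated character by character.
abbrev_types.update("z. B")
chars_added = set(abbrev_types)   # {'z', '.', ' ', 'B'}

# Wrapping the string in a list adds it as one entry.
abbrev_types = set()
abbrev_types.update(["z. B"])
whole_added = set(abbrev_types)   # {'z. B'}
```

Even when added as one entry, "z. B" would presumably still not match, since the tokenizer appears to look at whitespace-separated tokens and would see "z." and "B." separately.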

I'm considering adding it as a collocation, but I'm not sure whether that's the right way to go.
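A possible alternative to a collocation, sketched here only as an assumption and not as something Punkt provides: normalize the spaced spelling to the unspaced one ("z. B." to "z.B.") before tokenizing, so that the abbreviation types that already work for "z.B." apply. The helper name collapse_spaced_abbreviations and the list of abbreviations below are hypothetical and use only the standard library.

```python
import re

# Illustrative list of spaced abbreviations to normalize (an assumption).
SPACED_ABBREVIATIONS = ["z. B.", "d. h.", "u. a."]


def _build_pattern(abbrevs):
    # "z. B." becomes the regex z\.\s+B\. so any whitespace run is matched.
    alternatives = [
        r"\s+".join(re.escape(part) for part in abbrev.split())
        for abbrev in abbrevs
    ]
    # The lookbehind avoids matching inside a longer word such as "Satz. B.".
    return re.compile(r"(?<!\w)(?:" + "|".join(alternatives) + r")")


_PATTERN = _build_pattern(SPACED_ABBREVIATIONS)


def collapse_spaced_abbreviations(text):
    """Remove the whitespace inside known spaced abbreviations."""
    return _PATTERN.sub(lambda m: re.sub(r"\s+", "", m.group(0)), text)


sample = "Das gilt z. B. für Obst, d. h. für Äpfel."
normalized = collapse_spaced_abbreviations(sample)
```

The normalized text could then be passed to the sentence tokenizer with the unspaced abbreviation types in place. Note that this changes the text itself, so character offsets in the output no longer line up with the original input.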

Can anybody advise me on how to add this abbreviation as an exception for the sentence tokenizer?

Thanks in advance,

Philip

[1] https://stackoverflow.com/questions/69734355/configure-punktsentencetokenizer-and-specify-language
