help with Arabic language tokenizer

sara ali

Jul 27, 2021, 3:46:55 PM

to nltk-users

Hello

I have an Arabic corpus that I want to split into characters, i.e. letters. I am currently using "nltk.word_tokenize". Could someone kindly recommend a way to split Arabic text into characters?

Kind Regards

Alexis

Jul 30, 2021, 1:54:00 AM

to nltk-users

This should be handled well by Python itself; you do not need a tokenizer to split text into letters.

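1. Make sure you are using Python 3, not Python 2.7. In Python 3 a str is a sequence of Unicode characters, while a plain Python 2.7 str is a sequence of bytes.

2. Read in the file with the correct encoding, so that the bytes are decoded into the right characters.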

3. If s is a string containing Arabic, any iteration or list operation will process one character at a time. E.g.,

    letters = list(s)

or

    for x in s:
        print(x)
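
The steps above can be sketched roughly as follows. This is only a minimal sketch and not part of NLTK: the function names read_text, split_characters and split_letters and the file name corpus.txt are hypothetical, and UTF-8 is an assumed encoding that may not match your corpus. One caveat worth knowing: Arabic short-vowel marks (harakat) are separate code points, so list(s) returns them as items of their own. If you want letters only, str.isalpha() can filter out those marks along with spaces, digits and punctuation.

```python
def read_text(path, encoding="utf-8"):
    """Read a whole file as text. UTF-8 is an assumption; use your corpus's real encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


def split_characters(text):
    """Return every character (code point) in the text, including spaces and diacritics."""
    return list(text)


def split_letters(text):
    """Return only alphabetic characters, dropping spaces, digits, punctuation and diacritics."""
    return [ch for ch in text if ch.isalpha()]


# Hypothetical usage with a file on disk:
#     s = read_text("corpus.txt")
sample = "مرحبا بالعالم"

for ch in split_letters(sample):
    print(ch)
```

split_characters keeps everything exactly as it appears in the text, which is what list(s) does. split_letters is a stricter variant for when whitespace and vowel marks should not count as letters.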

Alexis
