help with Arabic language tokenizer


sara ali

Jul 27, 2021, 3:46:55 PM
to nltk-users
Hello,
I have an Arabic corpus that I want to split into characters, i.e. letters. I am currently using "nltk.word_tokenize". Could someone please recommend a way to split Arabic text into characters?

Kind Regards 

Alexis

Jul 30, 2021, 1:54:00 AM
to nltk-users
Python itself should handle this well; you do not need a tokenizer to split text into letters.

1. Make sure you are using Python 3, not Python 2.7.
2. Read in the file with the correct encoding.
3. If s is a string containing Arabic, any iteration or list operation will process one character at a time. E.g.,

    letters = list(s)

or

    for x in s:
        print(x)
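
Putting the three steps together, here is a minimal sketch rather than a definitive recipe. It assumes the corpus is UTF-8 encoded (check your file), and the names read_corpus, split_letters, and keep_marks are made up for illustration. One thing to be aware of: a Python 3 string iterates by Unicode code point, so Arabic diacritics (fatha, kasra, and so on) come out as separate items. The sketch drops them by default; pass keep_marks=True if you want them.

```python
import unicodedata


def read_corpus(path, encoding="utf-8"):
    """Read the whole file as text. UTF-8 is an assumption; use your file's real encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


def split_letters(text, keep_marks=False):
    """Return the characters of text as a list, skipping whitespace.

    With keep_marks=False, combining marks (Unicode categories starting
    with "M", which include the Arabic short-vowel diacritics) are dropped.
    """
    letters = []
    for ch in text:
        if ch.isspace():
            continue
        if not keep_marks and unicodedata.category(ch).startswith("M"):
            continue
        letters.append(ch)
    return letters


# Sample word written with escapes: kaf, kasra, ta, fatha, alef, ba
sample = "\u0643\u0650\u062a\u064e\u0627\u0628"
print(split_letters(sample))
print(split_letters(sample, keep_marks=True))
```

nltk.word_tokenize returns word-level tokens, so it is not the right tool for per-letter splitting. For a real corpus you would call something like split_letters(read_corpus("corpus.txt")), where the file name is a placeholder.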

Alexis