Tokenizers

Adham Ahmed

Jul 4, 2024, 6:41:27 AM
to sig...@googlegroups.com
Hi everyone, 

I am writing to ask for any useful resources for Arabic sentence tokenization. I have a large corpus that I need to split into sentences. I have tried NLTK's tokenizer as well as PyArabic, but both seem inaccurate and split the text into paragraphs rather than sentences.

Any suggestions or guidance would be greatly appreciated.

Kind regards everyone, 
Adham 

Adham Ahmed

Jul 4, 2024, 6:43:38 AM
to sig...@googlegroups.com
I need to split the corpus into sentences in order to sample sentences for some target words. Any suggestions on how to tackle this would be greatly appreciated as well.

Thank you in advance. 

Ahmed, H.I.A.A. (Hossam)

Jul 4, 2024, 8:08:59 AM
to Adham Ahmed, sig...@googlegroups.com

Hi Adham,

Depending on what you mean by “sentence,” it may be easier simply to write your own code (perhaps with regular expressions) based on your exact definition of a sentence. Libraries are often generic and will require pre- or post-processing.
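
As a rough illustration of the regular-expression approach, here is a minimal Python sketch. It assumes that a sentence ends at one or more of the marks . ! ? ؟ … (optionally followed by a closing quote or bracket) and then whitespace, and that every line break is also a boundary. The function name split_sentences and the set of terminators are placeholders chosen for the example, not part of any library; adjust them to your own definition of a sentence.

```python
import re

# Assumed definition: a run of sentence-final marks, optional closing
# quotes/brackets, then whitespace. Change the character classes to taste.
SENTENCE_END = re.compile(r'([.!?؟…]+["»”)\]]*)\s+')

def split_sentences(text):
    """Split text into sentences using the assumed boundary rule above."""
    # Turn the whitespace after each terminator into a newline,
    # keeping the terminator attached to its sentence.
    marked = SENTENCE_END.sub(r'\1\n', text)
    # Existing line breaks are treated as boundaries too.
    return [s.strip() for s in marked.split('\n') if s.strip()]

if __name__ == "__main__":
    sample = "ذهب الولد إلى المدرسة. هل عاد؟ نعم!\nفقرة جديدة بلا علامة"
    for sentence in split_sentences(sample):
        print(sentence)
```

A rule like this will split wrongly after abbreviations that end in a period (for example a title followed by a name), and it does nothing for text without punctuation, so some post-processing is likely still needed.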

Good luck

Hossam

--
You received this message because you are subscribed to the Google Groups "SIGARAB: Special Interest Group on Arabic Natural Language Processing" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sigarab+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/sigarab/CADF0CJgeNtnm-%2B0LRRFr1UzPEhgtqaEs8aR_tXixF4Kdg8yGLA%40mail.gmail.com.

Mohamed H.

Jul 4, 2024, 11:52:50 AM
to sig...@googlegroups.com

Assalamu alaikum,

Are you using a dialect or MSA or Classical?

I had a discussion about this on SIGARAB some months back.

Professor Sane Yagi and his research team worked on an ML solution that can detect sentence boundaries.

I would reference that thread and look up the research they did.

As brother Hossam said, you can also write your own solution if your corpus has English-style or other detectable punctuation.
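
One quick way to check whether a corpus has enough detectable punctuation is to count sentence-final marks against the number of words. The Python sketch below does that. The name punctuation_profile and the terminator set are assumptions made for the example, and what counts as "enough" punctuation is a judgment call for your own data.

```python
from collections import Counter

# Assumed set of sentence-final marks (Latin and Arabic).
TERMINATORS = ".!?؟…"

def punctuation_profile(text):
    """Return (count per terminator, average words per terminator)."""
    counts = Counter(ch for ch in text if ch in TERMINATORS)
    total_marks = sum(counts.values())
    total_words = len(text.split())
    # No marks at all: report infinity rather than dividing by zero.
    words_per_mark = total_words / total_marks if total_marks else float("inf")
    return counts, words_per_mark

if __name__ == "__main__":
    counts, ratio = punctuation_profile("ذهب الولد. هل عاد؟ نعم!")
    print(dict(counts), ratio)
```

If the average number of words per mark is very large, the text is probably too sparsely punctuated for a rule-based splitter, and a learned boundary detector is the more promising route.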

Shukran,
Mohamed
