We need your advice on a text extraction task

Salsabil Abdalbaki

Jan 29, 2026, 6:25:25 AM
to SIGARAB: Special Interest Group on Arabic Natural Language Processing

I am writing to seek your advice on extracting Arabic text from PDF files. In this work, we extract Arabic text from PDFs and systematically evaluate challenges that arise when documents contain non-text elements.

We use multilingual-pdf2text and Pytesseract as our primary packages. Both achieve high accuracy on the majority of our corpus; however, they consistently fail to extract non-text content, such as the elements shown in the attached screenshots. Through our analysis, we identified three recurring issues:

  • Paragraph order errors: Extracted paragraphs sometimes appear in incorrect positions relative to the surrounding content. For example, this image 

image.png

is extracted in the output .txt file as follows:

“ددوا وقاربوا أنه لن يدخل أحدكم عمله الجنة وأن”, that is, only a fragment of the full passage, surrounded by garbled text extracted in the wrong order.

  • Qur'anic text extraction accuracy: Extraction accuracy remains limited. In particular, Pytesseract performs poorly on Qur'anic fonts with dense diacritical marks. For example, this image

image.png

is extracted in the output .txt file as follows:

(فبدل الين لوا ولد الذي قيل لهخ)

  • First-line noise: Noise frequently appears in the first line of extracted text that follows non-text content.
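To make the accuracy problem on diacritized text measurable, one option is a character error rate (CER) computed both with and without diacritics. The sketch below is only an illustration and is not part of our current pipeline: the function names are hypothetical, it uses only the Python standard library, and it assumes that dropping all Unicode combining marks (category Mn) is an acceptable approximation of removing Arabic diacritics.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks (Unicode category Mn), which includes Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str, ignore_diacritics: bool = False) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if ignore_diacritics:
        reference = strip_diacritics(reference)
        hypothesis = strip_diacritics(hypothesis)
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Comparing `cer(ref, ocr)` with `cer(ref, ocr, ignore_diacritics=True)` for the same passage would indicate how much of the error comes from the diacritics alone and how much from the base letters.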


Any guidance or leads you could provide to help us complete this extraction task would be highly appreciated.

Thanks!

Salsabil Abdalbaki

Feb 4, 2026, 5:55:37 AM
to Hadda CHERROUN, SIGARAB: Special Interest Group on Arabic Natural Language Processing
Thanks a lot for the suggestions!

On Fri, 30 Jan 2026 at 13:55, Hadda CHERROUN <hadda_c...@lagh-univ.dz> wrote:
Salem,
Kaoutar, keep this discussion, as we will need it for the ASJP Algerian papers that we will analyse.
Best,

--

Prof. Hadda CHERROUN
Département d’Informatique
Chef d'équipe Evaluation, Modélisation et Optimisation, Laboratoire LIM
Université Amar Télidji
BP. 37G Route de Ghardaia M’Kam
03000 Laghouat, Algérie.
Mobile: +213 (0) 6 98 73 89 08
Tél +213 (0) 29 14.53.00
Fax: +213 (0) 29 14.53.00
E-mail: hadda_cherroun (at) lagh-univ.dz
WWW: http://perso.lagh-univ.dz/~hcherroun