I am writing to seek your advice on extracting Arabic text from PDF files. In this work, we extract Arabic text from PDFs and systematically evaluate challenges that arise when documents contain non-text elements.
We use multilingual-pdf2text and Pytesseract as our primary packages. Both achieve high accuracy on the majority of our corpus; however, they consistently fail to extract non-text content, such as the elements shown in the attached screenshots. Through our analysis, we identified three recurring issues:
Paragraph order errors: Extracted paragraphs sometimes appear in incorrect positions relative to the surrounding content. For example, this image
is extracted in the output .txt file as follows:
“ددوا وقاربوا أنه لن يدخل أحدكم عمله الجنة وأن”, which omits part of the passage and is surrounded by garbled text extracted in the wrong order.
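One way the ordering problem might be approached is to re-sort the extracted blocks by position before writing the .txt file. The sketch below is only an illustration under assumptions: it takes each block as a dict with `left`, `top`, `width` and `text` keys (similar in spirit to the box fields that Pytesseract's `image_to_data` reports), and the function name `order_blocks_rtl` and the `line_tolerance` value are hypothetical. It assumes a simple layout and would not handle complex multi-column pages.

```python
from typing import Dict, List


def order_blocks_rtl(blocks: List[Dict], line_tolerance: int = 10) -> List[Dict]:
    """Sort text blocks top-to-bottom, then right-to-left within a row."""
    rows: List[List[Dict]] = []
    for block in sorted(blocks, key=lambda b: b["top"]):
        # Blocks whose top edge is within the tolerance of the row's first
        # block share that row; otherwise a new row is started.
        if rows and abs(block["top"] - rows[-1][0]["top"]) <= line_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    ordered: List[Dict] = []
    for row in rows:
        # Arabic reads right to left, so the largest right edge comes first.
        ordered.extend(
            sorted(row, key=lambda b: b["left"] + b["width"], reverse=True)
        )
    return ordered


# Hypothetical boxes: "A" and "B" share a row, "C" sits further down the page.
blocks = [
    {"left": 50, "top": 102, "width": 200, "text": "B"},
    {"left": 300, "top": 100, "width": 200, "text": "A"},
    {"left": 300, "top": 400, "width": 200, "text": "C"},
]
reading_order = [b["text"] for b in order_blocks_rtl(blocks)]
```

The tolerance would need tuning per scan resolution, since it decides when two blocks count as being on the same row.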
Qur'anic text extraction accuracy: Extraction accuracy remains limited. In particular, Pytesseract performs poorly on Qur'anic fonts with dense diacritical marks. For example, this image
is extracted in the output .txt file as follows:
“(فبدل الين لوا ولد الذي قيل لهخ)”
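To quantify how much of the error comes from the diacritics themselves, one option is to compute a character error rate (CER) twice: once on the raw strings and once after stripping combining marks. The following is a minimal sketch, not our actual evaluation code. The helper names are hypothetical, the example strings are illustrative rather than taken from our corpus, and removing every character of Unicode category `Mn` is an assumption that covers Arabic harakat but may not suit every Qur'anic annotation sign.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Remove combining marks (category Mn), which include Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative pair: a fully vocalized word and an OCR output without marks.
reference = "\u0628\u0650\u0633\u0652\u0645\u0650"
hypothesis = "\u0628\u0633\u0645"

cer_raw = cer(reference, hypothesis)
cer_bare = cer(strip_marks(reference), strip_marks(hypothesis))
```

A large gap between the two values would suggest that the base letters are mostly recognized and that the losses are concentrated in the diacritics.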
First-line noise: Noise frequently appears in the first extracted line when that line follows non-text content.
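As a possible stopgap for the first-line noise, the first line could be dropped whenever it contains too few Arabic characters. This is only a heuristic sketch: the function names and the `min_ratio` threshold of 0.6 are assumptions, only the basic Arabic block (U+0600 to U+06FF) is counted, and a legitimate first line made up mostly of digits or Latin text would be removed as well.

```python
def arabic_ratio(line: str) -> float:
    """Share of non-space characters that fall in the basic Arabic block."""
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        return 0.0
    arabic = sum(1 for ch in chars if "\u0600" <= ch <= "\u06FF")
    return arabic / len(chars)


def drop_noisy_first_line(text: str, min_ratio: float = 0.6) -> str:
    """Remove the first line if it looks like OCR noise, keep the rest as is."""
    lines = text.splitlines()
    if lines and arabic_ratio(lines[0]) < min_ratio:
        lines = lines[1:]
    return "\n".join(lines)


# Hypothetical extraction: a noisy first line followed by real Arabic text.
sample = "|_ ~~ .: 7\n\u0628\u0633\u0645 \u0627\u0644\u0644\u0647"
cleaned = drop_noisy_first_line(sample)
```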
Salem Kaoutar, keep this discussion, as we will need it for the ASJP Algerian papers that we will analyse.
Best
--
You received this message because you are subscribed to the Google Groups "SIGARAB: Special Interest Group on Arabic Natural Language Processing" group.
---
Prof. Hadda CHERROUN
Département d’Informatique
Chef d'équipe Evaluation, Modélisation et Optimisation, Laboratoire LIM
Fax: +213 (0) 29 14.53.00
E-mail: hadda_cherroun (at) lagh-univ.dz
Université Amar Télidji
BP. 37G Route de Ghardaia M’Kam
03000 Laghouat, Algérie.
Mobile: +213 (0) 6 98 73 89 08
WWW: http://perso.lagh-univ.dz/~hcherroun