I am writing to seek your advice on extracting Arabic text from PDF files. In our current work, we are systematically evaluating the challenges that arise when the documents contain non-text elements.
We use multilingual-pdf2text and Pytesseract as our primary packages. Both achieve high accuracy on the majority of our corpus; however, they consistently fail to extract non-text content, such as the elements shown in the attached screenshots. Through our analysis, we identified three recurring issues:
Paragraph order errors: Extracted paragraphs sometimes appear in incorrect positions relative to the surrounding content. For example, the attached image 1 is extracted in the output .txt file as follows:
“ددوا وقاربوا أنه لن يدخل أحدكم عمله الجنة وأن”. The passage is incomplete, and it is surrounded by other text that was extracted in the wrong order.
Qur'anic text extraction accuracy: Extraction accuracy remains limited. In particular, Pytesseract performs poorly on Qur'anic fonts with dense diacritical marks. For example, the attached image 2 is extracted in the output .txt file as follows:
“(فبدل الين لوا ولد الذي قيل لهخ)”.
First-line noise: Noise frequently appears in the first extracted line when that line follows non-text content.
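For the paragraph order issue, one possible workaround is to reorder the extracted blocks ourselves using their bounding boxes. The sketch below is only an illustration under assumptions: the `Block` class, the `rtl_reading_order` function, and the `line_tolerance` value are hypothetical names and settings, not part of either package. It assumes that each block comes with page coordinates (for example, from an OCR engine's word-level or line-level output) and that the page has a single column.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A piece of extracted text with its bounding box (hypothetical structure)."""
    text: str
    x0: float      # left edge
    top: float     # top edge, measured downward from the top of the page
    x1: float      # right edge
    bottom: float  # bottom edge


def rtl_reading_order(blocks, line_tolerance=5.0):
    """Group blocks into lines by vertical position, then read each line right to left.

    line_tolerance is an assumed threshold in page units; it needs tuning per corpus.
    """
    lines = []
    for block in sorted(blocks, key=lambda b: b.top):
        # A block joins the current line if its top is close to the line's first block.
        if lines and abs(block.top - lines[-1][0].top) <= line_tolerance:
            lines[-1].append(block)
        else:
            lines.append([block])

    ordered = []
    for line in lines:
        # Arabic is written right to left, so the block with the largest x1 comes first.
        ordered.extend(sorted(line, key=lambda b: b.x1, reverse=True))
    return ordered


if __name__ == "__main__":
    shuffled = [
        Block("B", 100, 102, 300, 120),
        Block("C", 100, 130, 500, 150),
        Block("A", 320, 100, 500, 120),
    ]
    print([b.text for b in rtl_reading_order(shuffled)])
```

This only handles simple single-column pages. Multi-column layouts, footnotes, and text wrapped around figures would need a proper layout analysis step first.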
Try Surya:
https://github.com/datalab-to/surya
LLMs are also fairly decent on structured text, so they might work as well.
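If it helps when comparing candidates against the current pipeline, here is a minimal sketch of a character error rate (CER) check that can optionally ignore diacritics. The function names `strip_diacritics`, `edit_distance`, and `cer` are hypothetical helpers, not part of any of the tools mentioned, and the sketch assumes that a manually verified reference transcription is available for each test passage. Scoring with and without diacritics can show whether the errors on Qur'anic text come mostly from the diacritical marks or from the base letters.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (Unicode category Mn), which includes the Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis, ignore_diacritics=False):
    """Character error rate: edit distance divided by the reference length."""
    if ignore_diacritics:
        reference = strip_diacritics(reference)
        hypothesis = strip_diacritics(hypothesis)
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "\u0628\u0650\u0633\u0652\u0645\u0650"  # "bismi" with diacritics
    hypothesis = "\u0628\u0633\u0645"                    # same letters, diacritics lost
    print(cer(reference, hypothesis))
    print(cer(reference, hypothesis, ignore_diacritics=True))
```

A large gap between the two scores would suggest that the base letters are mostly right and the diacritics are the problem; a small gap would point at the font or the layout instead.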
Shukran,
--
You received this message because you are subscribed to the Google Groups "SIGARAB: Special Interest Group on Arabic Natural Language Processing" group.
-- Find me at: https://www.kentoseth.com https://fosstodon.org/web/@kentoseth

Samhaa R. El-Beltagy
Professor of Computer Science
--