We need your advice on a text extraction task


Salsabil Abdalbaki

Jan 29, 2026, 7:20:45 AM
to SIGARAB: Special Interest Group on Arabic Natural Language Processing
Dear all, 

Apologies for resending this message, but it seems like the pictures were not attached correctly.

I am writing to seek your advice on extracting Arabic text from PDF files. In this work, we extract Arabic text from PDFs and systematically evaluate challenges that arise when documents contain non-text elements.

We use multilingual-pdf2text and Pytesseract as our primary packages. Both achieve high accuracy on most of our corpus; however, they consistently fail on pages that contain non-text elements, such as those shown in the attached screenshots. Through our analysis, we identified three recurring issues:

  • Paragraph order errors: Extracted paragraphs sometimes appear in incorrect positions relative to the surrounding content. For example, the passage in the attached Image 1 appears in the output .txt file as follows:

    ددوا وقاربوا أنه لن يدخل أحدكم عمله الجنة وأن

    The full passage is missing, and this fragment is surrounded by messy text extracted in the wrong order.

  • Qur'anic text extraction accuracy: Extraction accuracy on Qur'anic text remains limited. In particular, Pytesseract performs poorly on Qur'anic fonts with dense diacritical marks. For example, the verse in the attached Image 2 appears in the output .txt file as follows:

    (فبدل الين لوا ولد الذي قيل لهخ)

  • First-line noise: Noise frequently appears in the first line extracted after non-text content.
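
A minimal sketch of how failures like these could be scored against a hand-typed reference transcription. It computes a character error rate (CER) with and without diacritics, so that errors caused by dense diacritical marks can be separated from errors in the base letters. The helper names (strip_diacritics, edit_distance, cer) are hypothetical and are not part of multilingual-pdf2text or Pytesseract; treating every Unicode combining mark as a diacritic is an assumption that may need adjusting for a given corpus.

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character; carries no lexical content


def strip_diacritics(text: str) -> str:
    """Drop nonspacing combining marks (tashkeel, Qur'anic signs) and tatweel."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    )


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str, ignore_diacritics: bool = False) -> float:
    """Character error rate of hypothesis measured against reference."""
    if ignore_diacritics:
        reference = strip_diacritics(reference)
        hypothesis = strip_diacritics(hypothesis)
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Comparing cer(ref, out) with cer(ref, out, ignore_diacritics=True) on the same page gives a rough indication of how much of the error comes from diacritics alone.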

Any guidance or leads you could provide to help us complete this extraction task would be highly appreciated.

Thanks!

Salsabil Abdalbaki
PhD student, School of Politics & International Relations (SPIRe), University College Dublin
Research Assistant in Warrior project.
Attachments: Image 1.png, Image 2.PNG

Mohamed H.

Jan 29, 2026, 6:11:23 PM
to sig...@googlegroups.com

Try Surya:

https://github.com/datalab-to/surya

LLMs are also fairly decent on structured text, so those might work as well.

Shukran,

-- 
Find me at:
https://www.kentoseth.com
https://fosstodon.org/web/@kentoseth

Samhaa El-Beltagy

Jan 30, 2026, 6:20:25 AM
to Salsabil Abdalbaki, SIGARAB: Special Interest Group on Arabic Natural Language Processing
In my experience, Gemini models are the best at processing visual information. I would try Gemini Flash 2 or 3. Gemini offers a decent free request quota that you can use for testing. Please make sure you provide a good prompt and that you upload the file to Gemini (consult the API reference). Here is a simple test using the web interface:

image.png
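
For the API route, here is a rough sketch of the "good prompt plus file upload" steps. It assumes the google-genai Python SDK and an API key stored in a GEMINI_API_KEY environment variable. The model name, the prompt wording, and the helper names build_prompt and extract_with_gemini are illustrative assumptions, so please check them against the current API reference before relying on them.

```python
import os

# Assumption: replace with whichever Flash model your quota covers.
DEFAULT_MODEL = "gemini-2.0-flash"


def build_prompt(keep_diacritics: bool = True) -> str:
    """Assemble an extraction prompt aimed at the issues raised in this thread."""
    lines = [
        "Extract all Arabic text from the attached PDF.",
        "Preserve the original reading order, right to left and top to bottom.",
        "Transcribe text inside images, frames and decorative boxes as well.",
        "Return plain text only, with no commentary and no translation.",
    ]
    if keep_diacritics:
        lines.append(
            "Keep every diacritical mark exactly as printed, "
            "including in Qur'anic verses."
        )
    return "\n".join(lines)


def extract_with_gemini(pdf_path: str, model: str = DEFAULT_MODEL) -> str:
    """Upload one PDF and ask the model for its text (needs network access)."""
    # Imported lazily so build_prompt works without the SDK installed.
    from google import genai  # pip install google-genai

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    uploaded = client.files.upload(file=pdf_path)
    response = client.models.generate_content(
        model=model,
        contents=[uploaded, build_prompt()],
    )
    return response.text or ""
```

Calling extract_with_gemini on a handful of the problematic pages first should show whether the output is worth the quota before running the whole corpus.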

Good luck, 

Samhaa R. El-Beltagy
Professor of Computer Science 

Newgiza University (NGU)
Newgiza, km 22 Cairo-Alex Desert Rd
Cairo, Egypt

