Assalamu alaikum,
You need to use OCR (optical character recognition) to extract the text from the PDF.
Here is an excerpt from an article I wrote about using OCR for Arabic:
In the OCR field, a tool called eScriptorium is, in my subjective opinion, currently the best at scanning Arabic texts. It is an open-source project built on top of another open-source project called Kraken.
The models to use with eScriptorium are available here:
Printed Urdu Base Model Trained on the OpenITI Corpus
Printed Persian Base Model Trained on the OpenITI Corpus
Printed Ottoman Base Model Trained on the OpenITI Corpus
Printed Arabic Base Model Trained on the OpenITI Corpus
Printed Arabic-Script Base Model Trained on the OpenITI Corpus (here is a quote regarding the difference between the Arabic model above and this model: "The former has been trained only on Arabic language prints, the latter is trained on multiple languages that all use the Arabic script (Arabic, Persian, Urdu, Ottoman).")
https://www.kentoseth.com/posts/2023/nov/10/update-2-ocr-cat-of-classical-arabic-works/
OCR is only necessary if the text is not indexed within the PDF, that is, if the PDF contains only page images and no text layer.
What this means is that if you can select the text within the PDF, you may only need to convert the PDF to a text document (many free online tools are available for this) to obtain the text.
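If you would rather check this programmatically than by trying to select text by hand, here is a minimal sketch. It assumes you have already pulled the text of each page out of the PDF; the helper names, the 20-character threshold, and the choice of the third-party pypdf library are my own assumptions for illustration, not something from the article. A page with almost no Arabic-script characters in its text layer is flagged as needing OCR.

```python
from typing import Iterable, List, Optional

# Unicode blocks covering Arabic-script characters, including the
# presentation forms that PDF text layers often contain.
ARABIC_RANGES = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFC),  # Arabic Presentation Forms-B (assigned part)
)


def count_arabic_chars(text: str) -> int:
    """Count the characters in `text` that fall in an Arabic-script block."""
    return sum(
        1
        for ch in text
        if any(lo <= ord(ch) <= hi for lo, hi in ARABIC_RANGES)
    )


def pages_needing_ocr(
    page_texts: Iterable[Optional[str]],
    min_arabic_chars: int = 20,  # assumed threshold, tune for your documents
) -> List[int]:
    """Return 0-based indexes of pages whose text layer looks empty or unusable."""
    return [
        index
        for index, text in enumerate(page_texts)
        if count_arabic_chars(text or "") < min_arabic_chars
    ]


def extract_page_texts(pdf_path: str) -> List[str]:
    """Read the text layer of each page using pypdf (pip install pypdf)."""
    from pypdf import PdfReader  # third-party; imported lazily on purpose

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

You would call `pages_needing_ocr(extract_page_texts("book.pdf"))`, where `book.pdf` is a placeholder path. An empty result suggests the text layer is usable as is; any listed pages are candidates for OCR. Note that a text layer can exist and still be of poor quality, so it is worth spot-checking a few pages by eye.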
I hope this helps you.
Shukran,
--
You received this message because you are subscribed to the Google Groups "SIGARAB: Special Interest Group on Arabic Natural Language Processing" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sigarab+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/sigarab/CAE3uUjBabw6JYUjZ2FeqCLrBhMdS9uS5cYmidEv16%3DpBqYrG4g%40mail.gmail.com.
-- Find me at: https://www.kentoseth.com https://fosstodon.org/web/@kentoseth