[Arabic LLM] Arabic PDF Parsing with Tashkil

Maher Jendoubi

May 9, 2024, 6:10:47 AM

to sig...@googlegroups.com

Dear Community,

I have an Arabic PDF file whose text includes Tashkil (diacritics).

What is the process for converting this file into Markdown format?

Thank you.

Best regards,

Maher

Mohamed H.

May 9, 2024, 6:38:37 AM

to sig...@googlegroups.com

Assalamu alaikum,

You need to use OCR (optical character recognition) to extract the text from the PDF.

Here is an excerpt from an article I wrote about using OCR for Arabic:

In the OCR field, a tool called eScriptorium is currently, in my subjective opinion, the best at scanning Arabic texts. It is an open-source project built on top of another open-source project called Kraken.

The models to use with eScriptorium are available here:

Printed Urdu Base Model Trained on the OpenITI Corpus

Printed Persian Base Model Trained on the OpenITI Corpus

Printed Ottoman Base Model Trained on the OpenITI Corpus

Printed Arabic Base Model Trained on the OpenITI Corpus

Printed Arabic-Script Base Model Trained on the OpenITI Corpus (here is a quote regarding the difference between this model and the Arabic model above: "The former has been trained only on Arabic language prints, the latter is trained on multiple languages that all use the Arabic script (Arabic, Persian, Urdu, Ottoman).")

https://www.kentoseth.com/posts/2023/nov/10/update-2-ocr-cat-of-classical-arabic-works/

OCR is only necessary if the PDF has no embedded text layer.

In other words, if you can select the text within the PDF, you may only need to convert the PDF to a text document (many free online tools are available for this) to obtain the text.
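
As a rough way to run that check in code, here is a minimal sketch. It assumes the third-party pypdf package (my choice for illustration, not something required); the helper names and the 20-character threshold are hypothetical and can be tuned.

```python
def extract_page_texts(pdf_path):
    """Return the extracted text of each page, using pypdf (assumed installed)."""
    # Imported lazily so has_text_layer() below can be used without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def has_text_layer(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is text-based rather than scanned.

    min_chars_per_page is an arbitrary illustrative threshold.
    Returns True when at least half of the pages yield that much text.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    return pages_with_text >= len(page_texts) / 2
```

Calling `has_text_layer(extract_page_texts("book.pdf"))` on a file of your own (the file name here is a placeholder) gives a quick hint: True suggests plain conversion may be enough, False suggests the pages are scanned and OCR is the way to go.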

I hope this helps you.

Shukran,

-- 
Find me at:
https://www.kentoseth.com
https://fosstodon.org/web/@kentoseth

Maher Jendoubi

May 10, 2024, 4:19:02 AM

to Mohamed H., sig...@googlegroups.com

Wa Alaikom Assalam Mohamed,

Thank you for this suggestion.

I don't need OCR because the PDF doesn't contain images and the text is not handwritten.

I used the following Python script:

import nest_asyncio
import os
import logging
from llama_parse import LlamaParse
from time import sleep

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Apply asyncio patches
nest_asyncio.apply()

# Configure API key
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-*********************************************$"

max_retries = 3
retry_delay = 5  # seconds

for attempt in range(max_retries):
    try:
        logging.info(f"Attempt {attempt + 1}: Starting to load the document.")
        documents = LlamaParse(result_type="markdown").load_data("AlBayan.pdf")

        if documents:
            logging.info("Document loaded successfully, starting to write to markdown.")
            with open("AlBayan.md", "w", encoding="utf-8") as file:
                file.write(documents[0].text)
            logging.info("The extracted Markdown has been saved to AlBayan.md")
            break  # Exit the retry loop if successful
        else:
            logging.warning("No documents were parsed. Please check the PDF file and try again.")

    except Exception as e:
        logging.error(f"An error occurred: {str(e)}")
        if attempt < max_retries - 1:
            logging.info(f"Retrying in {retry_delay} seconds...")
            sleep(retry_delay)
        else:
            logging.error("Max retries reached. Exiting.")
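
To see whether the Tashkil survived the conversion, a small check along these lines may help. This is only a sketch using the standard library; the function names are illustrative, and the set of marks covers the common harakat (U+064B to U+0652) plus the superscript alef (U+0670), which is an assumption about what counts as Tashkil here.

```python
# Common Arabic diacritics: fathatan through sukun, plus superscript alef.
TASHKIL = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def is_arabic_letter(ch):
    """True for basic Arabic letters (U+0621 to U+064A), excluding tatweel."""
    return "\u0621" <= ch <= "\u064A" and ch != "\u0640"


def tashkil_ratio(text):
    """Return diacritic marks per Arabic letter (0.0 if there are no letters)."""
    marks = sum(1 for ch in text if ch in TASHKIL)
    letters = sum(1 for ch in text if is_arabic_letter(ch))
    return marks / letters if letters else 0.0


def strip_tashkil(text):
    """Return the text with the diacritics removed, e.g. for comparison."""
    return "".join(ch for ch in text if ch not in TASHKIL)
```

For example, reading AlBayan.md with `open("AlBayan.md", encoding="utf-8").read()` and passing the result to `tashkil_ratio` gives a number to compare against the same measure on text copied by hand from the PDF. A ratio near zero would suggest the marks were dropped somewhere along the way; what counts as "normal" depends on how fully vocalized the source is.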

Shukran,

Maher
