[Arabic LLM] Arabic PDF Parsing with Tashkil


Maher Jendoubi

May 9, 2024, 6:10:47 AM
to sig...@googlegroups.com
Dear Community,

I possess a PDF file in Arabic containing Tashkil.
What is the process for converting this file into Markdown format?

Thank you.

Best regards,
Maher

Mohamed H.

May 9, 2024, 6:38:37 AM
to sig...@googlegroups.com

Assalamu alaikum,

You need to use OCR (optical character recognition) to extract the text from the PDF.

Here is an excerpt from an article I wrote about using OCR for Arabic:

In the OCR field, a tool called eScriptorium is currently the best (in my subjective opinion) at scanning Arabic texts. It is an open-source project built on top of another open-source project called Kraken.
The models to use with eScriptorium are available here:

https://www.kentoseth.com/posts/2023/nov/10/update-2-ocr-cat-of-classical-arabic-works/

This is only necessary if the text is not indexed within the PDF, that is, if the PDF has no selectable text layer.

In other words, if you can select the text within the PDF, you may only need to convert the PDF to a text document (many free online tools are available for this) to obtain the text.
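A quick way to tell which case applies is to try extracting text directly and see whether anything comes back. The following is only a minimal sketch, not taken from the article: it assumes the pypdf package is installed, and the helper names and the 20-character threshold are hypothetical choices for illustration.

```python
def looks_like_text_layer(page_texts, min_chars=20):
    """Return True if the extracted pages contain a meaningful amount of text.

    min_chars is an arbitrary threshold (an assumption); a scanned PDF
    usually yields empty strings or only a few stray characters.
    """
    total = sum(len(t.strip()) for t in page_texts if t)
    return total >= min_chars


def extract_page_texts(pdf_path):
    """Extract the text of each page with pypdf (assumed: pip install pypdf)."""
    # Imported here so that looks_like_text_layer has no third-party dependency.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


# Example usage (the file name is a placeholder):
#   texts = extract_page_texts("document.pdf")
#   if looks_like_text_layer(texts):
#       print("Text layer found; OCR is probably not needed.")
#   else:
#       print("No usable text layer; OCR is probably needed.")
```

If the check fails, the OCR route above applies. If it passes, the extracted text can be converted directly, although right-to-left ordering and diacritics should still be inspected by eye.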

I hope this helps you.

Shukran,

-- 
Find me at:
https://www.kentoseth.com
https://fosstodon.org/web/@kentoseth

Maher Jendoubi

May 10, 2024, 4:19:02 AM
to Mohamed H., sig...@googlegroups.com
Wa Alaikom Assalam Mohamed,

Thank you for this suggestion.

I don't need OCR, because the text does not contain images and is not handwritten.

I used the following Python script:

import nest_asyncio
import os
import logging
from llama_parse import LlamaParse
from time import sleep

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Apply asyncio patches
nest_asyncio.apply()

# Configure API key
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-*********************************************$"

max_retries = 3
retry_delay = 5  # seconds

for attempt in range(max_retries):
    try:
        logging.info(f"Attempt {attempt + 1}: Starting to load the document.")
        documents = LlamaParse(result_type="markdown").load_data("AlBayan.pdf")

        if documents:
            logging.info("Document loaded successfully, starting to write to markdown.")
            with open("AlBayan.md", "w", encoding="utf-8") as file:
                file.write(documents[0].text)
            logging.info("The extracted Markdown has been saved to AlBayan.md")
            break  # Exit the retry loop if successful
        else:
            logging.warning("No documents were parsed. Please check the PDF file and try again.")

    except Exception as e:
        logging.error(f"An error occurred: {str(e)}")
        if attempt < max_retries - 1:
            logging.info(f"Retrying in {retry_delay} seconds...")
            sleep(retry_delay)
        else:
            logging.error("Max retries reached. Exiting.")
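One way to check whether the Tashkil survived a conversion like this is to count the diacritic marks in the output. The sketch below is an illustration only and is not part of the script above: the function names are hypothetical, and the set of code points (U+064B to U+0652 plus the superscript alef U+0670) is an assumption about which marks count as Tashkil.

```python
# Arabic diacritics (Tashkil): fathatan (U+064B) through sukun (U+0652),
# plus the superscript alef (U+0670). Assumed set; extend it if needed.
TASHKIL = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def count_tashkil(text):
    """Count the diacritic marks in the text."""
    return sum(1 for ch in text if ch in TASHKIL)


def tashkil_ratio(text):
    """Diacritics per Arabic base character; a value near zero on a
    vocalized source suggests the marks were dropped during parsing."""
    # U+0621..U+064A roughly covers the basic Arabic letters.
    letters = sum(1 for ch in text if "\u0621" <= ch <= "\u064A")
    return count_tashkil(text) / letters if letters else 0.0
```

For example, reading AlBayan.md with encoding="utf-8" and passing its contents to tashkil_ratio gives a rough indication: a fully vocalized text should score well above zero, while a result close to zero would point to the diacritics having been lost.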

Shukran,
Maher
