Re: [tesseract-ocr] Incorrect recognition of Latin words inside Arabic text

Zdenko Podobny

Sep 2, 2022, 2:35:29 PM

to tesser...@googlegroups.com

Please stop abusing the tesseract forum. Why are you sending the same email again and again?

Zdenko

On Fri, Sep 2, 2022 at 20:24, Naourass Derouichi <naou...@gmail.com> wrote:

Hi all, I'm trying to OCR images similar to the attached one, but the error rate on Latin words is too high.

I tried all PSMs with the following models from tessdata_best: ara, eng, fra, Ara (in different orders). I even tried fine-tuning them on the font used in the input images.

Sample output (error in bold):
قرارلمجلس المنافسة عدد 0028/ق/2022 صبادر25 من شعبان 1443
(28 مارس 2022) والمتعلق بتولي الشركة القابضة للمساهمات
والاستثمارات «11010108-:2م1]» للمر اقبة المشتركة على شركة
‎«CMGP Group Sa»‏ وذلك عبراقتناء نسبة14,81 96 من أسيم
رأسمالها وحقوق التصويت المرتبطة به.

The results often have incorrect recognition of Latin words. Is there any solution to this issue?
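For anyone who wants to reproduce the sweep described above (every language order at each page segmentation mode), here is a minimal sketch. It only uses the standard tesseract command line (`-l` with `+`-joined languages, `--psm`, and `stdout` as the output base). The function name `build_commands`, the file name `page.png`, and the PSM subset are placeholders chosen for illustration, and it assumes the tessdata_best models are already on Tesseract's data path.

```python
import itertools
import os
import shutil
import subprocess


def build_commands(image_path, models, psms):
    """Return one tesseract argv list per (language order, PSM) pair.

    `models` are traineddata names such as "ara" or "eng"; every ordering
    is tried because the first language listed is treated as the primary one.
    """
    commands = []
    for order in itertools.permutations(models):
        lang = "+".join(order)  # e.g. "ara+eng+fra"
        for psm in psms:
            commands.append(
                ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]
            )
    return commands


if __name__ == "__main__":
    image = "page.png"  # hypothetical input image
    cmds = build_commands(image, ["ara", "eng", "fra"], [3, 4, 6])
    can_run = shutil.which("tesseract") is not None and os.path.exists(image)
    for cmd in cmds:
        print(" ".join(cmd))
        if can_run:
            # Print the recognized text so the runs can be compared by eye.
            result = subprocess.run(cmd, capture_output=True, text=True)
            print(result.stdout)
```

If tesseract or the image is missing, the script just prints the commands it would have run, so the list can be checked before a long batch.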


Naourass Derouichi

Sep 2, 2022, 2:52:37 PM

to tesseract-ocr

Sorry everyone, I didn't know that an email is sent out for each new post. I forgot to attach the image and couldn't find a way to edit the post, so I created a new one. This is my first time using this forum. Have a good day :)
