Extender letter recognized as underline in Arabic text


Sifdin Nahhas

Nov 20, 2023, 9:36:57 AM
to tesseract-ocr
Hey guys,
I have a problem where Tesseract removes the Arabic extender letter "ـ" because it recognizes it as an underline, as in the images below.
I think it is caused by some configuration variable, but I could not find the responsible one.

Appreciate the help.
sample_5.png
src.processed.png

Dellu Bw

Nov 20, 2023, 9:44:52 AM
to tesser...@googlegroups.com
Can you try removing it from the punctuation list?

To do that, you need to extract the components of the traineddata file, edit the ara.punc file, and then recombine them. 

To extract the components: combine_tessdata -u ara.traineddata ara. (the -d option only lists the components; -u unpacks them)
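
The extract, edit, and recombine steps could be scripted roughly as follows. This is a minimal sketch rather than a tested recipe: it assumes the Tesseract training tools are installed and `combine_tessdata` is on the PATH, and the helper names (`list_cmd`, `unpack_cmd`, `overwrite_cmd`, `run`) are made up for illustration. As far as I know, `-d` lists the components, `-u` unpacks them to a path prefix, and `-o` writes edited components back into an existing traineddata file.

```python
import subprocess


def list_cmd(traineddata):
    """Build the command that lists the components inside a traineddata file."""
    return ["combine_tessdata", "-d", traineddata]


def unpack_cmd(traineddata, prefix):
    """Build the command that unpacks every component to <prefix><component>."""
    return ["combine_tessdata", "-u", traineddata, prefix]


def overwrite_cmd(traineddata, *component_files):
    """Build the command that writes edited component files back into the model."""
    return ["combine_tessdata", "-o", traineddata, *component_files]


def run(cmd):
    """Execute one of the commands above; raises CalledProcessError on failure."""
    subprocess.run(cmd, check=True)


# Intended order (not executed here; work on a copy of the model):
#   run(list_cmd("ara.traineddata"))
#   run(unpack_cmd("ara.traineddata", "ara."))
#   ... edit the punctuation component ...
#   run(overwrite_cmd("ara.traineddata", "ara.lstm-punc-dawg"))
# The component name "ara.lstm-punc-dawg" is an assumption; use whatever name
# the listing step actually reports for the punctuation data.
```

Note that the punctuation data may be stored as a compiled DAWG rather than a plain-text file, in which case it would need converting to text before editing and back afterwards.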



Sifdin Nahhas

Dec 4, 2023, 8:12:21 AM
to tesseract-ocr
The extender character is not in ara.punc.
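
One way to double-check this is to search the punctuation list for the extender character (U+0640, ARABIC TATWEEL) by code point instead of by eye. A minimal sketch, assuming the punctuation patterns are available as a UTF-8 text file with one pattern per line; the function name and the `ara.punc.txt` path are hypothetical. If the model only contains a compiled DAWG, it would first have to be converted to text, for example with the `dawg2wordlist` training tool.

```python
from pathlib import Path

TATWEEL = "\u0640"  # ARABIC TATWEEL, the extender character "ـ"


def lines_with_tatweel(path):
    """Return (line_number, line) pairs for every line containing U+0640."""
    text = Path(path).read_text(encoding="utf-8")
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if TATWEEL in line
    ]


# Example call (hypothetical file name):
#   lines_with_tatweel("ara.punc.txt")
```

An empty result would confirm that the character does not appear in the list.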