Problem detected using Tesseract 4 and Arabic data

El Fakir Zakaria

Mar 9, 2017, 8:11:49 AM
to tesseract-ocr
I noticed that Tesseract 4 reads الأ as األ. This is very close: the last two letters just need to be swapped to get ا ل أ. The same thing happens with similar word forms, for example لا is read as ال when it should be ل ا. I would like to correct this.
Can someone show me how to fix it, or perhaps update the Arabic data?
Thank you for your time.

Ray Smith

Mar 29, 2017, 6:34:32 PM
to tesseract-ocr
Thanks for spotting this!
I understand why it makes this error, but it will take some thought to fix it properly!
It is using a sort by x-position to re-order the boxes for RTL language training, but that doesn't work in the case of heavily kerned characters like ل in your example.
It needs to simply reverse the RTL characters, but has to avoid messing up the order of the common script, which is why I was using a sort to begin with.
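
To make the difference between the two orderings concrete, here is a minimal sketch in Python. It is not Tesseract's code: the box representation (a character plus the x-coordinate of its left edge), the function names, and the coordinates are all hypothetical, and the reversal is a simplification rather than the full Unicode bidirectional algorithm (spaces and other neutral characters are not handled properly). It assumes the boxes arrive in logical (reading) order and that the goal is a left-to-right visual order.

```python
import unicodedata

# Bidi classes for right-to-left letters: "R" (e.g. Hebrew) and "AL" (Arabic).
RTL_CLASSES = {"R", "AL"}


def is_rtl(ch):
    """True if the character is a strongly right-to-left letter."""
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def order_by_x(boxes):
    """Left-to-right order obtained by sorting on each box's left edge."""
    return [ch for ch, _ in sorted(boxes, key=lambda box: box[1])]


def order_by_reversal(boxes):
    """Reverse the whole line, then restore each non-RTL run to its original order.

    The x-coordinates are ignored, so kerning cannot change the result.
    """
    out, run = [], []
    for ch, _ in reversed(boxes):
        if is_rtl(ch):
            out.extend(reversed(run))  # flush the pending non-RTL run, un-reversed
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(reversed(run))
    return out


LAM, ALEF = "\u0644", "\u0627"

# Hypothetical boxes in logical order: lam, alef, then the digits 1 and 2.
# The alef is kerned over the lam, so its left edge (32) lies to the right
# of the lam's left edge (30), although it should come out to the lam's left.
boxes = [(LAM, 30), (ALEF, 32), ("1", 10), ("2", 20)]

sorted_order = order_by_x(boxes)           # digits, then lam, then alef
reversed_order = order_by_reversal(boxes)  # digits, then alef, then lam
```

With these made-up coordinates, the x-sort places the lam to the left of the alef, which read right-to-left gives ال instead of لا, the same swap reported above. The reversal keeps the digits in their original order and puts the alef to the left of the lam regardless of kerning.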

El Fakir Zakaria

Mar 29, 2017, 8:13:50 PM
to tesser...@googlegroups.com
Thank you for your attention to this matter. Your work is really important and much appreciated.
