english-arabic dictionary - transliteration text

aum hren

Mar 28, 2024, 10:45:39 AM

to tesseract-ocr

olo company

i am trying to ocr an old (1963) moroccan arabic - english dictionary

i have tried jTessBoxEditor for ocr and somehow managed to follow the info on the net,

but at the very end tesseract failed to make the final _traineddata_ files

my problem is

the book (dictionary) is basically in english, so i used the eng file for ocr-ing

but there is also transliteration text, which includes characters that are not present in english,

although they are latin script

i tried to train the tesseract for those characters, but failed

i.e. following this video:

https://www.youtube.com/watch?v=8GdcyknL1ls

the other info i could find is also a bit confusing

the characters i was trying to train are letters

g z d h r t s l - with dots below and above, plus

š ž and a weird semi question mark
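For building ground truth or a training text, it helps to know exactly which Unicode code points these dotted letters correspond to. The following is only a sketch: it assumes every one of the eight letters may occur with both a dot above and a dot below (the post does not say which combinations the dictionary actually uses), and the names `BASE_LETTERS` and `dotted_variants` are made up for illustration. Some combinations have a single precomposed code point, others can only be written as base letter plus combining mark.

```python
import unicodedata

BASE_LETTERS = "gzdhrtsl"   # the letters listed in the post
DOT_ABOVE = "\u0307"        # COMBINING DOT ABOVE
DOT_BELOW = "\u0323"        # COMBINING DOT BELOW


def dotted_variants(letters=BASE_LETTERS):
    """Return {letter: {"above": str, "below": str}} in NFC form.

    Assumption: every letter may take either dot; trim the result to
    the combinations that really appear in the book.
    """
    variants = {}
    for ch in letters:
        variants[ch] = {
            "above": unicodedata.normalize("NFC", ch + DOT_ABOVE),
            "below": unicodedata.normalize("NFC", ch + DOT_BELOW),
        }
    return variants


if __name__ == "__main__":
    for base, forms in dotted_variants().items():
        for position, text in forms.items():
            # A result with two code points has no precomposed form.
            code_points = " ".join(f"U+{ord(c):04X}" for c in text)
            print(base, position, text, code_points)
```

Keeping ground truth in one normalization form (NFC here) avoids the same visible letter being encoded two different ways in the training data.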

transliteration script is also _italic_

with the help of libre office writer and some trial & error i also managed to identify a close approximation of the transliteration font (Latin Modern Roman Unslanted)

can somebody versed in tesseract-ocr training help me train (or do the ocr) for those letters/characters ?

attached are:

- my train script / font image (font - latin modern roman unslanted)

- a page from a dictionary which includes most of the characters i am trying to ocr

the dictionary has 500+ pages; half is english - moroccan arabic, the other half is moroccan arabic - english, so proper ocr would be truly appreciated

thank you for your help

have fun

aum

page-294a.png

FONT.tiff

aum hren

Mar 28, 2024, 5:34:25 PM

to tesseract-ocr

with the help of this webpage:

https://towardsdatascience.com/simple-ocr-with-tesseract-a4341e4564b6

i did manage - with enormous addition of improvisation, trial & error, stubbornness and blind LMB-clicking, including another 2 failed attempts - to produce xyz.traineddata

i ran

$ tesseract -l xyz list.txt a.new.txt

and got catastrophic ocr results, far worse than with plain eng.traineddata, which actually did a fairly good job: all english text is ocr-ed correctly, and the italic transliteration text is also ocr-ed well up to a point, with the exception of the above-mentioned characters (those that are not in the english alphabet)
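A comparison like "far worse than eng" can be put into numbers with a character error rate (CER): the edit distance between a hand-corrected reference line and the OCR output, divided by the reference length. The sketch below is a plain-Python illustration, not part of tesseract; the function names are hypothetical and the sample strings are invented.

```python
import unicodedata


def edit_distance(a, b):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate of hypothesis against a non-empty reference."""
    # Normalize so precomposed and combining forms compare as equal.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Invented example: the dot below the d was lost by the OCR.
    print(cer("\u1e0dar", "dar"))
```

Running this over the same few transcribed lines for each model (eng, the custom model, any fine-tuned one) gives a like-for-like measure of whether a training attempt helped.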

oh well, i guess _manually_ is the way to fix those ...

but if somebody knows how to improve ocr to the point where those dotted characters are also recognized, it would make this world a much better place

have fun

aum

Tom Morris

Mar 29, 2024, 11:46:35 AM

to tesseract-ocr

Rather than using random web resources, I'd suggest using the official documentation. The most relevant section is probably this:

https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#fine-tuning-for--a-few-characters

I would suggest starting with script/Latin for your base model, which will at least give you š ž to start with. In addition to the consonants with dots above and below, it looks like there's also a funny Epsilon style character that you may want to train (perhaps similar to https://unicodeplus.com/U+0190).
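To pin down which code point an unusual glyph should be transcribed as, it can help to print a few candidates next to their official Unicode names and compare them with the scan. In this sketch only U+0190 comes from the thread; the other three code points are guesses added for illustration, and nothing here confirms which one the dictionary uses.

```python
import unicodedata

# U+0190 is the code point linked above; the rest are assumed candidates
# (lowercase open e, glottal stop, pharyngeal fricative).
CANDIDATES = [0x0190, 0x025B, 0x0294, 0x0295]


def describe(code_points):
    """Return (label, character, Unicode name) for each code point."""
    return [
        (f"U+{cp:04X}", chr(cp), unicodedata.name(chr(cp), "<unnamed>"))
        for cp in code_points
    ]


if __name__ == "__main__":
    for label, char, name in describe(CANDIDATES):
        print(label, char, name)
```

Whichever code point is chosen, it should be used consistently throughout the ground truth so the model sees one label per glyph.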

You may also want to think about whether it'd be better to train with synthetic rendered lines of text or line images chopped out of your page scans with associated ground truth text. If you decide to go with the latter approach, looking at what the Fraktur OCR project did for training may be useful https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR
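For the line-image approach, tesstrain expects each line image (`.png`, `.bin.png`, `.nrm.png` or `.tif`) to sit next to a single-line transcription with the same base name and the extension `.gt.txt`. A small check like the one below can catch missing pairs before training starts. It is a sketch only: the function names and the folder name in the usage line are hypothetical.

```python
from pathlib import Path

# Longest suffixes first so "x.bin.png" is not read as stem "x.bin".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")
GT_SUFFIX = ".gt.txt"


def stem_for(name, suffixes):
    """Return name without the first matching suffix, or None."""
    for suffix in suffixes:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return None


def find_unpaired(gt_dir):
    """Return (images lacking a .gt.txt, .gt.txt files lacking an image)."""
    images, texts = set(), set()
    for path in Path(gt_dir).iterdir():
        if not path.is_file():
            continue
        text_stem = stem_for(path.name, (GT_SUFFIX,))
        image_stem = stem_for(path.name, IMAGE_SUFFIXES)
        if text_stem is not None:
            texts.add(text_stem)
        elif image_stem is not None:
            images.add(image_stem)
    return sorted(images - texts), sorted(texts - images)


if __name__ == "__main__":
    import sys
    # Usage (hypothetical folder): python check_gt.py data/xyz-ground-truth
    missing_text, missing_image = find_unpaired(sys.argv[1])
    print("images without transcription:", missing_text)
    print("transcriptions without image:", missing_image)
```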