olo company
i am trying to ocr an old (1963) moroccan arabic - english dictionary
i have tried jTessBoxEditor for the training and more or less managed to follow the info on the net,
but at the very end tesseract failed to produce the final _traineddata_ files
my problem is:
the book (dictionary) is basically in english, so i used the eng language file for the ocr
but there is also transliteration text, which includes characters that are not present in english
although they are latin script
i tried to train tesseract for those characters, but failed
ie from this link:
the other info i could find is also a bit confusing
the characters i was trying to train are letters
g z d h r t s l - with dots below and above, plus
š ž and a weird semi question mark
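to pin down which unicode characters these actually are, here is a small python sketch that builds the candidate list. it is only a sketch under assumptions: i do not know which letters in the book take the dot below and which the dot above, so it simply generates both for every letter, and the two code points for the "semi question mark" (U+0294 and U+0295) are guesses. the function names are made up for illustration.

```python
import unicodedata

BASE_LETTERS = "gzdhrtsl"
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE


def dotted_variants(base_letters=BASE_LETTERS):
    """Return every base letter with a dot below and with a dot above.

    NFC normalization gives the single precomposed character where
    Unicode has one (e.g. d + dot below -> U+1E0D) and leaves the
    letter + combining mark pair otherwise.
    """
    chars = []
    for letter in base_letters:
        for mark in (DOT_BELOW, DOT_ABOVE):
            chars.append(unicodedata.normalize("NFC", letter + mark))
    return chars


def candidate_charset():
    """Dotted letters plus the other characters mentioned in the post."""
    # assumption: the "semi question mark" is one of these two letters
    guesses = ["\u0294", "\u0295"]
    return dotted_variants() + ["š", "ž"] + guesses


if __name__ == "__main__":
    # print each candidate with its code point(s) for comparison with the scan
    for ch in candidate_charset():
        codepoints = " ".join(f"U+{ord(c):04X}" for c in ch)
        print(ch, codepoints)
```

the printed code points can be compared against the page scan, and the combinations that never occur in the book can be dropped before putting the rest into the training text.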
transliteration script is also _italic_
with the help of LibreOffice Writer and some trial & error i also managed to identify a close approximation of the transliteration font (Latin Modern Roman Unslanted)
can somebody versed in tesseract-ocr training help me train (or do the ocr) for those letters/characters?
attached are:
- my train script / font image (font - latin modern roman unslanted)
- a page from a dictionary which includes most of the characters i am trying to ocr
the dictionary has 500+ pages; half is english - moroccan arabic, the other half is moroccan arabic - english, so a proper ocr would be truly appreciated
thank you for your help
have fun
aum