Tesseract OCR 4.0.0 Alpha: how to train a new font

Anand Akella

Aug 29, 2017, 2:49:10 AM

to tesseract-ocr

Hi,

I'm new to Tesseract and have a PDF file containing text with diacritical marks. I ran Tesseract 4.0.0 with language eng, but it is not able to recognize the text with diacritical marks. I found a font that supports the diacritical marks:

Gandhari Unicode 5.1

I extracted the font files and copied them to /home/tesseract/Downloads/fonts

Whenever I run tesstrain.sh, it fails with the error "Could not find font named gandhariunicode":

./tesstrain.sh --fontlist 'gandhariunicode' --fonts_dir /home/tesseract/Downloads/fonts/ --lang eng --langdata_dir /usr/local/share/tessdata/ --overwrite

=== Starting training for language 'eng'
[Mon Aug 28 23:18:12 PDT 2017] /usr/local/bin/text2image --fonts_dir=/home/tesseract/Downloads/fonts/ --font=gandhariunicode --outputbase=/tmp/font_tmp.C9vSySTfge/sample_text.txt --text=/tmp/font_tmp.C9vSySTfge/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.C9vSySTfge
Could not find font named gandhariunicode.
Pango suggested font Gandhari Unicode.
Please correct --font arg.

=== Phase I: Generating training images ===
ERROR: Could not find training text file /usr/local/share/tessdata//eng/eng.training_text

What could the issue be? Please let me know. Thanks in advance.

Thanks,

Anand

ShreeDevi Kumar

Aug 29, 2017, 2:53:49 AM

to tesser...@googlegroups.com

First try best/Latin.traineddata; that should handle text with diacritics.

-----------

>>Pango suggested font Gandhari Unicode.

Use "Gandhari Unicode" (within quotes, with the space) as the font name in --fontlist, exactly as Pango suggests.
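
As a rough sketch of how to find the exact name to pass, the helper below reads a list of font names and prints the one that matches a query when case and spaces are ignored. The function name find_font_name is hypothetical, and the "index: name" line format it strips is an assumption about what text2image --list_available_fonts prints; adjust it if your output looks different.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read font names on stdin (one per line, optionally
# prefixed by an index such as "  0: ") and print the name that equals the
# query once spaces are removed and case is ignored.
find_font_name() {
  local want key name
  want=$(printf '%s' "$1" | tr -d ' ' | tr '[:upper:]' '[:lower:]')
  sed -E 's/^ *[0-9]+: *//' | while IFS= read -r name; do
    key=$(printf '%s' "$name" | tr -d ' ' | tr '[:upper:]' '[:lower:]')
    if [ "$key" = "$want" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Possible usage (requires text2image; the fonts directory is the one from
# the original post):
#   text2image --list_available_fonts \
#     --fonts_dir=/home/tesseract/Downloads/fonts/ | find_font_name gandhariunicode
# The printed name can then be passed, quoted, to --fontlist.
```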

>>ERROR: Could not find training text file /usr/local/share/tessdata//eng/eng.training_text

Point --langdata_dir to the langdata folder where you have your training text; the script looks for eng/eng.training_text under that folder.
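
A minimal sketch of a pre-flight check for this second error, assuming the path layout shown in the error message (langdata folder, then eng/eng.training_text). The function name check_training_text and the $HOME/langdata location are hypothetical; substitute wherever your langdata checkout actually lives.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that <langdata_dir>/<lang>/<lang>.training_text
# exists before starting training, mirroring the path in the error message.
check_training_text() {
  local langdata_dir="$1" lang="$2"
  # Strip one trailing slash so the printed path has no "//" in it.
  local path="${langdata_dir%/}/${lang}/${lang}.training_text"
  if [ -f "$path" ]; then
    echo "found: $path"
  else
    echo "missing: $path" >&2
    return 1
  fi
}

# Possible usage ($HOME/langdata is an assumed location, not from the thread):
#   check_training_text "$HOME/langdata" eng && \
#     ./tesstrain.sh --fontlist "Gandhari Unicode" \
#       --fonts_dir /home/tesseract/Downloads/fonts/ \
#       --lang eng --langdata_dir "$HOME/langdata" --overwrite
```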

ShreeDevi
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com


shree

Sep 5, 2017, 2:13:13 AM

to tesseract-ocr

Try san_latn.traineddata from https://github.com/Shreeshrii/tessdata4alpha/tree/master/best
