Hi,
I'm new to Tesseract and have a PDF file containing text with diacritical marks. I ran Tesseract 4.0.0 with language eng, but it does not recognize the text with diacritical marks. I found a font that includes these diacritical marks.
I extracted the font files and copied them to /home/tesseract/Downloads/fonts
Whenever I try to run tesstrain.sh, it gives me the error "Could not find font named gandhariunicode". Here is the command and its output:
./tesstrain.sh --fontlist 'gandhariunicode' --fonts_dir /home/tesseract/Downloads/fonts/ --lang eng --langdata_dir /usr/local/share/tessdata/ --overwrite
=== Starting training for language 'eng'
[Mon Aug 28 23:18:12 PDT 2017] /usr/local/bin/text2image --fonts_dir=/home/tesseract/Downloads/fonts/ --font=gandhariunicode --outputbase=/tmp/font_tmp.C9vSySTfge/sample_text.txt --text=/tmp/font_tmp.C9vSySTfge/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.C9vSySTfge
Could not find font named gandhariunicode.
Pango suggested font Gandhari Unicode.
Please correct --font arg.
=== Phase I: Generating training images ===
ERROR: Could not find training text file /usr/local/share/tessdata//eng/eng.training_text
What could the issue be? Please let me know. Thanks in advance.
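Two things in the log above may be worth checking, sketched below as a possible retry rather than a confirmed fix. First, Pango suggests the font name "Gandhari Unicode" instead of "gandhariunicode", and text2image can list the names it sees in the fonts directory. Second, the training text is looked up under --langdata_dir, which here points at the tessdata directory. The LANGDATA_DIR path below is hypothetical and assumes a local copy of the tesseract-ocr langdata repository; the other paths and flags are taken from the command above.

```shell
#!/bin/sh
# Paths from the post; LANGDATA_DIR is a hypothetical location (assumption).
FONTS_DIR=/home/tesseract/Downloads/fonts
FONT_NAME='Gandhari Unicode'            # name suggested by Pango in the log
LANGDATA_DIR=/home/tesseract/langdata   # assumed copy of the langdata repository

# 1. List the font names text2image can see in the fonts directory.
if command -v text2image >/dev/null 2>&1; then
  text2image --fonts_dir="$FONTS_DIR" --list_available_fonts
fi

# 2. Retry training only if the training text reported missing is present.
TRAINING_TEXT="$LANGDATA_DIR/eng/eng.training_text"
if [ -f "$TRAINING_TEXT" ]; then
  ./tesstrain.sh --fontlist "$FONT_NAME" --fonts_dir "$FONTS_DIR" \
    --lang eng --langdata_dir "$LANGDATA_DIR" --overwrite
else
  echo "Missing $TRAINING_TEXT"
fi
```

If the listing shows a different spelling of the font name, that exact string would be the one to pass to --fontlist.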
Thanks,
Anand