In the language file srp_latn.traineddata (Serbian Latin) there is a config line
tessedit_load_sublangs srp
which means that tesseract also loads the srp (Serbian Cyrillic) language file.
As a result, some of the text is recognized as Cyrillic, even if the original text contains no Cyrillic script at all!
Can this option be disabled in any way, or can new language files be provided without the "load sublangs" part?
$ combine_tessdata -o srp_latn.traineddata srp_latn.config
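Assuming srp_latn.config is the extracted config with the tessedit_load_sublangs line deleted, that will overwrite the config stored in the traineddata file and so remove the problematic line (you might want to copy srp_latn to srp_latn_only or some other name if you'd like both behaviors available to you).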
Tom
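For anyone following along, the whole round trip around that command might look like the sketch below. It assumes the extract (-e) and overwrite (-o) modes of combine_tessdata; the function names strip_sublangs and make_latin_only and the srp_latn_only language name are hypothetical, and this has not been checked against a real srp_latn.traineddata.

```shell
# Hypothetical helper: print a config file without its tessedit_load_sublangs line.
strip_sublangs() {
  sed '/^[[:space:]]*tessedit_load_sublangs/d' "$1"
}

# Hypothetical wrapper: copy a traineddata file under a new language name
# and rewrite the copy's embedded config without the sublangs line.
# $1 = source traineddata file, $2 = new language name (no extension)
make_latin_only() {
  src="$1"
  dst="$2"
  cp "$src" "$dst.traineddata" &&
  # Extract the config component from the copy.
  combine_tessdata -e "$dst.traineddata" "$dst.config" &&
  # Drop the line that pulls in the Cyrillic model.
  strip_sublangs "$dst.config" > "$dst.config.new" &&
  mv "$dst.config.new" "$dst.config" &&
  # Write the edited config back into the copy.
  combine_tessdata -o "$dst.traineddata" "$dst.config"
}
```

Run from inside the tessdata directory, something like `make_latin_only srp_latn.traineddata srp_latn_only` should leave the original untouched, and the copy could then be selected with `-l srp_latn_only`.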
Tried that, did not work for me either :)
Sorry, I tried the -c tessedit_load_sublangs="" option, which did not work.