limiting tesseract to one language


Bojan Djuric

Mar 6, 2016, 7:21:26 AM
to tesseract-ocr
In the language file srp_latn.traineddata (Serbian Latin) there is a line
tessedit_load_sublangs srp
which means that tesseract also loads the srp (Serbian Cyrillic) language file.
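To check whether a given language file carries that line, one option is to extract its config component and search it. This is a minimal sketch, assuming combine_tessdata from the Tesseract training tools is on the PATH; the function name show_sublangs and the temporary file name lang.config are hypothetical.

```shell
# Hypothetical helper: print any tessedit_load_sublangs line
# embedded in a traineddata file.
show_sublangs() {
    data="$1"                          # e.g. srp_latn.traineddata
    cfg="$(mktemp -d)/lang.config"     # the .config suffix selects the config component
    # Extract only the config component into the temporary file.
    combine_tessdata -e "$data" "$cfg" >/dev/null || return 1
    # Print the sublanguage line, if there is one.
    grep '^tessedit_load_sublangs' "$cfg"
}
```

Called as show_sublangs srp_latn.traineddata, it prints the line if present and returns a non-zero status otherwise.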

As a result, some of the text is recognized as Cyrillic, even if the original text contains no Cyrillic script at all!

Can this option be disabled in any way, or new language files provided without the "load sublangs" part?

(The older version of this language file did not have this line.)

Thank you.

zdenko podobny

Mar 6, 2016, 11:45:20 AM
to tesser...@googlegroups.com
Could you please open an issue in the tessdata part of the project [1] and provide a (simple) test image?

Thanks,


Zdenko


Tom Morris

Mar 6, 2016, 2:27:27 PM
to tesseract-ocr
On Sunday, March 6, 2016 at 7:21:26 AM UTC-5, Bojan Djuric wrote:
In the language file srp_latn.traineddata (Serbian Latin) there is a line
tessedit_load_sublangs srp
which means that tesseract also loads the srp (Serbian Cyrillic) language file.

As a result, some of the text is recognized as Cyrillic, even if the original text contains no Cyrillic script at all!

Can this option be disabled in any way, or new language files provided without the "load sublangs" part?

I was hoping you'd be able to override that on the command line, using -c tessedit_load_sublangs="", but that doesn't seem to work with the current order of evaluation, at least with my limited testing.

If you have the training tools installed, you can patch your copy of the language file by doing the following:

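# Extract the config component from the language file
$ combine_tessdata -e srp_latn.traineddata srp_latn.config
# Empty the extracted config
$ cp /dev/null srp_latn.config
# Write the emptied config back into the language file
$ combine_tessdata -o srp_latn.traineddata srp_latn.config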


That empties the config, which removes the problematic line along with anything else in it (you might want to copy srp_latn to srp_latn_only or some other name if you'd like both behaviors available to you).
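The workaround above could be wrapped up so that the original file stays untouched and only the sublanguage line is dropped, rather than the whole config. This is a sketch under assumptions, not a tested recipe: the function name make_latin_only is hypothetical, combine_tessdata is assumed to be on the PATH, and the renamed copy is assumed to behave like the original.

```shell
# Hypothetical helper: build a copy of a traineddata file whose
# embedded config no longer contains tessedit_load_sublangs.
make_latin_only() {
    src="$1"    # e.g. srp_latn.traineddata
    dst="$2"    # e.g. srp_latn_only.traineddata
    cfg="${dst%.traineddata}.config"

    # Work on a copy so both behaviors stay available.
    cp "$src" "$dst" || return 1
    # Extract the embedded config from the copy.
    combine_tessdata -e "$dst" "$cfg" || return 1
    # Keep every line except the sublanguage one; grep exits 1
    # when nothing is left, which is acceptable here.
    grep -v '^tessedit_load_sublangs' "$cfg" > "$cfg.tmp" || true
    mv "$cfg.tmp" "$cfg"
    # Write the edited config back into the copy.
    combine_tessdata -o "$dst" "$cfg"
}
```

After make_latin_only srp_latn.traineddata srp_latn_only.traineddata, the copy would be selected with -l srp_latn_only.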


Tom

Bojan Djuric

Mar 7, 2016, 3:39:42 AM
to tesseract-ocr
Tried that, did not work for me either :)
A workaround could be to copy the srp (Cyrillic) and osd files to another folder and use the --tessdata-dir parameter.
But that would complicate things.
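One possible reading of this workaround is a separate data folder that holds srp_latn and osd but not srp, so the Cyrillic sublanguage cannot be found. The sketch below follows that reading only; it is untested, the function name ocr_latin_only and its arguments are hypothetical, and how tesseract reacts to a missing sublanguage (and whether --tessdata-dir expects the folder itself or its parent) may vary by version.

```shell
# Hypothetical helper: run OCR against a private data folder
# that deliberately lacks srp.traineddata.
ocr_latin_only() {
    src_dir="$1"     # existing tessdata folder
    work_dir="$2"    # new folder without the Cyrillic file
    image="$3"       # input image
    out="$4"         # output base name

    mkdir -p "$work_dir" || return 1
    # Copy only the files that should be loadable.
    cp "$src_dir/srp_latn.traineddata" "$src_dir/osd.traineddata" "$work_dir/" || return 1
    # Point tesseract at the reduced folder.
    tesseract "$image" "$out" --tessdata-dir "$work_dir" -l srp_latn
}
```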

Tom Morris

Mar 7, 2016, 11:18:36 AM
to tesser...@googlegroups.com
On Mon, Mar 7, 2016 at 3:39 AM, Bojan Djuric <dboj...@gmail.com> wrote:
Tried that, did not work for me either :)

I mentioned two things. Which one(s) did you try? If you tried editing/replacing the config file in srp_latn.traineddata and it didn't work, can you provide more details on your exact steps and the results?

Bojan Djuric

Mar 8, 2016, 6:03:31 AM
to tesseract-ocr

Sorry, I tried the -c tessedit_load_sublangs="" option, which did not work.

Tom Morris

Mar 8, 2016, 11:03:46 AM
to tesser...@googlegroups.com
On Tue, Mar 8, 2016 at 6:03 AM, Bojan Djuric <dboj...@gmail.com> wrote:

Sorry, I tried the -c tessedit_load_sublangs="" option, which did not work.

Yes, I said that didn't work. I'd suggest trying the workaround that I said would work, namely, unpacking the config file from srp_latn.traineddata, editing it to remove the offending line, and repacking it. The necessary commands are in my original message below.

Tom