How does Tesseract work with multi-language text?

Layne Wang

Jun 7, 2018, 4:36:03 AM

to tesseract-ocr

Hi,

I'm working on segmenting different languages from an image, so I wonder how Tesseract chooses the output character when multiple languages are given on the command line.

What I know so far:

The LSTM models in the traineddata files differ between languages, so I cannot easily combine the traineddata files.
The order of the languages in the command matters. For example, -l eng+fra and -l fra+eng give different results. The first language passed is treated as the primary one, which affects the output spacing.
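
To make the ordering point concrete, here is a minimal Python sketch that builds the two command lines so their outputs can be compared. The helper names build_command and run_ocr and the image name page.png are made up for illustration; the only Tesseract features assumed are the -l option with '+'-joined language codes and the stdout output base.

```python
import shutil
import subprocess
from typing import List, Optional


def build_command(image_path: str, languages: List[str]) -> List[str]:
    """Build a tesseract CLI call; the first language listed is the primary one."""
    if not languages:
        raise ValueError("at least one language is required")
    # Languages are joined with '+' and passed through the -l option.
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def run_ocr(image_path: str, languages: List[str]) -> Optional[str]:
    """Run tesseract and return the recognized text, or None if it is not installed."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_command(image_path, languages),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling run_ocr("page.png", ["eng", "fra"]) and run_ocr("page.png", ["fra", "eng"]) on the same image and diffing the two strings shows where the order changes the result.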

I would like to know:

How does Tesseract choose the output character when the candidates come from different languages? Is it based on the confidence score? And what role does the "primary" language play in generating the output?

Thank you!

Layne

P.S. I posted the same content earlier today but could not see my post in the group. I would appreciate it if someone could tell me why.

ShreeDevi Kumar

Jun 7, 2018, 9:44:04 AM

to tesser...@googlegroups.com

Please see https://github.com/tesseract-ocr/tesseract/issues/1275#issuecomment-360367865 for details of the debug variables you can set to see the values for the different languages.
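
The debug variables from the linked comment are not reproduced here. As a separate and much coarser check, the per-word confidence of the final output can be read from Tesseract's TSV output (for example, tesseract page.png out -l eng+fra tsv, where page.png and out are placeholder names). The sketch below assumes the standard TSV columns (level, conf, text) and a made-up helper name, word_confidences. It only reports the confidence of the word that was finally chosen, not which language model produced it.

```python
import csv
import io
from typing import List, Tuple


def word_confidences(tsv_text: str) -> List[Tuple[str, float]]:
    """Return (word, confidence) pairs from the contents of a Tesseract TSV file."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are words; page, block, paragraph and line rows have conf -1.
        if row.get("level") == "5" and text:
            words.append((text, float(row["conf"])))
    return words
```

Running this on the TSV files produced with eng+fra and with fra+eng gives two lists that can be compared word by word.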
