How does Tesseract work with multi-language text?


Layne Wang

Jun 7, 2018, 4:36:03 AM
to tesseract-ocr
Hi, 

I'm working on segmenting different languages from an image, so I wonder how Tesseract chooses the output character when we give multiple languages on the command line.

So far, what I know: 
  • The LSTM models in the traineddata files differ from language to language, so I cannot easily combine the traineddata.
  • The order of the languages on the command line matters. For example, -l eng+fra and -l fra+eng give different results. The first language passed is set as the primary one, which affects the output spacing.
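
To make the order observation easy to reproduce, here is a minimal sketch that runs the tesseract command line twice with the two language orders and checks whether the text differs. It assumes the tesseract binary is on the PATH with the eng and fra traineddata installed; the file name sample.png and the helper names build_command and run_ocr are hypothetical, chosen only for illustration.

```python
import os
import shutil
import subprocess


def build_command(image_path, languages):
    """Build a tesseract CLI call; languages are joined with '+' in the given order."""
    # "stdout" as the output base tells tesseract to print the text instead of writing a file.
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


def run_ocr(image_path, languages):
    """Run tesseract and return the recognized text (raises if the command fails)."""
    result = subprocess.run(
        build_command(image_path, languages),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    image = "sample.png"  # hypothetical input image
    if shutil.which("tesseract") and os.path.exists(image):
        eng_first = run_ocr(image, ["eng", "fra"])
        fra_first = run_ocr(image, ["fra", "eng"])
        print("identical output:", eng_first == fra_first)
    else:
        print("tesseract or", image, "not found; nothing to compare")
```

This only shows whether the two orders disagree on a given image; it does not explain how the choice between languages is made internally.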
I would like to know:
  • How does tesseract choose the output character when it is in different languages? Is it based on the confidence score? And how does the "primary" play a role in generating the output?
Thank you!
Layne

P.S. I posted the same content earlier today but could not see my post in the group. I would appreciate it if someone could tell me the reason.

ShreeDevi Kumar

Jun 7, 2018, 9:44:04 AM
to tesser...@googlegroups.com

for details of debug variables you can set to see the values of different languages.

