Extracting text in two different languages from a single image using Tesseract

Pankaj Gupta

Aug 13, 2020, 3:15:05 PM

to tesseract-ocr

Dear Team,

My team and I are developing a tool that uses Tesseract to extract text from images, each containing text in a single language. The tool can extract text in 14 different languages with an accuracy above 95%.

We now face a new challenge: some images contain text in more than one language (Japanese and English, or Arabic and English). Due to copyright issues I cannot attach an original image, so I have attached a sample image containing Japanese and English text that represents the actual scenario. I request your support in identifying a technique to extract the text accurately in both languages.

I am using Python 3+, OpenCV, and Tesseract for development.

Thanks in advance.

Regards,

Pankaj Gupta

SampleImage.PNG

Pankaj Gupta

Aug 19, 2020, 3:50:19 AM

to tesseract-ocr

Dear Team,

Waiting for your suggestions. Need your help.

Thank you in advance.

Regards,

Pankaj

Shree Devi Kumar

Aug 19, 2020, 4:10:14 AM

to tesseract-ocr

For multiple languages, the standard invocation is to join the language codes with a + sign.

E.g. -l ara+eng or -l eng+jpn

Alternatively, you can try the script traineddata files, e.g. Devanagari, which includes eng+hin+san+mar+nep.

However, multi-language recognition takes more time and is not perfect.
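
Since the original question mentions Python and OpenCV, here is a minimal sketch of the same invocation from Python. It assumes the pytesseract wrapper is used (the thread does not say which binding is in use) and that the eng and jpn traineddata files are installed. The helper names build_lang_arg and ocr_multilang are made up for illustration. Depending on how the traineddata files are installed, a script model may need to be addressed as script/Devanagari rather than Devanagari.

```python
def build_lang_arg(codes):
    """Join Tesseract language codes into the value passed to -l, e.g. 'eng+jpn'."""
    cleaned = [c.strip() for c in codes if c and c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_multilang(image_path, codes):
    """Run Tesseract on one image with several languages loaded at once."""
    # Imported lazily so build_lang_arg can be used without the OCR dependencies.
    # Assumption: pytesseract and opencv-python are installed.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    # OpenCV loads images as BGR; pytesseract expects RGB.
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    return pytesseract.image_to_string(rgb, lang=build_lang_arg(codes))


# Example call (equivalent to "-l eng+jpn" on the command line):
# text = ocr_multilang("SampleImage.PNG", ["eng", "jpn"])
```

Passing ["ara", "eng"] instead gives the equivalent of -l ara+eng.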

Pankaj Gupta

Aug 19, 2020, 4:35:00 AM

to tesseract-ocr

Thank you for the suggestions.

Pankaj Gupta

Aug 19, 2020, 1:03:29 PM

to tesseract-ocr

Hi Shree,

Thank you for your suggestion. The suggested method improves the pass rate of our test cases, but the extraction of mixed-language text is still not consistent enough. Sometimes Tesseract extracts the characters correctly, but not always.

For example, in one scenario it detects the English letters at the start of a text, but in the next text the English letters at the end are not extracted properly.

We have also identified one more problem: a few of the images contain superscript numbers, and these are not extracted when OCR is applied.

Please suggest.

Devarti Mahakalkar

Dec 8, 2021, 7:25:46 AM

to tesseract-ocr

Hi Pankaj,

Could you please share your approach for using more than one language in Tesseract with good accuracy, if you found one?

Thank you!
