Detecting language automatically

Charles Cho

Mar 19, 2021, 5:17:51 AM
to tesseract-ocr
Hello,
I'm working on an OCR app for Android based on Tesseract.
I want to add a feature that detects the language automatically and recognizes at least two languages at once.
From what I have found so far, Tesseract needs the language to be specified up front.
So how can I implement automatic language detection?
Also, the Tesseract app on the Google Play Store can recognize 3 languages at once.
Is that the maximum?
Any help or advice would be really appreciated.
Thanks.

Merlijn B.W. Wajer

Mar 20, 2021, 8:29:13 PM
to tesser...@googlegroups.com
Hi,

On 19/03/2021 10:11, Charles Cho wrote:
> Hello,
> I'm working on a ocr android app based on tesseract.
> I want to add feature that detects language automatically and recognize
> at least 2 languages at once.
> I have investigated on that for a while so I know that I have to specify
> language for tesseract.
> Then how can I implement auto detection of language?

Not exactly a mobile use case, but you can read how the Internet Archive
does this (I coined it "autonomous mode", where the software just
figures out the scripts and languages):

https://archive.org/services/docs/api/ocr.html#autonomous-mode

The code is available here (I plan to split out the archive.org
specific code from the python code that invokes Tesseract and performs
heuristics like script detection):

https://git.archive.org/www/tesseract/-/blob/master/main.py#L757

The tl;dr is to first perform script detection, then OCR the page using
the detected script, and finally run a language detection library on the
recognized text to guess the languages on the page.

> And tesseract on google play store can recognize 3 languages at once.
> Is it maximum?

I am not sure what you're finding on the Google Play Store, but I have
found no limit on the number of languages that can be used during OCR.
Keep in mind that using more languages will slow down the OCR process.
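
As a small illustration of the multi-language point: Tesseract accepts several language codes joined with '+' (for example eng+fra+deu). The helper below is only a sketch of building such a string; join_languages is a hypothetical name used for illustration, not part of Tesseract.

```cpp
#include <string>
#include <vector>

// Join Tesseract language codes with '+', the separator Tesseract uses
// when several languages are requested at once (e.g. "eng+fra+deu").
// Hypothetical helper, not a Tesseract API.
std::string join_languages(const std::vector<std::string>& codes) {
    std::string joined;
    for (const std::string& code : codes) {
        if (code.empty()) {
            continue;  // skip blank entries instead of emitting "++"
        }
        if (!joined.empty()) {
            joined += '+';
        }
        joined += code;
    }
    return joined;
}
```

The result could then be passed as the language argument of TessBaseAPI::Init, for example api.Init(datapath, join_languages({"eng", "fra"}).c_str()), assuming the matching .traineddata files are present in the tessdata directory.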

> Any help and advice would be really appreciated.

Hope this helps.

Cheers,
Merlijn

Charles Cho

Mar 21, 2021, 10:28:38 PM
to tesseract-ocr
Hi, Merlijn.

Thanks for your kind response.

Regarding autonomous mode, I have been looking for a comparable module for Android,
but I have found nothing so far. I will keep looking.

>I am not sure what you're finding on google play store, but I have found
>there to be no limitation to the amount of languages that can be used
>during OCR. Keep in mind that using more languages will slow down the
>OCR process.

Your response is really helpful.

Best,
Charles.

Charles Cho

Mar 25, 2021, 4:41:42 AM
to tesseract-ocr
Hi,

I have been looking into detecting the language automatically.
I referred to these links. Thank you, Merlijn.

As far as I can tell, it uses the OSD of the Tesseract engine to detect the layout and script.
After detecting the script, it detects the languages written in that script.

So I tried to use the OSD engine mode, building on textfairy, an Android OCR app based on Tesseract 4.1.1.
But it doesn't work, and I can't figure out how to use the OSD engine mode on Android.
I set 'osd' as the language option string, used osd.traineddata, and set 'OEM_OSD_ONLY' as the engine mode.
But it doesn't work.

I hope someone can help me use the OSD engine mode on Android.

Thank you.
Best,
Charles.

shree

Mar 25, 2021, 9:49:10 AM
to tesseract-ocr
See https://github.com/tesseract-ocr/tessdoc/blob/master/examples/OSD_example.cc

    // Get OSD - new code
    int orient_deg;           // detected page orientation in degrees
    float orient_conf;        // confidence of the orientation estimate
    const char* script_name;  // detected script, e.g. "Latin"
    float script_conf;        // confidence of the script estimate
    api->DetectOrientationScript(&orient_deg, &orient_conf, &script_name, &script_conf);
    printf("************\n Orientation in degrees: %d\n Orientation confidence: %.2f\n"
           " Script: %s\n Script confidence: %.2f\n",
           orient_deg, orient_conf,
           script_name, script_conf);

Charles Cho

Mar 25, 2021, 2:04:45 PM
to tesseract-ocr
Hi.

Thank you very much for your kind help, shree.
With your help I was able to detect the script, and it worked. Great.

I have some questions.
1. If a page contains text in several different languages, is there any way to detect all of them? Right now it detects only one.
2. It reports English, German, and French all as 'Latin'. How can I tell these languages apart?

Thanks.
Best,
Charles.

Merlijn B.W. Wajer

Mar 25, 2021, 2:33:26 PM
to tesser...@googlegroups.com
Hi,

On 25/03/2021 19:04, Charles Cho wrote:
> Hi.
>
> Thank you very much for your kind help, shree.
> I tried to detect script by your help and it worked. Great.
>
> I have some questions.
> 1. If the image contains texts of different languages in a page, is there
> any way to detect all of the languages? Now it detects only one language.
> 2. It detects English, German, French as 'Latin'. So how can I distinguish
> the languages exactly?

The OSD module does not detect the language - it detects the script, as
you also noted earlier:

>>> So in my analysis, it used OSD of tesseract engine to detect layout and
>>> script.
>>> After detect script, it detects languages on the script.

What's missing is performing OCR using just the script - and then
analysing the recognized text to detect the language.

You could use something like this: https://github.com/saffsd/langid.c
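
To make that second step concrete, here is a rough sketch of guessing the language from already recognized text. It is a toy stopword counter, not langid.c and not how the Internet Archive code works; the word lists and the name guess_latin_language are assumptions made up for illustration.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Toy stopword lists keyed by Tesseract language code.
// ASSUMPTION: illustrative only; a real detector such as langid.c uses
// trained models instead of a handful of words.
static const std::map<std::string, std::set<std::string>> kStopwords = {
    {"eng", {"the", "and", "is", "of", "to", "in", "that", "with"}},
    {"fra", {"le", "la", "les", "et", "est", "des", "une", "dans"}},
    {"deu", {"der", "die", "das", "und", "ist", "nicht", "ein", "mit"}},
};

// Returns the code whose stopwords occur most often in `text`,
// or an empty string when no stopword matches at all.
// Ties go to the code that sorts first alphabetically.
std::string guess_latin_language(const std::string& text) {
    std::map<std::string, int> hits;
    std::istringstream words(text);
    std::string word;
    while (words >> word) {
        // Keep ASCII letters only, lowercased; punctuation, digits and
        // non-ASCII bytes are dropped.
        std::string clean;
        for (unsigned char c : word) {
            if (c < 128 && std::isalpha(c)) {
                clean += static_cast<char>(std::tolower(c));
            }
        }
        for (const auto& entry : kStopwords) {
            if (entry.second.count(clean) > 0) {
                ++hits[entry.first];
            }
        }
    }
    std::string best;
    int best_hits = 0;
    for (const auto& entry : hits) {
        if (entry.second > best_hits) {
            best = entry.first;
            best_hits = entry.second;
        }
    }
    return best;
}
```

For a page that mixes languages, one option would be to call the guesser per line or per paragraph of OCR output and collect the distinct results, though short fragments will be guessed much less reliably than full paragraphs.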

Regards,
Merlijn

Charles Cho

Mar 25, 2021, 9:55:53 PM
to tesseract-ocr
Hi, 

>>>The OSD module does not detect language - it detect script, as you also
>>>noted earlier:
It detects the language using OSD in Tesseract, and Tesseract also provides the DetectOrientationScript function.

// pix is an already loaded Leptonica image (Pix*).
int orient_deg;
float orient_conf, script_conf;
const char* script_name;
api.Init("/Users/renard/devel/textfairy/tessdata", "osd", tesseract::OcrEngineMode::OEM_DEFAULT);
api.SetPageSegMode(tesseract::PageSegMode::PSM_OSD_ONLY);
api.SetImage(pix);
api.DetectOrientationScript(&orient_deg, &orient_conf, &script_name, &script_conf);

After this, script_name receives the language name and script_conf receives the confidence value.
When I tested several languages, script_name got the following values.
English -> 'Latin'
French -> 'Latin'
German -> 'Latin'
Chinese_Sim -> 'Han'
Chinese_Tra -> 'Han'
Korean -> 'Korean'
Japanese -> 'Japanese'
Russian -> 'Cyrillic'

So the problem is that I want to tell the Latin-script languages apart precisely, and I want to detect several languages at once in a single image.
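
One way to act on the values listed above is to map each reported script name to the language codes worth trying next. The sketch below assumes only the languages tested in this post; candidate_languages is a hypothetical helper, and the table would need extending for other scripts and languages.

```cpp
#include <map>
#include <string>
#include <vector>

// Map a script name reported by DetectOrientationScript to candidate
// Tesseract language codes.
// ASSUMPTION: the table covers only the languages tested in this thread.
std::vector<std::string> candidate_languages(const std::string& script_name) {
    static const std::map<std::string, std::vector<std::string>> kCandidates = {
        {"Latin", {"eng", "fra", "deu"}},
        {"Han", {"chi_sim", "chi_tra"}},
        {"Korean", {"kor"}},
        {"Japanese", {"jpn"}},
        {"Cyrillic", {"rus"}},
    };
    const auto it = kCandidates.find(script_name);
    if (it == kCandidates.end()) {
        return {};  // unknown script: no candidates
    }
    return it->second;
}
```

The candidates for a script can be joined with '+' (for example eng+fra+deu) and used as the language string for a second OCR pass. That narrows the search, but it does not by itself say which of the candidate languages is actually on the page.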

Thanks.
Best,
Charles.

redsto...@163.com

Aug 6, 2022, 6:50:04 AM
to tesseract-ocr
Have you solved the problem?