Hi,
On 19/03/2021 10:11, Charles Cho wrote:
> Hello,
> I'm working on an OCR Android app based on Tesseract.
> I want to add a feature that detects the language automatically and
> recognizes at least 2 languages at once.
> I have looked into this for a while, so I know that I have to specify
> the language for Tesseract.
> Then how can I implement automatic language detection?
Not exactly a mobile use case, but you can read how the Internet Archive
does this (I coined it "autonomous mode", where the software just
figures out the scripts and languages):
https://archive.org/services/docs/api/ocr.html#autonomous-mode
The code is available here (I plan to split the archive.org-specific
code out from the Python code that invokes Tesseract and performs
heuristics like script detection):
https://git.archive.org/www/tesseract/-/blob/master/main.py#L757
The tl;dr is to first perform script detection, then OCR the page with
the detected script, and finally run language detection libraries on
the resulting text to guess the languages on the page.
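
A minimal sketch of those three steps, not the Internet Archive code
itself. It assumes Tesseract 4 or newer on the PATH with osd.traineddata
and the script/*.traineddata models installed, that "--psm 0" prints the
OSD report to stdout, and (for the last step) the third-party langdetect
package. The function names are made up for illustration.

```python
import re
import subprocess


def parse_osd_script(osd_text):
    """Pull (script, confidence) out of Tesseract's OSD report.

    Returns (None, 0.0) when no "Script:" line is present.
    """
    script = re.search(r"^Script:\s*(\S+)", osd_text, re.MULTILINE)
    conf = re.search(r"^Script confidence:\s*([\d.]+)", osd_text, re.MULTILINE)
    if not script:
        return None, 0.0
    return script.group(1), float(conf.group(1)) if conf else 0.0


def detect_script(image_path):
    """Step 1: orientation and script detection only (--psm 0)."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", "0"],
        capture_output=True, text=True, check=True,
    )
    return parse_osd_script(result.stdout)


def ocr_with_script(image_path, script, fallback="eng"):
    """Step 2: OCR with the script model, e.g. "-l script/Latin".

    Assumption: the OSD script name matches a model file name. That is
    not always true (OSD may report "Han" while the models are named
    HanS and HanT), so a real implementation needs a mapping table.
    """
    lang = f"script/{script}" if script else fallback
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def guess_languages(text):
    """Step 3: guess languages from the OCR text with langdetect."""
    from langdetect import detect_langs  # third-party, imported lazily
    return [(guess.lang, guess.prob) for guess in detect_langs(text)]
```

On a page image this would be chained as detect_script(), then
ocr_with_script(), then guess_languages() on the returned text. On
Android the same flow would go through whatever Tesseract wrapper the
app uses rather than the command-line tool.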
> And Tesseract on the Google Play store can recognize 3 languages at once.
> Is that the maximum?
I am not sure what you are seeing on the Google Play store, but I have
not found any limit on the number of languages that can be used during
OCR. Keep in mind that using more languages will slow down the OCR
process.
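
For reference, Tesseract takes several languages as one "-l" value
joined with "+". A small sketch of building that argument (the helper
names are hypothetical, and the language codes must match traineddata
files you actually have installed):

```python
def build_lang_arg(languages):
    """Join language codes into Tesseract's "-l" value, e.g. "eng+deu".

    Duplicates are dropped while keeping the original order.
    """
    unique = []
    for lang in languages:
        if lang not in unique:
            unique.append(lang)
    if not unique:
        raise ValueError("at least one language is required")
    return "+".join(unique)


def tesseract_command(image_path, languages):
    """Build the command line for a multi-language OCR run."""
    return ["tesseract", image_path, "stdout", "-l", build_lang_arg(languages)]
```

Each extra language adds work per page, so it is worth passing only the
languages you expect rather than everything installed.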
> Any help and advice would be really appreciated.
Hope this helps.
Cheers,
Merlijn