Issue with OCR rapidity on short texts

94 views

Skip to first unread message

Quentin MIGNOT

unread,

Sep 10, 2021, 1:41:29 PM9/10/21

to tesseract-ocr

Hello everyone,

I am using Tesseract OCR library in a professional program where speed is quite important. We receive pictures (movie subtitles) containing characters that we need to decode (one possible treatment among many others). However, we have issues when we try to decode longer subtitles or subtitles in chinese language, the library takes too much time. I could use some help to see if there is something to do or configure to improve detection speed, knowing I am working on quite powerfull servers, with a lot of cores.

I wrote a little program to help with my testing. It loads Tesseract, and perform character detection on a picture, then displays the result with the time it took.

I tried it with the three datasets I found on github repo for chinese language :

https://github.com/tesseract-ocr/tessdata

https://github.com/tesseract-ocr/tessdata_best

https://github.com/tesseract-ocr/tessdata_fast

The first works with OEM_TESSERACT_ONLY and OEM_LSTM_ONLY modes, the two others only work wil OEM_LSTM_ONLY.

Here are the result I get with this input picture, which is one of the "difficult" cases we have:

In the first case (normal traineddata + OEM_TESSERACT), the detected text is correct enough, but the time it took is too high :

./testTesseract traineddata/legacy chi_tra Tesseract pngs/test.png

OCR loaded

Picture pngs/test.png loaded : w = 438, h = 160, d = 32. Took 4.1389s, decoded : "銅鐵人，通過我得考慮看看".

In the second case (normal traineddata + OEM_LSTM), the detected text is not as good but much faster.

./testTesseract traineddata/legacy chi_tra LSTM pngs/test.png

OCR loaded

Picture pngs/test.png loaded : w = 438, h = 160, d = 32. Took 0.348507s, decoded : "鍘鐵人，通過我得考慧看看".

In the third case (fast traineddata), the result is catastrophic (but fast)

./testTesseract traineddata/fast chi_tra LSTM pngs/test.png

OCR loaded

Picture pngs/test.png loaded : w = 438, h = 160, d = 32. Took 0.150667s, decoded : "5 折才2起起二".

In the fourth case (best traineddata), the result is quite bad too:

./testTesseract traineddata/best chi_tra LSTM pngs/test.png

Error opening data file traineddata/best/chi_tra_vert.traineddata

Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.

Failed loading language 'chi_tra_vert'

OCR loaded

Picture pngs/test.png loaded : w = 438, h = 160, d = 32. Took 0.379917s, decoded : "全說喲基 E 選 5 所才下二起生".

My questions are :

Is there a reason why the "best" and "fast" training sets perform so poorly ? Maybe I configured something wrong ?
Does Tesseract has a feature for multithreading (I suppose it does not, as I did not find any reference to it online) ?
For the "normal" training set, Tesseract mode is correct but slow. LSTM mode is less correct but much faster. Is there a way to have something in the middle, with another training set or a specific configuration ?

One of the things I tried was using openmp, which is supposed to improve multithreading with Tesseract (which makes me believe there is some sort of multithreading). I recompiled Tesseract, linked it again with our program but did not see any difference with our OCR performances. It is still possible that we failed to configure the omp_thread_limit variable correctly, given the complexity of out build system. My program is compiled with openmp support, but changing the variable in my shell does not seem to change anything... Is there a way to check if it is correctly detected ?

I'd be very glad to get a little bit of help here. My team and I are about to change our "Legacy" OEM_TESSERACT mode for OEM_LSTM, as it is quite easy to change and quite faster. However if there is a better solution, please tell me :)