How to optimize Tesseract for maximum speed when recognizing a single number (several digits)


Jan Pohanka

Jan 29, 2019, 11:48:35 AM
to tesseract-ocr
Hello,

I'm building a simple device that recognizes numbers in pictures taken by a webcam. Everything runs on a Raspberry Pi 3.
The whole program is essentially the following simple loop (in Python for simplicity, but it is the same with the C++ API). The images are preprocessed to black and white:

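import tesserocr
from tesserocr import PyTessBaseAPI

# images: an iterable of preprocessed (black and white) PIL images
api = PyTessBaseAPI(psm=tesserocr.PSM.SINGLE_WORD)

for im in images:
    api.SetImage(im)
    api.SetSourceResolution(70)
    ot = api.GetUTF8Text()  # this call is the slow part

api.End()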

My problem is that the api.GetUTF8Text() call is quite slow, and moreover it gets slower and slower over time. Are there any options to make recognition faster? I have tried resizing the image to around 50x10 px. The times start at around 300 ms but then rise above 1 s, which is too slow for me. I tried both the legacy and the LSTM engines, but they perform similarly.

best regards
Jan

Lorenzo Bolzani

Jan 29, 2019, 12:08:49 PM
to tesser...@googlegroups.com

First, double-check whether the Pi is throttling due to overheating or insufficient USB power. This may cause the slowdown.

A text height of 30 to 50 px is usually fine. If the problem is Tesseract itself, try the fast model (or "normal" if you are using best). I assume you are using the 4.x release.
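A minimal sketch of switching to the fast model from Python. It assumes the eng.traineddata file from the tesseract-ocr/tessdata_fast repository has been downloaded into a local directory, and it relies on the path argument of tesserocr's PyTessBaseAPI to point at that directory. The helper name make_fast_api and the directory in the usage note are hypothetical, and depending on the Tesseract/tesserocr version the path may need a trailing slash.

```python
import os


def make_fast_api(tessdata_dir, lang="eng"):
    """Create a tesserocr API that loads its model from tessdata_dir.

    tessdata_dir is assumed to contain <lang>.traineddata taken from
    the tessdata_fast repository.
    """
    traineddata = os.path.join(tessdata_dir, lang + ".traineddata")
    if not os.path.isfile(traineddata):
        # Fail early with a clear message instead of a Tesseract init error.
        raise FileNotFoundError(traineddata)

    # Imported lazily so the path check above works without tesserocr installed.
    from tesserocr import PyTessBaseAPI, PSM

    return PyTessBaseAPI(path=tessdata_dir, lang=lang, psm=PSM.SINGLE_WORD)
```

Usage would look like api = make_fast_api("/home/pi/tessdata_fast"), followed by the same SetImage / GetUTF8Text loop as before.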

Try tesseract -v to see if you are using all the available CPU optimizations.

Try moving the SetSourceResolution call outside the loop and see if that changes anything (it might be invalidating some caches).

The time you are referring to is one single api.GetUTF8Text() call, correct?


Lorenzo



Jan Pohanka

Jan 29, 2019, 2:40:08 PM
to tesseract-ocr
Thanks for the suggestions. You are right that I'm referring to the api.GetUTF8Text() call; it is my bottleneck.
I was not aware that there are fast and best models in Tesseract 4.0; I will give them a try. So far I have used just lang=eng or osd.
What I find suspicious is that the calls get longer over time. Or, to be more precise, the first 10-15 calls take up to 500 ms and later ones rise above 1 s...
Unfortunately, moving SetSourceResolution outside of the loop makes no difference.

BR
Jan

On Tuesday, January 29, 2019 at 18:08:49 UTC+1, Lorenzo Blz wrote:

Zdenko Podobny

Jan 30, 2019, 1:48:23 AM
to tesser...@googlegroups.com
What is your Tesseract version?

Zdenko


On Tue, Jan 29, 2019 at 20:40, Jan Pohanka <xhpo...@gmail.com> wrote:

Jan Pohanka

Jan 30, 2019, 1:51:18 AM
to tesseract-ocr
It is 4.0. I'm satisfied with the recognition results, but I need to make it faster (consistently below 1 s per call)...

On Wednesday, January 30, 2019 at 7:48:23 UTC+1, zdenop wrote:

Zdenko Podobny

Jan 30, 2019, 1:57:07 AM
to tesser...@googlegroups.com
Search the issue tracker for "speed"...

Zdenko


On Wed, Jan 30, 2019 at 7:51, Jan Pohanka <xhpo...@gmail.com> wrote:

Jan Pohanka

Jan 30, 2019, 2:09:04 AM
to tesseract-ocr
I have already done that but haven't found anything interesting.
I asked here in case there are, e.g., parts of the algorithm that can be disabled. The image is preprocessed and binarized, and contains only 8 digits (and a decimal point). I was also a bit surprised that resizing the image from 400 px to 50 px gave only a slight speedup.

I will try the fast model today (if I find out how to switch to it); maybe it will help.

Here are my measured times (in seconds):
ocr time: 0.980876922607
ocr time: 0.435426950455
ocr time: 0.76907491684
ocr time: 0.836761951447
ocr time: 0.871710062027
ocr time: 0.803520917892
ocr time: 0.371052026749
ocr time: 0.732284069061
ocr time: 0.745162010193
ocr time: 0.836426019669
ocr time: 0.740739107132
ocr time: 0.379159927368
ocr time: 0.798940181732
ocr time: 0.3972260952
ocr time: 0.739762067795
ocr time: 0.7757999897
ocr time: 0.772871017456
ocr time: 0.435608863831
ocr time: 0.770547866821
ocr time: 0.870738983154
ocr time: 0.37126493454
ocr time: 0.837875127792
ocr time: 0.811723947525
ocr time: 0.865257024765
ocr time: 0.79048204422
ocr time: 0.435704946518
ocr time: 0.763910055161
ocr time: 0.391008853912
ocr time: 0.396636009216
ocr time: 0.38174700737
ocr time: 0.809095144272
ocr time: 0.773195028305
ocr time: 0.427488088608
ocr time: 0.403608083725
ocr time: 0.806233167648
ocr time: 0.948635101318
ocr time: 0.900885105133
ocr time: 0.829130887985
ocr time: 0.932774782181
ocr time: 1.09788799286
ocr time: 0.520708799362
ocr time: 0.448786973953
ocr time: 0.560626983643
ocr time: 0.993177175522
ocr time: 0.48442697525
ocr time: 1.1292309761
ocr time: 1.04695606232
ocr time: 0.8810338974
ocr time: 1.10285806656
ocr time: 1.05213713646
ocr time: 1.22593903542
ocr time: 1.04618191719
ocr time: 1.11645102501
ocr time: 1.05435395241
ocr time: 1.15162396431
ocr time: 0.547721862793
ocr time: 0.607867956161
ocr time: 1.14074802399
ocr time: 1.1790971756
ocr time: 1.18815803528
ocr time: 0.58503985405
ocr time: 1.10898280144
ocr time: 1.22723913193
ocr time: 1.2178709507
ocr time: 1.28540086746
ocr time: 1.28237104416
ocr time: 1.56176805496
ocr time: 1.2859480381
ocr time: 1.2599170208
ocr time: 1.42588591576
ocr time: 1.51333785057
ocr time: 1.34276986122
ocr time: 1.34283900261
ocr time: 1.39351201057
ocr time: 1.61450195312
ocr time: 1.44723105431
ocr time: 1.63176107407
ocr time: 0.82429599762
ocr time: 1.08239603043
ocr time: 0.755813121796
ocr time: 1.63984704018
ocr time: 1.84553313255
ocr time: 0.958009958267
ocr time: 1.52479290962
ocr time: 0.919597864151
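The drift in the timings above can be quantified by comparing the average of the first few calls with the average of the last few. A minimal sketch follows. The helper names timed_calls and drift are hypothetical, and the sample values are the first five and last five timings from the list above, rounded to three decimals.

```python
import time
from statistics import mean


def timed_calls(fn, items):
    """Call fn(item) for each item and return the per-call durations in seconds."""
    durations = []
    for item in items:
        start = time.perf_counter()
        fn(item)
        durations.append(time.perf_counter() - start)
    return durations


def drift(durations, window=5):
    """Ratio of the mean of the last `window` durations to the mean of the first `window`."""
    if len(durations) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    return mean(durations[-window:]) / mean(durations[:window])


# First five and last five timings from the list above (rounded).
first_five = [0.981, 0.435, 0.769, 0.837, 0.872]
last_five = [1.640, 1.846, 0.958, 1.525, 0.920]

ratio = drift(first_five + last_five)
print(f"slowdown factor: {ratio:.2f}")
```

A ratio well above 1 that keeps growing during a run points at something outside the OCR loop itself (for example thermal throttling) rather than at the per-image cost of recognition.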

thanks
Jan

On Wednesday, January 30, 2019 at 7:57:07 UTC+1, zdenop wrote:

Lorenzo Bolzani

Jan 30, 2019, 5:15:19 AM
to tesser...@googlegroups.com

Jan Pohanka

Jan 30, 2019, 5:19:43 AM
to tesseract-ocr
You were right, I just found that my RPi is throttling. That explains the slowdown. Now I'm checking whether a heatsink could help.
So I expect that there is nothing to tune in my loop. I will check whether I can try a smaller model.
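A minimal sketch of how throttling can be checked from Python on a Raspberry Pi, using the firmware tool vcgencmd and its get_throttled command, which prints a line such as throttled=0x50005. The helper names decode_throttled and read_throttled are hypothetical, and the thread does not say which method was actually used.

```python
import subprocess

# Bit positions reported by `vcgencmd get_throttled`.
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output):
    """Decode a line such as 'throttled=0x50005' into a list of flag descriptions."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [name for bit, name in sorted(THROTTLE_FLAGS.items())
            if value & (1 << bit)]


def read_throttled():
    """Run vcgencmd (only available on a Raspberry Pi) and decode its output."""
    result = subprocess.run(["vcgencmd", "get_throttled"],
                            capture_output=True, text=True, check=True)
    return decode_throttled(result.stdout)
```

Calling read_throttled() before and after a batch of OCR calls shows whether under-voltage or thermal throttling set in during the run. An empty list means no flag is set.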

best regards
Jan


On Wednesday, January 30, 2019 at 11:15:19 UTC+1, Lorenzo Blz wrote: