pytesseract speed improvement?


Jean-Marc Spaggiari

May 21, 2025, 12:28:52 PM
to tesseract-ocr
Hi,

I'm using tesseract to convert a small picture containing a title into a string. It runs in about one second.
Here is the command line I'm using:
pytesseract.image_to_string(cropped_image, nice=-10, config='--psm 7 --oem 1 -l eng+fra+spa+deu+ita+por+jpn+kor+rus+chi_sim+chi_tra')

I have millions of those small pictures to process. I'm wondering if there is a way to make that faster. Can I keep tesseract in memory and "stream" the pictures to it?  I'm receiving the pictures one by one on a server, so I can't batch them.

I tried removing the -l parameter and it's way faster (98ms), but then the title is totally wrong. I'm wondering whether the time is spent loading those dictionaries, in which case I could pre-load them and keep them in memory, or whether it is mostly processing time.

Thanks,

JMS

Tom Morris

May 22, 2025, 3:48:22 PM
to tesseract-ocr
On Wednesday, May 21, 2025 at 12:28:52 PM UTC-4 jean...@spaggiari.org wrote:
I'm using tesseract to convert a small picture containing a title into a string. It runs in about one second.
Here is the command line I'm using:
pytesseract.image_to_string(cropped_image, nice=-10, config='--psm 7 --oem 1 -l eng+fra+spa+deu+ita+por+jpn+kor+rus+chi_sim+chi_tra')

A small semantic distinction - tesseract and pytesseract are two different things, maintained by different teams.
 
I tried removing the -l parameter and it's way faster (98ms), but then the title is totally wrong. I'm wondering whether the time is spent loading those dictionaries, in which case I could pre-load them and keep them in memory, or whether it is mostly processing time.

Certainly every language model that you add is going to increase processing time, so you only want to load the ones that you really need, but I don't think you have the granularity of control with pytesseract to save significantly on initialization time. It appears to just use command line tesseract running in a subprocess. 

One thing that may cut down on overhead is collecting a batch of images, saving them in a multi-image file format, and then having Tesseract process that file in a single run.

Tom

Jean-Marc Spaggiari

May 22, 2025, 4:36:10 PM
to tesser...@googlegroups.com
Hi Tom,

Thanks for having a look at this. The challenge is that I don't know which of those languages the title is using. 

Let me remove pytesseract from the picture.

If I run tesseract title.jpg stdout --psm 7 --oem 1 -l eng+fra+spa+deu+ita+por+jpn+kor+rus+chi_sim+chi_tra it takes 0.9 seconds and returns the right title ("Advance Scout").

The title is in English.

If I run tesseract title.jpg stdout --psm 7 --oem 1 -l eng+fra+spa+deu it's faster (0,3s) and the title is still correct.
If I run tesseract title.jpg stdout --psm 7 --oem 1 -l eng+fra+spa+deu it's even faster (0.25) but the title is wrong ("AVEO Segue")
If I run tesseract title.jpg stdout --psm 7 --oem 1 -l eng it's crazy fast! (0,09s) but title is wrong again ("clyzinee Segue")
If I use just "deu" it's super fast and correct.

I can't batch the pictures as the client is waiting for the reply before sending the next one.

So I was thinking about running each language separately, in parallel. I'm able to get a reply in 300ms! That's 3 times faster, and it gives me this:
clyzinee Segue
ANVanee Scout
AVEO EU
Advance Scout:
YAVanicc Sco
Advance So ui
eV2pe22)らの016
여00200606 20600ㄷ
Ао\алее Эсодиь
二司多5
和NOU2COCOUUE


But then I don't know which one I should take from those. I see the one from DEU is the good one. But I don't have a way to confirm that in the script.

So multiple questions here.
- Can tesseract work like a shell? I send a picture, I get the text. I send a picture, I get the text. Without ever closing tesseract?
- Can I get the "confidence" level for each of those predictions? It might help to figure which one is the most probable?

Thanks,

JMS






Zdenko Podobny

May 23, 2025, 1:41:54 AM
to tesser...@googlegroups.com

On Thu, May 22, 2025 at 22:35, Jean-Marc Spaggiari <jean...@spaggiari.org> wrote:

TheComplete BookOfMormon

May 23, 2025, 6:57:02 AM
to tesser...@googlegroups.com
In C# I would use Tesseract.net and create an engine once (that has a cost), then I would process each page using the already-created engine. That should at least save *some* processing time.

You'd need to have a process constantly running, so it would either need to watch a folder for incoming images or serve HTTP requests.

To start with, I'd test the speed difference by writing code that does this:
1: Discover all files in a folder
2: Create the engine
3: Start a timer
4: Process each file
5: Stop the timer, and output the elapsed time

Then try creating the engine per file (as part of step 4) and see how that affects the total time. Then decide if it's worth making the change or not.
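
For a Python setup, the steps above could be sketched roughly as follows. This is only a sketch under assumptions: it uses tesserocr (the Python wrapper for the Tesseract API) rather than Tesseract.net, the names time_ocr, make_tesserocr_engine and recognize_file are made up for illustration, and the language list is just one of the combinations tried earlier in this thread.

```python
import time


def time_ocr(paths, make_engine, recognize, reuse_engine=True):
    """Return (elapsed_seconds, texts) for running OCR over `paths`.

    make_engine() builds an OCR engine and recognize(engine, path) returns
    the text. Both are supplied by the caller, so the harness itself does
    not depend on any particular OCR library.
    """
    texts = []
    engine = make_engine() if reuse_engine else None   # step 2: create once
    start = time.perf_counter()                        # step 3: start timer
    for path in paths:                                 # step 4: process files
        current = engine if reuse_engine else make_engine()
        texts.append(recognize(current, path))
        if not reuse_engine:
            # Release per-file engines if they expose End() (tesserocr does).
            end = getattr(current, "End", None)
            if callable(end):
                end()
    elapsed = time.perf_counter() - start              # step 5: stop timer
    return elapsed, texts


def make_tesserocr_engine(lang="eng+fra+spa+deu"):
    """Hypothetical helper: one tesserocr engine, single-line mode, LSTM only."""
    # Imported lazily so the harness above works without tesserocr installed.
    from tesserocr import PyTessBaseAPI, PSM, OEM
    return PyTessBaseAPI(lang=lang, psm=PSM.SINGLE_LINE, oem=OEM.LSTM_ONLY)


def recognize_file(engine, path):
    """Hypothetical helper: OCR one image file with an existing engine."""
    engine.SetImageFile(str(path))
    return engine.GetUTF8Text().strip()
```

To compare, call time_ocr(paths, make_tesserocr_engine, recognize_file, reuse_engine=True) and then again with reuse_engine=False on the same list of files (step 1, for example every .jpg in a test folder). The difference between the two elapsed times is roughly what engine creation costs across the run.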




Tom Morris

May 23, 2025, 1:17:21 PM
to tesseract-ocr
It's odd that you get better results with the German model for English text. That might be worth investigating to see if there's something amiss with your pre-processing or something else.

On Thursday, May 22, 2025 at 4:36:10 PM UTC-4 jean...@spaggiari.org wrote:

If I run tesseract title.jpg stdout --psm 7 --oem 1 -l eng+fra+spa+deu it's faster (0,3s) and the title is still correct.
If I run tesseract title.jpg stdout --psm 7 --oem 1 -l eng+fra+spa+deu it's even faster (0.25) but the title is wrong ("AVEO Segue")

Aren't these two commands the same?
 

So multiple questions here.
- Can tesseract work like a shell? I send a picture, I get the txt. I send a picture, I get the text. Without ever closing tesseract?

Using tesserocr, the Python wrapper for the Tesseract API that Zdenko pointed to, you have full control of the processing and how you decompose it.
 
- Can I get the "confidence" level for each of those predictions? It might help to figure which one is the most probable?

Check out the iterator and confidence examples here: https://tesseract-ocr.github.io/tessdoc/APIExample.html
 
Tom
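
As a rough illustration of both answers together, here is a minimal sketch assuming tesserocr is installed: it keeps one engine per language alive between requests and picks the candidate with the highest mean confidence. MultiLangOcr and pick_best are hypothetical names, the engines run one after another here rather than in parallel, and confidence values from different language models may not be strictly comparable, so treat the selection rule as a heuristic to validate on real titles.

```python
def pick_best(candidates):
    """Return the (lang, text, confidence) tuple with the highest confidence.

    Candidates whose text is empty or whitespace are ignored. Returns None
    if nothing usable is left.
    """
    usable = [c for c in candidates if c[1].strip()]
    return max(usable, key=lambda c: c[2], default=None)


class MultiLangOcr:
    """Hypothetical wrapper: one long-lived tesserocr engine per language."""

    def __init__(self, langs):
        # Imported lazily; tesserocr is a third-party package.
        from tesserocr import PyTessBaseAPI, PSM, OEM
        # Engines are created once here, so the model loading cost is paid
        # at start-up instead of on every image.
        self.engines = {
            lang: PyTessBaseAPI(lang=lang, psm=PSM.SINGLE_LINE, oem=OEM.LSTM_ONLY)
            for lang in langs
        }

    def read(self, pil_image):
        """OCR one PIL image with every engine and return the best candidate."""
        results = []
        for lang, api in self.engines.items():
            api.SetImage(pil_image)
            text = api.GetUTF8Text().strip()
            # MeanTextConf() is the average confidence (0-100) of the last
            # recognition on this engine.
            results.append((lang, text, api.MeanTextConf()))
        return pick_best(results)

    def close(self):
        """Release all engines."""
        for api in self.engines.values():
            api.End()
```

A server process could build MultiLangOcr(["eng", "fra", "spa", "deu"]) once at start-up and call read() for each incoming picture, which is close to the "send a picture, get the text" loop asked about above.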