Is it possible to pass a numpy array to Tesseract instead of saving it to disk?


Ayush Pandey

Oct 30, 2019, 1:28:57 AM
to tesseract-ocr
Hi,

I want to run a trained Tesseract model from a Python script (I am currently using pytesseract for this). Is there a way to pass a numpy array to Tesseract without saving it to disk? Writing to disk is slow and time-consuming.

Thanks and Regards,
Ayush Pandey.

Zdenko Podobny

Oct 30, 2019, 3:06:16 AM
to tesser...@googlegroups.com
This is not possible with pytesseract, because it runs the tesseract executable with input read from disk.
If you use the Tesseract API directly, you need to convert the numpy array to a PIX (Leptonica) structure [1].


Zdenko


On Wed, Oct 30, 2019 at 6:29, Ayush Pandey <xapia...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ef745c36-3e31-42bc-b56e-c81bc5273f2e%40googlegroups.com.

Juanjo Serrano Lloria

Oct 30, 2019, 3:25:56 AM
to tesseract-ocr
Hi,

Perhaps a solution is to create a memory filesystem.

https://docs.pyfilesystem.org/en/latest/reference/memoryfs.html

Zdenko Podobny

Oct 30, 2019, 3:41:20 AM
to tesser...@googlegroups.com
The tesseract executable can read image data (not numpy!) from stdin and write the result to stdout, so at least the disk I/O can be avoided. I am not sure whether the first part (reading from stdin) can be implemented in pytesseract, but the second part should be no problem.
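A minimal sketch of the stdin/stdout route, assuming the `tesseract` executable is on the PATH and that numpy and Pillow are installed. The helper names `encode_png` and `ocr_via_stdin` are made up for illustration. The array is first encoded as a PNG in memory, because the executable expects an image file format on stdin and not raw array data.

```python
import io
import subprocess

import numpy as np
from PIL import Image


def encode_png(array: np.ndarray) -> bytes:
    """Encode a uint8 grayscale (H, W) or RGB (H, W, 3) array as PNG bytes in memory."""
    buffer = io.BytesIO()
    Image.fromarray(array).save(buffer, format="PNG")
    return buffer.getvalue()


def ocr_via_stdin(array: np.ndarray, lang: str = "eng") -> str:
    """Pipe the encoded image to the tesseract executable and return its stdout."""
    result = subprocess.run(
        # "stdin" and "stdout" replace the usual input file and output base name.
        ["tesseract", "stdin", "stdout", "-l", lang],
        input=encode_png(array),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")


# Usage (needs tesseract installed):
#     text = ocr_via_stdin(my_uint8_array)
```

This avoids the temporary file, but each call still starts a new process.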
 
If you are serious about performance, using the tesseract executable is not a good approach: each time tesseract starts, it has to initialize the language model, i.e. read several MB from disk, which is a pure waste of resources, especially for small images.
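One way to avoid paying the initialization cost per image is to create the API object once and reuse it for a whole batch. The sketch below assumes an API object with `SetImage` and `GetUTF8Text` methods, as tesserocr's `PyTessBaseAPI` provides. The function name `recognize_all` is hypothetical.

```python
def recognize_all(api, images):
    """Run OCR on many images with one already-initialized API object.

    `api` is assumed to expose SetImage() and GetUTF8Text(),
    for example a tesserocr.PyTessBaseAPI instance.
    """
    texts = []
    for image in images:
        api.SetImage(image)  # only the image changes; the model stays loaded
        texts.append(api.GetUTF8Text())
    return texts


# Usage with tesserocr (assumed installed), where pil_images is a list of PIL images:
#     from tesserocr import PyTessBaseAPI
#     with PyTessBaseAPI() as api:
#         texts = recognize_all(api, pil_images)
```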

Tesseract >= 4.1 also supports compressed language models, which can help too if disk I/O is the problem.

Zdenko


On Wed, Oct 30, 2019 at 8:25, Juanjo Serrano Lloria <juanjo....@letsrebold.com> wrote:

Ayush Pandey

Oct 30, 2019, 4:19:24 AM
to tesseract-ocr
Hi,
Thanks a lot for your responses, Zdenko and Juanjo.
Following your advice, Zdenko, I am now using tesserocr for inference:

from tesserocr import PyTessBaseAPI, PSM, OEM
from PIL import Image

with PyTessBaseAPI(psm=PSM.RAW_LINE, oem=OEM.LSTM_ONLY) as api:
    image = Image.open("test.jpg")
    api.SetImage(image)
    text = api.GetUTF8Text()


I compared it with pytesseract.image_to_string and got a speedup of over 3x. Thanks again, Zdenko. If you have any further suggestions, do let me know.
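To feed a numpy array in place of a file, the array can be wrapped in a PIL image in memory and then handed to `SetImage`. This is a sketch under stated assumptions: the array is uint8 with shape (H, W) or (H, W, 3) in RGB order, and the helper name `array_to_pil` is made up for illustration.

```python
import numpy as np
from PIL import Image


def array_to_pil(array: np.ndarray) -> Image.Image:
    """Wrap a uint8 grayscale (H, W) or RGB (H, W, 3) array as a PIL image, without touching disk."""
    if array.dtype != np.uint8:
        raise ValueError("expected a uint8 array, got %s" % array.dtype)
    # Note: arrays loaded with OpenCV are BGR and would need their channels
    # reversed first, for example array[:, :, ::-1].
    return Image.fromarray(array)


# Usage with tesserocr (assumed installed):
#     with PyTessBaseAPI(psm=PSM.RAW_LINE, oem=OEM.LSTM_ONLY) as api:
#         api.SetImage(array_to_pil(my_array))
#         text = api.GetUTF8Text()
```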




Lorenzo Bolzani

Oct 30, 2019, 4:32:41 AM
to tesser...@googlegroups.com
Hi,
when using the API through tesserocr, I call:

api.SetImageBytes(raw_img.tobytes(), raw_img.shape[1], raw_img.shape[0], 1, raw_img.shape[1])

I recommend this over pytesseract, even though the installation can sometimes be a little more complex.
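The call above fits a single-channel 8-bit image (1 byte per pixel, so one row is `shape[1]` bytes). A small helper can derive the same arguments for grayscale or multi-channel uint8 arrays. This is only a sketch: the name `set_image_bytes_args` is hypothetical, and the argument order (data, width, height, bytes per pixel, bytes per line) follows the call shown above.

```python
import numpy as np


def set_image_bytes_args(array: np.ndarray):
    """Return (data, width, height, bytes_per_pixel, bytes_per_line) for a uint8 array."""
    if array.dtype != np.uint8:
        raise ValueError("expected a uint8 array")
    if array.ndim == 2:
        bytes_per_pixel = 1  # grayscale
    elif array.ndim == 3:
        bytes_per_pixel = array.shape[2]  # e.g. 3 for RGB
    else:
        raise ValueError("expected a 2-D or 3-D array")
    height, width = array.shape[:2]
    # tobytes() returns a C-order copy, so each row is width * bytes_per_pixel bytes.
    return array.tobytes(), width, height, bytes_per_pixel, width * bytes_per_pixel


# Usage with tesserocr (assumed installed):
#     api.SetImageBytes(*set_image_bytes_args(raw_img))
```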

Lorenzo
