Python \ OpenCv \ Tesseract stdin

Aaron A

Mar 9, 2017, 12:25:18 AM
to tesseract-ocr
I have been unable to find an example in which Python passes an OpenCV image to Tesseract via stdin (as opposed to writing the image to a file and then passing Tesseract the file path).

Here is the code I have so far, but it throws an error.

import cv2
import numpy
import subprocess


frame = cv2.imread("image.jpg", 1)


command = ["tesseract",
'stdin',
'stdout']

tesseract_process = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)


result = tesseract_process.communicate(input=frame.tostring())[0]
result = tesseract_process.stdin.write(frame.tostring())
print(result.decode())

and the error I get is:

Error in fopenReadStream: file not found
$" #! !! && ** ** ** ** ++ *- '* (+ (+ *- #& "$ "$ !# ! " "$ !# ! " " ! ! ! ! ! " " ! " " ! " # " ! !
$" #! !! && ** ** ** ** ++ *- '* (+ (+ *- #& "$ "$ !# ! " "$ !# ! " " ! ! ! ! ! " " ! " " ! " # " ! ! cannot be read!
Error during processing.
ObjectCache(5A2E0A88)::~ObjectCache(): WARNING! LEAK! object 02770A70 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatalstm-punc-dawg)
ObjectCache(5A2E0A88)::~ObjectCache(): WARNING! LEAK! object 02752128 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatalstm-word-dawg)
ObjectCache(5A2E0A88)::~ObjectCache(): WARNING! LEAK! object 027521D8 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatalstm-number-dawg)
ObjectCache(5A2E0A88)::~ObjectCache(): WARNING! LEAK! object 0270BF30 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatapunc-dawg)
ObjectCache(5A2E0A88)::~ObjectCache(): WARNING! LEAK! object 035D29A8 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddataword-dawg)
ObjectCache(5A2E0A88)::~ObjectCache(): WARNING! LEAK! object 02F63088 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatanumber-dawg)
ObjectCache(5A2E0A88)::~ObjectCache(): WARNING! LEAK! object 02F67A88 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatabigram-dawg)
ObjectCache(5A2E0A88)::~ObjectCache(): WARNING! LEAK! object 02F67B30 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatafreq-dawg)


Any help or guidance would be really appreciated. 

I do know that there is a Python library, https://pypi.python.org/pypi/piltesseract, which passes images to Tesseract via stdin.
However, that library uses PIL and not OpenCV.

I've also tried:

tesseract_process.stdin.write(frame.tostring())
output = tesseract_process.stdout.read()
print output

but this simply hangs and never prints.
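
A likely reason for the hang: a program that reads its input from a pipe usually keeps reading until end-of-file, and the pipe only signals end-of-file once the writer closes it. Writing to stdin and then reading stdout without closing stdin leaves both sides waiting on each other. The sketch below uses a small Python child process as a stand-in for tesseract (an assumption, so that it runs without tesseract installed); `CHILD` and `run_child` are hypothetical names used only for illustration.

```python
import subprocess
import sys

# Stand-in for tesseract (assumption): a child that reads stdin until EOF
# and then prints how many bytes it received.
CHILD = "import sys; data = sys.stdin.buffer.read(); print(len(data))"


def run_child(payload: bytes) -> str:
    """Send payload to the child's stdin and return its stdout as text."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    # communicate() writes the payload, closes stdin (so the child sees EOF),
    # then reads stdout until the child exits.
    out, _ = proc.communicate(input=payload)
    return out.decode().strip()


print(run_child(b"\x00" * 1000))
```

Calling `proc.stdin.close()` after a manual `write()` should have the same effect for small payloads, but `communicate()` also avoids a deadlock when the output is large enough to fill the pipe buffer.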

Thank you

John Slade

Mar 9, 2017, 4:58:39 AM
to tesseract-ocr

Tesseract's stdin input doesn't accept raw numpy frames; the data needs to be encoded in an image format (such as PNG or BMP).


This is exactly what the Piltesseract library does using the PIL library:

https://github.com/Digirolamo/PILtesseract/blob/master/piltesseract/tesseractwrapper.py#L131:L143


In OpenCV you can do the equivalent using cv2.imencode:

http://docs.opencv.org/3.0-beta/modules/imgcodecs/doc/reading_and_writing_images.html#imencode
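
For intuition about what the encoding step adds: a raw frame is only a run of pixel bytes, with nothing that states its width, height, or format, so a reader on the other end of a pipe cannot interpret it. An image format wraps those bytes in a header. The sketch below builds a 24-bit BMP by hand using only the standard library; `encode_bmp` is a hypothetical helper written for illustration, not part of OpenCV or Tesseract, and it assumes tightly packed 8-bit BGR pixels stored top row first.

```python
import struct


def encode_bmp(raw_bgr: bytes, width: int, height: int) -> bytes:
    """Wrap raw top-down 8-bit BGR pixel bytes in a 24-bit BMP container."""
    row_size = width * 3
    if len(raw_bgr) != row_size * height:
        raise ValueError("raw buffer does not match width * height * 3")

    # Each BMP row is padded to a multiple of 4 bytes.
    padding = (-row_size) % 4
    rows = [
        raw_bgr[y * row_size:(y + 1) * row_size] + b"\x00" * padding
        for y in range(height)
    ]
    # With a positive height field, BMP stores the bottom row first.
    pixel_data = b"".join(reversed(rows))

    offset = 14 + 40  # file header + BITMAPINFOHEADER
    file_header = struct.pack(
        "<2sIHHI", b"BM", offset + len(pixel_data), 0, 0, offset)
    info_header = struct.pack(
        "<IiiHHIIiiII",
        40, width, height,   # header size, image dimensions
        1, 24, 0,            # planes, bits per pixel, no compression
        len(pixel_data),
        2835, 2835,          # roughly 72 DPI, in pixels per metre
        0, 0)                # no palette
    return file_header + info_header + pixel_data


# A 2x2 image: 12 raw bytes become a 70-byte file with a 54-byte header.
bmp = encode_bmp(bytes(range(12)), 2, 2)
print(len(bmp), bmp[:2])
```

The header is the only thing that lets a reader recover the dimensions from the byte stream. cv2.imencode produces this kind of container (and compressed ones such as PNG) directly from a frame, so a hand-written encoder like this is not needed in practice.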


The following code works for me:



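import cv2
import subprocess

# Load the image as a 3-channel BGR array (flag 1 = cv2.IMREAD_COLOR).
frame = cv2.imread("image.png", 1)

# Ask tesseract to read the image from stdin and write the text to stdout.
tesseract_process = subprocess.Popen(
    ["tesseract", "stdin", "stdout"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE)

# Encode the raw frame as BMP in memory; ret is False if encoding failed.
ret, img = cv2.imencode(".bmp", frame)

# communicate() sends the bytes, closes stdin, and waits for the output.
# tobytes() replaces the deprecated tostring() alias.
result = tesseract_process.communicate(input=img.tobytes())[0]
print(result.decode())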


John


Aaron A

Mar 11, 2017, 1:52:17 PM
to tesseract-ocr
Thanks so much for the working sample code! Perfecto.