sending image data directly to Tesseract

Flávio.

Feb 13, 2023, 3:23:13 PM
to tesseract-ocr
Hi, I'm a beginner. Is there a way to send image byte data directly to Tesseract from memory, without having to write a file to disk? I'm building a program in Dart, but I can also write Python, Java or JS.

Merlijn B.W. Wajer

Feb 13, 2023, 4:11:52 PM
to tesser...@googlegroups.com

Hi,
If you write/echo an image to the stdin of Tesseract, and tell it to
read from "-", this ought to work, like so:

> $ cat foo.png | tesseract - output

Of course you can do this (write to stdin) in most programming languages.

Cheers,
Merlijn

Flávio.

Feb 14, 2023, 1:11:35 PM
to tesseract-ocr
Sorry, how can I do that? I'm trying to send binary image data, not a path. The goal is to avoid writing a file to disk and use only memory. Could you please write some code that sends the binary data to the stdin of tesseract? It can be in Python, Dart or Java :( I've tried ChatGPT, but its answers are wrong and it gets lost.

Merlijn B.W. Wajer

Feb 14, 2023, 2:11:13 PM
to tesser...@googlegroups.com

Hi,

On 14/02/2023 19:10, Flávio. wrote:
> Sorry, how can I do that?  I'm trying to send image binary data, not a
> path. The goal is to not write a file to disk and use only memory. Could
> you please write a code that sends the data (binary) to the stdin of
> tesseract? it can be in Python, Dart or Java :(  I've tried ChatGPT but
> it is wrong and gets lost

Normally I'd say 'left as an exercise to the reader', but I happen to
have a snippet around that ought to give you a general idea.

This uses io.BytesIO in Python 3 as an in-memory stream to save the
image to; the stream then contains an uncompressed PNG (compression
would just slow things down). It assumes that the variable "pil_image"
contains a PIL.Image object.

The code to use just one core in Tesseract is of course entirely
optional. I didn't *test* this to work (I modified it a bit - it works
in another setting), but it should work in theory:

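> import io
> import logging
> import os
> import subprocess
>
> logger = logging.getLogger(__name__)
>
> with io.BytesIO() as output:
>     # Encode the image as an uncompressed PNG into the in-memory buffer
>     pil_image.save(output, format='PNG', compress=0, compress_level=0)
>     output.seek(0)
>
>     # Let's just use one core in tesseract
>     env = os.environ.copy()
>     env['OMP_THREAD_LIMIT'] = '1'
>
>     # First '-' reads the image from stdin, second '-' writes text to stdout
>     p = subprocess.Popen(['tesseract', '-', '-'],
>                          stdin=subprocess.PIPE,
>                          stdout=subprocess.PIPE,
>                          stderr=subprocess.PIPE,
>                          env=env)
>     # stdout holds the recognized text as bytes
>     stdout, stderr = p.communicate(output.read())
>     stderr = stderr.decode('utf-8')
>
>     if stderr:
>         logger.warning('tesseract_baselines stderr: %s', stderr)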


Regards,
Merlijn

Flávio.

Feb 14, 2023, 3:16:43 PM
to tesseract-ocr
Thanks, I'm still trying to figure it out. It seems that when the PIL image is saved, it unfortunately writes a temp file to disk. My goal is to not write to disk, because this application will read a lot of files and I want to spare my SSD. My code receives byte data from a Dart program (I checked that it is correct). So far the .py file looks like this, but I'm not getting anything in return.

import base64
import io
import logging
import os
import subprocess
import sys

from PIL import Image

logger = logging.getLogger(__name__)
tesseractPath = 'tesseract'  # assumption: on PATH; otherwise set the full path

def main():
    base64_image = sys.stdin.read()
    image_bytes = base64.b64decode(base64_image)
    with io.BytesIO(image_bytes) as input:
        pil_image = Image.open(input)
        with io.BytesIO() as output:
            pil_image.save(output, format='PNG', compress=0, compress_level=0)  # using disk!
            output.seek(0)

            env = os.environ.copy()
            env['OMP_THREAD_LIMIT'] = '1'

            p = subprocess.Popen([tesseractPath, '-', '-','-l','por'],
                                 stdin=subprocess.PIPE,
                                 stdout=subprocess.PIPE,
                                 stderr=subprocess.PIPE,
                                 env=env)
            output, stderr = p.communicate(output.read())
            stderr = stderr.decode('utf-8')

            if stderr:
                logger.warning('tesseract_baselines stderr: %s', stderr)
            else:
                # communicate() returns bytes: decode them, then write to stdout
                sys.stdout.write(output.decode('utf-8').strip())


if __name__ == '__main__':
     main()


Merlijn B.W. Wajer

Feb 14, 2023, 3:49:45 PM
to tesser...@googlegroups.com
Hi,

On 14/02/2023 21:16, Flávio. wrote:
> Thanks, i'm still trying to figure it out. It seems when the PIL image
> is saved, unfortunately it saves a temp file to disk. My goal is to not
> write to disk, because this application will read a lot of files and I
> want to spare my SSD. My code receives byte data from a Dart program (I
> checked it is correct).   So far the py file looks like this but i'm not
> getting anything in return.

At this point I maybe ought to reply off list, but you could also save
the images to a "tmpfs" on Linux if you don't want to deal with stdin
and have the images never hit the disk/ssd:
https://www.kernel.org/doc/html/v5.7/filesystems/tmpfs.html

In any case - when you write that PIL 'saves a temp file to disk', what
led you to that conclusion? The code below really shouldn't do that.

Regards,
Merlijn
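
The tmpfs route mentioned above could look roughly like this in Python. This is only a sketch under assumptions: /dev/shm is assumed to be a tmpfs mount (common on Linux, but worth checking on your system), and tmpfs_file is a hypothetical helper name, not part of Tesseract or any library.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def tmpfs_file(data, suffix=".png", tmp_dir="/dev/shm"):
    """Write data to a temporary file in tmp_dir and yield its path.

    tmp_dir is assumed to be a tmpfs mount, so the bytes live in RAM
    (or swap) rather than on the SSD. The file is removed on exit.
    """
    fd, path = tempfile.mkstemp(suffix=suffix, dir=tmp_dir)
    try:
        # Wrap the already-open descriptor so it is closed after writing
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        yield path
    finally:
        os.remove(path)
```

Inside `with tmpfs_file(image_bytes) as path:` the path can be handed to Tesseract like any ordinary file, for example `subprocess.run(['tesseract', path, '-'], capture_output=True)`.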

Flávio.

Feb 14, 2023, 3:59:11 PM
to tesseract-ocr
I'll look into that Linux option :) As for the save method, I called the show method on the object and it had a path in the temp directory. So I asked ChatGPT how the file could be on disk, and it told me that the save method created it. If you're right, then it's just another hallucination by the model. I use it to teach me, as I'm learning alone. Thanks for the replies 🙂

Merlijn B.W. Wajer

Feb 14, 2023, 4:04:36 PM
to tesser...@googlegroups.com
Hi,

On 14/02/2023 21:59, Flávio. wrote:
> I'll look into that Linux option :)  as for the save method, I used the
> show method on the object and it had a path in the temp directory. So I
> asked ChatGPT how the file could be on disk and it told me that the save
> method created it. If you're right then it's just another hallucination
> by the model. I use it to teach me, as I'm learning alone. Thanks for
> the replies 🙂

The show() method very likely saves it to a temporary path just for the
purpose of showing you the image. I'm pretty certain that the code you
mailed (modified from mine) doesn't save the file to disk.

And yes, tmpfs is another option.

Let's take it off list if you have any further questions not related
specifically to Tesseract. :-)

Regards,
Merlijn
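
For reference, the whole in-memory path discussed in this thread (base64 text in, raw bytes to Tesseract's stdin, text out) might be sketched as below. Assumptions: the decoded bytes are already in a format Tesseract can read directly (such as PNG, JPEG or TIFF), so no PIL re-encoding is done; ocr_from_base64 is a hypothetical helper name; and the cmd parameter exists only so the command can be swapped out. The sketch returns stdout even when stderr is non-empty, because Tesseract can print warnings on stderr for a run that otherwise succeeded.

```python
import base64
import subprocess


def ocr_from_base64(b64_text, cmd=("tesseract", "-", "-")):
    """Decode base64 image data and pipe the raw bytes to cmd's stdin.

    Returns (text, warnings) as str. This code itself never writes the
    image to disk; the bytes only travel through the pipe.
    """
    image_bytes = base64.b64decode(b64_text)
    result = subprocess.run(list(cmd),
                            input=image_bytes,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    # Both streams come back as bytes and need decoding before printing
    text = result.stdout.decode("utf-8").strip()
    warnings = result.stderr.decode("utf-8").strip()
    return text, warnings
```

In a script fed by another program, this could be called as `text, warnings = ocr_from_base64(sys.stdin.read())`, followed by `print(text)`. A language option such as `'-l', 'por'` can be appended to cmd.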