Tesseract security considerations


José Luis Mendoza Azanza

Dec 9, 2016, 12:39:20 AM

to tesseract-ocr

I am integrating Tesseract into an application, but I have some questions before going any further.

I think every application should have security filters and checks to guard against bad or malicious input data, so my questions are:

Does Tesseract have special code to handle bad or malicious input data?
Or does it just have a few validations that tell the user what the correct input data is?
Are releases made after some security review and testing?
Or just functional testing?

I would appreciate your answers.

Thanks a lot!

James R Barlow

Jan 13, 2017, 12:02:25 PM

to tesseract-ocr

On Thursday, December 8, 2016 at 9:39:20 PM UTC-8, José Luis Mendoza Azanza wrote:

I am integrating Tesseract into an application, but I have some questions before going any further.

I think every application should have security filters and checks to guard against bad or malicious input data, so my questions are:
Does Tesseract have special code to handle bad or malicious input data?

Bad data for Tesseract means an invalid image of some kind. Tesseract uses the Leptonica library, which does a number of sanity checks on images. Beyond that, it does not do anything special.

In its current form I would not consider it safe to allow a potential attacker to submit a chosen image to Tesseract. I would assume that remote code execution vulnerabilities exist. I would use ImageMagick or Pillow to sanitize the image before Tesseract gets to see it.
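
As a rough sketch of what sanitizing with Pillow could look like: decode the upload, reject anything outside an allowlist of formats or above a pixel cap, and write out a freshly encoded PNG for Tesseract to read. The names `sanitize_image`, `dimensions_ok`, `MAX_PIXELS` and `ALLOWED_FORMATS` are made up for this example, and the limits are assumptions to tune. Note that this only moves the parsing of untrusted bytes from Leptonica to Pillow; it does not protect against bugs in Pillow itself.

```python
MAX_PIXELS = 50_000_000                    # assumed cap, tune for your workload
ALLOWED_FORMATS = {"PNG", "JPEG", "TIFF"}  # assumed allowlist


def dimensions_ok(width, height, max_pixels=MAX_PIXELS):
    """Return True if the size is positive and under the pixel cap."""
    return width > 0 and height > 0 and width * height <= max_pixels


def sanitize_image(src_path, dst_path):
    """Decode an untrusted image with Pillow and re-encode it as PNG.

    Raises ValueError if the file is not an acceptable image.
    """
    # Imported here so dimensions_ok() stays usable without Pillow.
    from PIL import Image

    try:
        with Image.open(src_path) as img:
            if img.format not in ALLOWED_FORMATS:
                raise ValueError(f"unsupported format: {img.format}")
            if not dimensions_ok(*img.size):
                raise ValueError(f"unreasonable size: {img.size}")
            # convert() forces a full decode of the pixel data, so a
            # truncated or corrupt file fails here rather than later.
            clean = img.convert("RGB")
    except (OSError, Image.DecompressionBombError) as exc:
        raise ValueError(f"not a readable image: {exc}") from exc

    # The output file is encoded from scratch by Pillow.
    clean.save(dst_path, format="PNG")
    return dst_path
```

You would then hand only the file at `dst_path` to Tesseract, never the original upload, and ideally run the whole thing in a separate low-privilege process.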

Or does it just have a few validations that tell the user what the correct input data is?

The validation that's there is pretty basic for command-line input, and the API has even less.

Are releases made after some security review and testing?

To my knowledge, no, there's never been a formal security review. There's a lot of ugly legacy C++ and C and questionable practices in the code, honestly.

Or just functional testing?

The CI scripts only check that Tesseract compiles on some supported platforms. There's a test suite that checks OCR quality in a statistical sense, but not correctness or valid output per se.
