Too many diacritics can make process die?


Alberto Simoes

Feb 13, 2022, 2:08:38 AM

to tesseract-ocr

I am OCRing a lot of documents. I have a document with very poor quality, and surely nothing will be recognized. But I need a stable pipeline, and while I was expecting tesseract just to return an empty document, I am getting this error:

Detected 958 diacritics

Error during processing.

Is there anything I can do to use tesseract more reliably, without the chance of getting it to just die?

By the way, I am using it through pytesseract, but I do not think that is the problem.

Thank you

Merlijn B.W. Wajer

Feb 13, 2022, 3:15:01 AM

to tesser...@googlegroups.com

Hi,

On 12/02/2022 22:13, Alberto Simoes wrote:
> Hi
>
> I am OCRing a lot of documents. I have a document with very poor
> quality, and surely nothing will be recognized. But I need a stable
> pipeline, and while I was expecting tesseract just to return an empty
> document, I am getting this error:
>
> Detected 958 diacritics
> Error during processing.
>
> Is there anything I can do to use tesseract more reliably, without the
> chance of getting it to just die?

You can try using a different binarisation method, or cleaning up the
images before doing OCR. Do you have an example you can share?

Tesseract 5.0.0 should support -c thresholding_method=2 (Sauvola
binarisation instead of the default Otsu), and additionally you can
pass --dpi 300 (or whatever the actual resolution is) for your image.
That might make it more robust even without pre-processing your images.

> By the way, I am using it through pytesseract, but I do not think that
> is the problem.

I don't know if pytesseract supports these extra options, so you might
have to fiddle with that.
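
For what it is worth, a minimal sketch of how this could look from Python, assuming pytesseract's image_to_string() forwards its config string to the tesseract command line, and that its TesseractError is a subclass of RuntimeError (both are my understanding of the library, not something verified in this thread). The helper names build_config and safe_ocr, and the ocr hook, are hypothetical and only there for illustration. The wrapper returns an empty string when Tesseract fails on a page, so a batch pipeline can carry on.

```python
import sys


def build_config(dpi=300, thresholding_method=2):
    """Build the extra command-line options suggested above."""
    return f"--dpi {dpi} -c thresholding_method={thresholding_method}"


def safe_ocr(image, config=None, ocr=None):
    """Return the recognized text, or "" if Tesseract fails on this image.

    `ocr` is a hypothetical hook for swapping in another OCR callable;
    by default pytesseract.image_to_string is used.
    """
    if config is None:
        config = build_config()
    if ocr is None:
        # Imported lazily so this helper can be loaded without pytesseract.
        import pytesseract
        ocr = pytesseract.image_to_string
    try:
        return ocr(image, config=config)
    except RuntimeError as exc:
        # Assumption: pytesseract.TesseractError derives from RuntimeError,
        # so "Error during processing." style failures end up here.
        print(f"OCR failed for {image!r}: {exc}", file=sys.stderr)
        return ""
```

Calling safe_ocr("page.png") (a placeholder file name) would then give either the recognized text or an empty string, rather than an exception that stops the whole run.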

Regards,
Merlijn
