Hi,
On 12/02/2022 22:13, Alberto Simoes wrote:
> Hi
>
> I am OCRing a lot of documents. I have a document with very poor
> quality, and surely nothing will be recognized. But I need a stable
> pipeline, and while I was expecting tesseract just to return an empty
> document, I am getting this error:
>
> Detected 958 diacritics
> Error during processing.
>
> Is there anything I can do to use tesseract more reliably, without the
> chance of getting it to just die?
You can try using a different binarisation method, or cleaning up the
images before doing OCR. Do you have an example you can share?
Tesseract 5.0.0 should support -c thresholding_method=2, and you can
additionally pass --dpi 300 (or whatever the actual resolution of your
image is). That might make it more robust even without pre-processing
your images.
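
To make the suggestion above concrete, here is a rough sketch of the command line, built from Python. The file names and the helper names (build_tesseract_command, run_tesseract) are made up for illustration. I believe value 2 selects Sauvola binarisation in 5.0.0, but treat that as an assumption and check the output of `tesseract --print-parameters` on your build.

```python
import subprocess


def build_tesseract_command(image_path, output_base, dpi=300, thresholding_method=2):
    """Return the argv list for one tesseract run (hypothetical helper).

    Assumes Tesseract >= 5.0.0, where thresholding_method is a settable
    parameter; the default DPI of 300 is only a placeholder.
    """
    return [
        "tesseract", image_path, output_base,
        "--dpi", str(dpi),
        "-c", f"thresholding_method={thresholding_method}",
    ]


def run_tesseract(image_path, output_base, **kwargs):
    """Run tesseract and return True on exit code 0, False otherwise."""
    cmd = build_tesseract_command(image_path, output_base, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0
```

Checking the return code like this lets a batch job note the failed page and move on, rather than stopping at the first bad scan.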
> By the way, I am using it through pytesseract, but I do not think that
> is the problem.
I don't know whether pytesseract lets you pass these extra options
through, so you might have to fiddle with that.
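
If pytesseract's `config` argument forwards extra flags to the command line (its README describes it that way, but I have not tried it with these particular options), something along these lines might give you the "empty document instead of a crash" behaviour. The helper names build_config and ocr_or_empty are hypothetical, and the DPI value is a placeholder.

```python
def build_config(dpi=300, thresholding_method=2):
    """Build the extra-options string handed to pytesseract's `config` argument."""
    return f"--dpi {dpi} -c thresholding_method={thresholding_method}"


def ocr_or_empty(image_path, config=None):
    """Return the recognised text, or an empty string if tesseract fails.

    Assumes pytesseract is installed; it raises TesseractError when the
    tesseract process exits with a non-zero status.
    """
    import pytesseract  # imported lazily so build_config works without it

    try:
        return pytesseract.image_to_string(image_path, config=config or build_config())
    except pytesseract.TesseractError:
        # Treat an unreadable page as empty so the pipeline keeps going.
        return ""
```

Swallowing the error this way hides which pages failed, so you may want to log the file name before returning the empty string.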
Regards,
Merlijn