Improve text extraction when some text is inverted

Chris

Jul 2, 2021, 2:12:00 AM

to tesseract-ocr

I am experimenting with Tesseract 4.1.1 using C# to extract text from black and white or greyscale TIF images of semi-structured forms that are 300 dpi.

The results are really promising except when some of the text is inverted (i.e. white on black). In these cases the results are poor. Can anyone suggest ways to tackle this? All the discussions I have seen are for when the whole image is inverted, but here it is only some of the text.

Regards,

Chris

Merlijn B.W. Wajer

Jul 2, 2021, 4:36:52 AM

to tesser...@googlegroups.com

Hi,

Maybe give the latest 5.0.0 alpha a try? I believe it contains various
changes to inverted text handling, at least this:
https://github.com/tesseract-ocr/tesseract/pull/3141

Regards,
Merlijn

Zdenko Podobny

Jul 2, 2021, 6:56:26 AM

to tesser...@googlegroups.com

You provided no example, so just a hint: have a look at the Leptonica function pixAutoPhotoinvert [1], which should help in such cases. I believe the function is available from version 1.79.0 onward.

[1] https://github.com/DanBloomberg/leptonica/blob/5aaf1c187deeef7f47288c6b0833a07021940da7/src/pageseg.c#L2370-L2391
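
For readers who cannot call Leptonica directly, the general idea of inverting only the dark regions can be sketched as below. This is not a port of pixAutoPhotoinvert; it is a much simpler tile-based heuristic written in Python, and the function name `invert_dark_tiles`, the tile size, and the 0.6 threshold are all assumptions chosen for illustration. It expects an already binarized image given as rows of 0 (white) and 1 (black).

```python
from typing import List

# Hypothetical representation: 1 = black (ink) pixel, 0 = white pixel.
Image = List[List[int]]


def invert_dark_tiles(img: Image, tile: int = 32, fg_thresh: float = 0.6) -> Image:
    """Return a copy of img in which every mostly-black tile is inverted.

    A tile whose fraction of black pixels exceeds fg_thresh is assumed to
    hold white-on-black text and is flipped; other tiles are left alone.
    """
    height = len(img)
    width = len(img[0]) if height else 0
    out = [row[:] for row in img]  # copy so the input is not modified

    for top in range(0, height, tile):
        for left in range(0, width, tile):
            bottom = min(top + tile, height)
            right = min(left + tile, width)
            area = (bottom - top) * (right - left)
            black = sum(
                img[y][x] for y in range(top, bottom) for x in range(left, right)
            )
            if black / area > fg_thresh:
                for y in range(top, bottom):
                    for x in range(left, right):
                        out[y][x] = 1 - img[y][x]
    return out
```

The tile size would need to be tuned against the text height in the forms, and a tile that straddles the edge of a black box may be left only partly corrected, so the result is best checked on real samples.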

Zdenko

On Fri, Jul 2, 2021 at 8:11, 'Chris' via tesseract-ocr <tesser...@googlegroups.com> wrote:


Chris

Jul 2, 2021, 1:49:10 PM

to tesseract-ocr

Thanks to both of you for replying. I'm using Charles Weld's NuGet package (https://github.com/charlesw/tesseract/), so at the moment I think I am stuck on version 4.1.1. I have to admit Tesseract is a bit of a black box to me, and beyond setting a few variables I am at a bit of a loss as to how to use it.

I'm not sure whether I have access to Leptonica calls, and I am unsure whether my questions are better directed here or to Charles Weld.

Having looked at the pixAutoPhotoinvert code, I could try to implement something similar in C# before handing the image to Tesseract. Thanks for that. Worst case, I could get Tesseract to look at both the original image and an inverted image and then combine the results. Whilst simpler, that would double the time taken.
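
The combining step of that fallback might look roughly like the following Python sketch. It assumes both passes are run over the same fixed list of form regions, so the results line up one to one, and that each result carries a confidence score. The `Line` type and `merge_passes` function are hypothetical names and not part of Tesseract or the NuGet wrapper.

```python
from typing import List, NamedTuple


class Line(NamedTuple):
    """One recognised region: hypothetical container for text plus confidence."""
    text: str
    confidence: float  # assumed scale: 0 (worst) to 100 (best)


def merge_passes(normal: List[Line], inverted: List[Line]) -> List[Line]:
    """For each region, keep whichever pass reported the higher confidence.

    Assumes normal[i] and inverted[i] describe the same region of the form.
    """
    if len(normal) != len(inverted):
        raise ValueError("both passes must cover the same regions")
    return [a if a.confidence >= b.confidence else b
            for a, b in zip(normal, inverted)]


# Example with made-up values: the second region is white-on-black,
# so only the inverted pass reads it well.
normal_pass = [Line("Name: Smith", 92.0), Line("~~#|", 21.0)]
inverted_pass = [Line("", 0.0), Line("ACCOUNT NO", 88.0)]
merged = merge_passes(normal_pass, inverted_pass)
```

Choosing by confidence per region avoids having to detect the inverted areas up front, at the cost of the doubled processing time already noted.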

If it helps I could provide a sample C# project next week.

Chris
