How do I only detect text of one size?

Radu Stoicescu

Sep 25, 2020, 1:57:14 AM

to tesseract-ocr

I have some scanned, machine-typed documents that have a lot of noise. I can reduce the noise, and I have done so. But some of the noise is statistically indistinguishable from letters: it is as dark as the letters and as big as the letters, so I cannot simply remove it.

I have tried training Tesseract only on Courier New. The accuracy went down, which was expected because I did not use enough data, but letters were still detected in the noisy areas.

How can I keep Tesseract from detecting letters in noise? One simple rule would be to only detect characters of one size, since this is machine typed text.
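
To make the "one size only" rule concrete, here is a minimal sketch of it as a post-processing step rather than a Tesseract setting. It assumes the word boxes come from pytesseract's `image_to_data` with `output_type=Output.DICT` (which provides the `text` and `height` lists used here). The function name `filter_words_by_height` and the `tolerance` value are hypothetical choices for illustration. Noise that really is the same height as the letters would pass this filter, so it can only remove the outliers.

```python
from statistics import median


def filter_words_by_height(data, tolerance=0.35):
    """Keep words whose box height is close to the median word height.

    `data` is assumed to be a dict with parallel "text" and "height" lists,
    for example the result of:
        pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    `tolerance` is a hypothetical default: the allowed relative deviation
    from the median height.
    """
    # Rows with empty text are page/block/line entries, not words.
    words = [
        (text, int(height))
        for text, height in zip(data["text"], data["height"])
        if text.strip()
    ]
    if not words:
        return []

    # For single-size typed text, the median height is the "expected" size.
    typical = median(h for _, h in words)

    return [t for t, h in words if abs(h - typical) <= tolerance * typical]
```

Word heights vary with ascenders and descenders, so the tolerance has to stay fairly loose. A stricter variant could compare line heights or character pitch instead.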

Zdenko Podobny

Sep 25, 2020, 2:05:50 AM

to tesser...@googlegroups.com

Maybe it would be good to provide some examples of input.

Zdenko


Radu Stoicescu

Sep 25, 2020, 2:34:15 AM

to tesseract-ocr

The first image is the OCR result before any pre-processing, and the second image is the result after pre-processing. As you can see, there are a few problematic areas. I understand that very little can be done where the line-like noise is confused with an underscore, but something could be done about the two areas where the two "e"s and the single "e" are detected.

As I said, I tried to retrain the top using only Courier New, but the noise was still detected as letters, which surprised me. I had thought that the false positives were caused by the large number of different and strange characters Tesseract is trained on.
