I have some scanned, machine-typed documents that contain a lot of noise. I have already removed what I can, but some of the remaining noise is statistically indistinguishable from letters: it is as dark and as large as the letters themselves, so I cannot simply filter it out.
I have tried training Tesseract on Courier New only. The accuracy went down, which I expected because I did not use enough training data, but letters were still detected in the noisy areas.
How can I keep Tesseract from detecting letters in noise? One simple rule would be to detect only characters of one size, since this is machine-typed text.
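
To make the size rule concrete, here is a minimal sketch of the kind of post-filter I have in mind. It assumes the detected characters are already available as `(char, left, bottom, right, top)` tuples, the same field order as a line in a Tesseract box file. The function name `filter_by_size`, the `tolerance` parameter, and the sample coordinates are all made up for illustration. Note that this only drops blobs whose height differs from the typical character height, so it would not remove noise that matches the letters exactly.

```python
from statistics import median

# Each box is (char, left, bottom, right, top), the same field order as a
# Tesseract box-file line. The sample values below are invented.
def filter_by_size(boxes, tolerance=0.5):
    """Keep boxes whose height is within `tolerance` of the median height."""
    if not boxes:
        return []
    heights = [top - bottom for _, _, bottom, _, top in boxes]
    typical = median(heights)
    low, high = typical * (1 - tolerance), typical * (1 + tolerance)
    return [box for box, h in zip(boxes, heights) if low <= h <= high]

boxes = [
    ("H", 10, 100, 22, 120),   # height 20
    ("e", 24, 100, 36, 114),   # height 14
    ("l", 38, 100, 50, 120),   # height 20
    ("o", 52, 100, 64, 114),   # height 14
    (".", 70, 60, 130, 140),   # height 80: large blob
    ("'", 5, 30, 8, 33),       # height 3: small speck
]

kept = filter_by_size(boxes)
print("".join(ch for ch, *_ in kept))
```

The tolerance has to be loose enough to keep both lowercase and uppercase letters, which differ in height even in a monospaced typeface, so the threshold would need tuning on real scans.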