Hi all,
I have several scanned documents with a lot of noise between lines and between words. Tesseract fails to ignore this noise: it either merges a speck into the next character or recognizes it as a separate character, often a dot or comma. I attached an image that shows some of that noise.
I am using the latest SVN version of Tesseract 3.03. Tesseract 3.02 does slightly better at ignoring the noise.
Now my questions are:
- Which configuration parameters (and possibly hard-coded constants) inside Tesseract affect the classification of noise vs. good blobs?
- Is there a way to define a minimum number of pixels or dimensions for a connected component?
- Is there a way to limit the scaling of a blob, so that it won't get matched to a character prototype?
I already found the configuration parameters heavy_noise_reduction and textord_noise_hfract. heavy_noise_reduction gives me bad results. By tweaking textord_noise_hfract I can get better results, but they still aren't satisfying. Maybe there's a better alternative in the code that I can't find.
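To make the second question concrete, here is a rough sketch of the kind of minimum-size filter I have in mind, written as a preprocessing step outside Tesseract. It is not Tesseract code and uses no Tesseract API. The function name remove_small_components, the min_pixels threshold and the list-of-rows image format (1 = ink, 0 = background) are all made up for illustration, and I am assuming 8-connectivity.

```python
from collections import deque


def remove_small_components(image, min_pixels):
    """Return a copy of a binary image (list of rows, 1 = ink, 0 = background)
    in which every 8-connected ink component with fewer than min_pixels
    pixels has been erased. Hypothetical helper, not part of Tesseract."""
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill collects one connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Components below the threshold are treated as speckle noise.
            if len(component) < min_pixels:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result


if __name__ == "__main__":
    page = [
        [1, 1, 0, 0, 0],
        [1, 1, 0, 0, 1],
        [0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
    ]
    for row in remove_small_components(page, min_pixels=3):
        print(row)
```

A pure pixel-count threshold like this would probably also remove real punctuation and the dots on i and j, so what I am really after is something that takes the text line height into account, which is why I would prefer to do it inside Tesseract.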
Regards,
Paul