Handling noise

Paul

Aug 2, 2014, 4:14:07 PM

to tesser...@googlegroups.com

Hi all,

I have several scanned documents with a lot of noise between lines and between words. Tesseract fails to ignore this noise: it either merges a speck into the next character or treats it as a separate character, often a dot or a comma. I attached an image that shows some of that noise.

I am using the latest SVN version of Tesseract 3.03. Tesseract 3.02 does slightly better at ignoring the noise.

Now my questions are:

What are the configuration parameters (and possibly hard-coded constants) inside Tesseract that affect the noise vs. good blob classification?
Is there a way to define a minimum number of pixels or dimensions for a connected component?
Is there a way to limit the scaling of a blob, so that it won't get matched to a character prototype?

I already found the configuration parameters heavy_noise_reduction and textord_noise_hfract. heavy_noise_reduction gives me bad results, and by tweaking textord_noise_hfract I can get better results, but they still aren't satisfactory. Maybe there is a better alternative in the code that I haven't found.
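
To illustrate what I mean by a minimum size for a connected component, here is a rough sketch of that kind of filter, written as a standalone preprocessing step in Python rather than as Tesseract code. The function name remove_small_components and the min_pixels and min_height thresholds are made up for illustration and do not correspond to any Tesseract parameter. The image is assumed to be already binarized, as a list of rows with 1 for ink and 0 for background.

```python
from collections import deque


def remove_small_components(image, min_pixels=4, min_height=2):
    """Return a copy of a binary image with small 8-connected blobs erased.

    image is a list of rows, where 1 means ink and 0 means background.
    A blob is erased if it has fewer than min_pixels pixels or is
    shorter than min_height rows. Both thresholds are illustrative.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    result = [row[:] for row in image]
    seen = [[False] * cols for _ in range(rows)]

    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            ys = [y for y, _ in component]
            height = max(ys) - min(ys) + 1
            # Erase the blob if it is too small to be a plausible character.
            if len(component) < min_pixels or height < min_height:
                for y, x in component:
                    result[y][x] = 0
    return result
```

With fixed thresholds like these, real punctuation such as periods and the dots on i and j would be removed as well, so the limits would probably have to scale with the estimated text height. That is why I would prefer a setting inside Tesseract, which already knows the line size.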

Regards,

Paul

noise.png
