Handling noise

Paul
Aug 2, 2014, 4:14:07 PM
to tesser...@googlegroups.com
Hi all,

I have several scanned documents that have a lot of noise between lines and between words. Tesseract fails to ignore it: it either merges the specks into the next character or treats them as separate characters, often a dot or comma. I attached an image that shows some of that noise.

I am using the latest SVN version of Tesseract 3.03. Tesseract 3.02 does slightly better at ignoring the noise.

Now my questions are:
  1. Which configuration parameters (and possibly hard-coded constants) inside Tesseract affect the classification of a blob as noise vs. a good blob?
  2. Is there a way to define a minimum pixel count or minimum dimensions for a connected component?
  3. Is there a way to limit how much a blob is scaled, so that it won't get matched to a character prototype?
I already found the configuration parameters heavy_noise_reduction and textord_noise_hfract. heavy_noise_reduction gives me bad results, and tweaking textord_noise_hfract gives better results, but they still aren't satisfying. Maybe there's a better alternative in the code that I can't find.
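
To make question 2 concrete, here is a minimal sketch of the kind of filter I have in mind. It is written as a standalone preprocessing step, not as anything that exists in Tesseract. The function name, the thresholds, and the list-of-rows image representation (1 = ink, 0 = background) are all made up for illustration.

```python
from collections import deque


def remove_small_components(grid, min_pixels=4, min_width=1, min_height=1):
    """Return a copy of a binary image with small 8-connected components erased.

    Hypothetical helper: a component is erased if it has fewer than
    min_pixels ink pixels, or if its bounding box is narrower than
    min_width or shorter than min_height.
    """
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    out = [list(row) for row in grid]
    seen = [[False] * cols for _ in range(rows)]

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects every pixel of this component.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            height = max(ys) - min(ys) + 1
            width = max(xs) - min(xs) + 1
            if len(pixels) < min_pixels or width < min_width or height < min_height:
                # Too small to be a real glyph under these thresholds: erase it.
                for y, x in pixels:
                    out[y][x] = 0
    return out


if __name__ == "__main__":
    image = [
        [1, 1, 1, 0, 0],
        [1, 1, 1, 0, 0],
        [1, 1, 1, 0, 1],  # the lone pixel on the right plays the role of a speck
    ]
    for row in remove_small_components(image, min_pixels=4):
        print(row)
```

The obvious drawback is that a fixed threshold like this would also delete genuine small marks such as periods and the dots on "i", which is why I would prefer a setting inside Tesseract that takes the text line height into account.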

Regards,
Paul
noise.png