Ideal config settings for finetuned monospace text?

Dustin Spicuzza

Sep 13, 2019, 2:00:51 AM
to tesseract-ocr
Hey,

Using @shreeshrii's excellent examples at https://github.com/Shreeshrii/tessdata_shreetest, I've fine-tuned on a single monospace font with a giant pile of representative data. With very little effort, the recognition results are significantly better than with the stock English data: just a few errors per page. Thanks so much!

However, I'd like to get even closer to zero errors. I've been trying to constrain my problem in an effort to get better results:
  • Known monospaced font, font size, page size
  • Known character set (ASCII)
  • Data layout is fairly consistent
Are there configuration settings that I can use to give tesseract hints about the nature of the data? I don't want it to do layout analysis, block detection, or anything particularly fancy; I just want it to recognize all the text and give it to me. I've been using page segmentation mode 6 (Assume a single uniform block of text). I've been going through the wiki, but I haven't been able to make much more progress there.

Thanks for any tips!

Dustin
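
For reference, the constraints in the question (fixed PSM, known ASCII character set) could be passed on the command line roughly as sketched below. This is only an illustration, not a tested recipe: the model name `monofont` and the file names are hypothetical, and whether `tessedit_char_whitelist` is honored by the LSTM engine depends on the Tesseract version (it reportedly is not in 4.0). `preserve_interword_spaces` is a real config variable that may help keep column alignment in monospaced output, but its benefit here is an assumption.

```python
import string
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng", psm=6,
                            whitelist=None, keep_spaces=True):
    """Build an argument list for the tesseract CLI (nothing is executed here)."""
    cmd = ["tesseract", image_path, output_base, "-l", lang, "--psm", str(psm)]
    if whitelist:
        # Restrict recognition to a known character set.
        cmd += ["-c", "tessedit_char_whitelist=" + whitelist]
    if keep_spaces:
        # Keep runs of spaces instead of collapsing them.
        cmd += ["-c", "preserve_interword_spaces=1"]
    return cmd


# Printable, non-whitespace ASCII as the known character set.
ASCII_CHARS = string.ascii_letters + string.digits + string.punctuation

if __name__ == "__main__":
    # "monofont" is a hypothetical name for the fine-tuned traineddata file.
    cmd = build_tesseract_command("page.png", "page_out", lang="monofont",
                                  psm=6, whitelist=ASCII_CHARS)
    # Passing a list (no shell) avoids quoting problems with punctuation.
    subprocess.run(cmd, check=True)
```

Building the command as a list makes it easy to switch `psm` or drop the whitelist and compare the results on the same pages.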

Timothy Snyder

Sep 13, 2019, 9:15:49 AM
to tesser...@googlegroups.com
Have you tried using PSM 13? I get a few percent better accuracy with it than with PSM 6 on my dataset. Also, what kind of image preprocessing are you doing? I've reclaimed a ton of accuracy by fine-tuning my preprocessing. Mind posting some pictures of what you're recognizing?
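
To check whether a change such as PSM 13 versus PSM 6 (or a different preprocessing step) actually helps, one option is to score each OCR output against a hand-verified transcription. The sketch below computes a character error rate from the Levenshtein edit distance; the sample strings are made up for illustration, and this is not necessarily how the accuracy figures in this thread were measured. Note that PSM 13 treats the image as a single raw text line, so a comparison would presumably be run on line-cropped images.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def char_error_rate(truth, ocr):
    """Edit distance divided by the length of the ground-truth text."""
    if not truth:
        return 0.0 if not ocr else 1.0
    return edit_distance(truth, ocr) / len(truth)


if __name__ == "__main__":
    # Hypothetical ground truth and OCR outputs for one text line.
    truth = "MOV R1, R2"
    outputs = {"psm 6": "MOV Rl, R2", "psm 13": "MOV R1, R2"}
    for name, text in outputs.items():
        print(name, char_error_rate(truth, text))
```

Averaging this score over a held-out set of pages gives a single number to compare settings with, which is more reliable than eyeballing a few pages.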
