Ideal config settings for fine-tuned monospace text?


Dustin Spicuzza

Sep 13, 2019, 2:00:51 AM

to tesseract-ocr

Hey,

Using @shreeshrii's excellent examples at https://github.com/Shreeshrii/tessdata_shreetest, I've fine-tuned on a single monospace font with a giant pile of representative data. With very little effort, the recognition results have been significantly better than with the stock English data -- just a few errors per page. Thanks so much!

However, I'd like to get even closer to zero errors. I've been trying to constrain my problem in an effort to get better results:

- Known monospaced font, font size, page size
- Known character set (ASCII)
- Data layout is fairly consistent

Are there configuration settings that I can use to give tesseract hints about the nature of the data? I don't really want it to do layout analysis, block detection, or anything particularly fancy; I just want it to recognize all the text and give it to me. I've been using page segmentation mode 6 (--psm 6, "Assume a single uniform block of text"). I've been going through the wiki, but I haven't been able to make much more progress there.
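For concreteness, here is a minimal sketch of the kind of invocation described above, wrapped in Python. The flags `--psm`, `-l`, `--tessdata-dir` and `-c VAR=VALUE` are standard tesseract CLI options, and `preserve_interword_spaces`, `load_system_dawg` and `load_freq_dawg` are existing config variables. The model name `mono`, the file name `page.png`, and the helper names are hypothetical, and whether turning the dictionaries off actually helps on this data is an assumption that would need to be measured.

```python
import subprocess

def build_tesseract_command(image_path, lang="mono", psm=6,
                            tessdata_dir=None, config_vars=None):
    """Return the argv list for one tesseract CLI call that writes text to stdout.

    `lang` is the name of the fine-tuned traineddata ("mono" is a placeholder).
    """
    cmd = ["tesseract", image_path, "stdout", "--psm", str(psm), "-l", lang]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    # Each config variable is passed as a separate "-c NAME=VALUE" pair.
    for name, value in (config_vars or {}).items():
        cmd += ["-c", f"{name}={value}"]
    return cmd

def recognize(image_path, **kwargs):
    """Run tesseract (must be installed and on PATH) and return the recognized text."""
    result = subprocess.run(build_tesseract_command(image_path, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout

# Candidate hints to experiment with; none is guaranteed to reduce errors.
EXAMPLE_VARS = {
    "preserve_interword_spaces": 1,  # keep runs of spaces, useful for column layouts
    "load_system_dawg": 0,           # skip the main word dictionary
    "load_freq_dawg": 0,             # skip the frequent-word dictionary
}
```

Calling `recognize("page.png", lang="mono", psm=6, config_vars=EXAMPLE_VARS)` would then run one page through the fine-tuned model; comparing the error count with and without each variable is the only reliable way to tell whether a given hint helps.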

Thanks for any tips!

Dustin

Timothy Snyder

Sep 13, 2019, 9:15:49 AM

to tesser...@googlegroups.com

Have you tried using PSM 13? On my dataset it gives a few percent better accuracy than PSM 6. Also, what kind of image preprocessing are you doing? I've reclaimed a ton of accuracy by fine-tuning my preprocessing. Mind posting some pictures of what you're recognizing?
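As an illustration only: PSM 13 treats the image as a single raw text line, so a page has to be cut into line images before it can be used. The sketch below shows one common way to do that, using Otsu binarization followed by a horizontal projection profile. It is not the preprocessing referred to in this reply (which was not described); the function names are hypothetical, and the image is assumed to be a list of rows of 8-bit grayscale values with dark ink on a light background.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels <= t are treated as the dark class (ink).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue          # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break             # everything is in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t

def binarize(rows, threshold):
    """Map each pixel to 1 (ink) if it is <= threshold, else 0 (background)."""
    return [[1 if p <= threshold else 0 for p in row] for row in rows]

def line_bands(binary_rows, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, of consecutive rows
    that contain at least `min_ink` ink pixels."""
    bands, start = [], None
    for y, row in enumerate(binary_rows):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(binary_rows)))
    return bands
```

Each band could then be cropped (with a few pixels of padding) and passed to tesseract with `--psm 13`. With a known font size and page size, the expected line height can also serve as a sanity check on the bands that are found.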
