How to get started / lack of resources

Stefan Heise

Jun 5, 2012, 2:36:10 PM
to ocr...@googlegroups.com
I'm trying to get started with OCRopus and find it very cumbersome. Which of you are actually using OCRopus productively, and how did you learn it?

This is where I currently stand:
After a bit of research and a bugfix I managed to install OCRopus 0.5 and do a test run that didn't return a fatal error. The result is unusable, though, so now I need to get into the details of training and configuration. The first point I want to improve is the binarization, which returned unusable results: way too light, with basically nothing left to recognize in the binarized picture. "ocropus-preproc -h" lists some parameters to tweak: ground truth extension, zoom, character component size, halftone removal, deskewing, sigma and k value. The issue is that I don't really know what any of these parameters mean exactly, or how to use them sensibly. Sure, there is Google and Wikipedia, and I have watched all the YouTube videos available, but at the end of the day I could not find concrete steps for improving my binarization results. I tried some estimated numbers for sigma and k, but that apparently had no effect whatsoever.

What I - and apparently other newbie users around here - really need is a manual-like introduction to the whole system, like: "A ground truth is defined as abc, while a ground truth extension is xyz. ... Parameter x needs to be a value between y and z, lower x means ... higher x means..."
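For readers wondering what parameters named sigma and k typically control in a binarizer, here is a minimal sketch of Sauvola-style local thresholding. It is not taken from the OCRopus source, and it is only an assumption that ocropus-preproc's sigma and k play these roles. In this sketch sigma is the width of a Gaussian window used for local statistics, and k scales how far the threshold drops below the local mean; R and all function names are made up for illustration.

```python
import math

def gaussian_kernel(sigma):
    """1-D Gaussian weights covering about +/- 3 sigma, normalized to sum to 1."""
    radius = max(1, int(3 * sigma + 0.5))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Blur every row with the kernel, repeating edge pixels at the borders."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        out.append([
            sum(w * row[min(max(x + j - radius, 0), n - 1)]
                for j, w in enumerate(kernel))
            for x in range(n)
        ])
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def gaussian_blur(image, sigma):
    """Separable 2-D Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))

def sauvola_binarize(gray, sigma=2.0, k=0.3, R=0.5):
    """Sauvola-style thresholding (illustrative, not the OCRopus implementation).

    gray: 2-D list of floats in [0, 1], where 0 is black.
    Returns a 2-D list with 1 for ink and 0 for background.
    """
    mean = gaussian_blur(gray, sigma)
    mean_sq = gaussian_blur([[v * v for v in row] for row in gray], sigma)
    result = []
    for y, row in enumerate(gray):
        out_row = []
        for x, v in enumerate(row):
            # Local standard deviation from E[v^2] - E[v]^2, clamped at zero.
            std = math.sqrt(max(mean_sq[y][x] - mean[y][x] ** 2, 0.0))
            # Higher k pushes the threshold further below the local mean.
            threshold = mean[y][x] * (1.0 + k * (std / R - 1.0))
            out_row.append(1 if v < threshold else 0)
        result.append(out_row)
    return result

# Synthetic 9x9 page: light background (0.9) with one faint vertical stroke (0.7).
page = [[0.7 if x == 4 else 0.9 for x in range(9)] for _ in range(9)]
too_light = sauvola_binarize(page, sigma=2.0, k=0.3)
recovered = sauvola_binarize(page, sigma=2.0, k=0.05)
print(sum(map(sum, too_light)), sum(map(sum, recovered)))
```

In this toy setup the faint stroke vanishes entirely at k=0.3 and is kept at k=0.05, so under these assumptions a "way too light" result would point toward lowering k. A larger sigma averages over a wider neighbourhood, which makes the threshold less sensitive to individual strokes.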

I feel like there must be an OCRopus bootcamp somewhere, maybe a lecture or a manual that I just completely missed in my search and that enabled all the other users to actually make productive use of OCRopus. I'm a computer scientist and somewhat experienced software developer, so I can take technical language and am a quick learner. I'd even be willing to pay someone to teach me (within reasonable boundaries) or would be willing to write such a manual in return. Can anyone help me by pointing me to the right resources, or is personal training for OCRopus usage (maybe remote) available?

Tom

Jun 13, 2012, 7:47:25 PM
to ocr...@googlegroups.com

> The first point I want to improve is the binarization, which returned unusable results - way too light, there was basically nothing more to recognize in the binarized picture.

We already have much better versions of preprocessing and layout analysis, and they are going to be released soon.  They work robustly on a much wider range of images than the current preprocessing.  The 0.6 release is going to include those, plus a new line segmenter.

> I feel like there must be an OCRopus bootcamp somewhere, maybe a lecture or a manual that I just completely missed in my search and that enabled all the other users to actually make productive use of OCRopus.

Not yet.  There was for the old C++ version, but that's obsolete now.  Documentation is something we'll tackle after the next release.

Tom