Assertion failure in ocropus-align aborts processing of remainder of files

MattJ

Nov 11, 2012, 7:35:57 PM

to ocr...@googlegroups.com

I'm currently training Ocropus character models, and I'm following the example of fraktur-boxes and uw3-500. In the ocropus-align step, I've noticed that some lines will fail the e.seg[1]==0 assertion on line 234. Once this happens, processing stops for the remainder of the files. I've patched this to abandon the current line and continue the for loop, but I'm reluctant to submit a patch as I don't really follow what exactly is happening here.

I can provide you with any additional help in nailing this problem down. I anticipate using Ocropus for a research project for the next year and a half, so I hope I'll be able to contribute patches for anything I'm able to make better. Thanks for this invaluable set of tools.

Tom

Nov 20, 2012, 1:20:17 AM

to ocr...@googlegroups.com

Hi,

On Monday, November 12, 2012 1:35:58 AM UTC+1, MattJ wrote:

I'm currently training Ocropus character models, and I'm following the example of fraktur-boxes and uw3-500. In the ocropus-align step, I've noticed that some lines will fail the e.seg[1]==0 assertion on line 234. Once this happens, processing stops for the remainder of the files. I've patched this to abandon the current line and continue the for loop, but I'm reluctant to submit a patch as I don't really follow what exactly is happening here.

The line basically says that if there is a space in the transcription, there shouldn't be a corresponding set of pixels in the segmentation. I'm not sure why this is happening, but if it happens rarely, it's probably safe to skip such lines. All you care about with alignment is to get a large amount of training data.
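To make the check and the "skip the line, keep going" workaround concrete, here is a minimal sketch. It is not the actual ocropus-align code: the function names, the tuple layout, and the per-segment pixel counts are all assumptions made for illustration. It only models the idea that a space in the transcription should map to an empty segment, and that a line violating this is dropped instead of stopping the whole run.

```python
# Hypothetical sketch: none of these names come from ocropus-align itself.

def space_has_pixels(transcript, seg_pixel_counts):
    """Return True if any space in the transcript maps to a non-empty segment.

    transcript: aligned characters, one per segment (assumed layout)
    seg_pixel_counts: foreground pixel count of each aligned segment (assumed)
    """
    return any(ch == " " and n != 0
               for ch, n in zip(transcript, seg_pixel_counts))


def collect_training_lines(lines):
    """Keep lines that satisfy the check; skip the others instead of aborting."""
    kept, skipped = [], []
    for name, transcript, seg_pixel_counts in lines:
        if space_has_pixels(transcript, seg_pixel_counts):
            skipped.append(name)
            continue  # abandon this line only, keep processing the rest
        kept.append(name)
    return kept, skipped


# Made-up example data: the second line has pixels under its space.
lines = [
    ("line1", "a b", [12, 0, 9]),
    ("line2", "a b", [12, 3, 9]),
    ("line3", "ab", [7, 8]),
]
kept, skipped = collect_training_lines(lines)
```

Recording the skipped names makes it easy to confirm that the failures really are rare before relying on the remaining lines as training data.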

ocropus-align implements Viterbi alignment. In the long term, we'll probably move to forward-backward training, which tends to be better behaved.

OCRopus 0.7 will contain a new recognizer based on recurrent neural networks; training that is much simpler and may be a better match to your needs.

Tom

MattJ

Nov 21, 2012, 9:33:47 AM

to ocr...@googlegroups.com

Thanks Tom, training my character model has been fairly successful (~0.1 edit distance/character in a diverse and degraded document set). Only very few lines had this problem.
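For readers who want to reproduce this kind of figure, the following is a rough sketch of one way to compute edit distance per character. It assumes the metric is total Levenshtein distance divided by the total number of ground-truth characters; the exact normalization used above is not stated, and the function names and sample strings are illustrative only.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edits_per_character(pairs):
    """Total edits over total ground-truth characters (assumed normalization)."""
    total_edits = sum(levenshtein(truth, ocr) for truth, ocr in pairs)
    total_chars = sum(len(truth) for truth, _ in pairs)
    return total_edits / total_chars if total_chars else 0.0


# Made-up (ground truth, OCR output) pairs for illustration.
pairs = [("kitten", "sitting"), ("abcd", "abcd")]
rate = edits_per_character(pairs)
```

Summing edits and characters over all lines before dividing weights long lines more heavily than averaging per-line rates would, which is usually what is wanted for a corpus-level figure.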

Tom

Nov 21, 2012, 7:33:40 PM

to ocr...@googlegroups.com

OK. Incidentally, the same training data should be usable with the new 0.7 recognizer. I hope we'll be releasing that before the end of the year. I put up some sample results on ocropus.org.

Tom

Sriranga(78yrsold)

Nov 22, 2012, 2:09:15 AM

to ocr...@googlegroups.com

Tom,

Rather than waiting until the end of the year for the new 0.7 recognizer, it would be nice to have a pre-release 0.7 recognizer for beta testing, so that we can send you feedback.

For this purpose, it would help to provide step-by-step instructions, with examples of the command lines the user should run.
Kindly ensure that all relevant Python programs support UTF-8.

I have attached sample Kannada files (txt, tif, box; generated and used in tesseract-ocr) for testing and research. I am ready to do beta testing and send you feedback at any time.

With warmest regards,
-sriranga(79yrs)


k27u.txt

k27u.tif

k27u.box
