Upgrading Ocropus versions

Jonathan Hung

Jan 16, 2012, 12:51:11 PM

to Thomas Breuel, dec...@googlegroups.com

Hi Thomas,

Currently we are using Decapod 0.4.4 in our development, and we were wondering whether it is worthwhile to upgrade to a later version of Ocropus. What are your thoughts on this?

The reasons to upgrade would be:

- Quality improvements to segmentation

- Bug fixes

If it's feasible, we would like to do an Ocropus upgrade in this release.

Thanks,

- Jon.

Jonathan Hung

Feb 7, 2012, 12:29:18 PM

to Thomas Breuel, dec...@googlegroups.com

Hi Thomas,

I'm wondering if we can revisit this discussion about upgrading Ocropus. If an upgrade is merited, we would like to schedule it into the roadmap and make the necessary updates to dependent code.

Thoughts?

- Jonathan.

Jonathan Hung

Feb 10, 2012, 2:39:38 PM

to Thomas Breuel, dec...@googlegroups.com

To be more specific, we have a few known issues, and we are wondering whether an upgrade to Ocropus may help with them:

1. DECA-238: Type 2 PDFs have poor OCR results with reasonably captured documents

2. DECA-58: Export to PDF skips over pages that do not have detected characters

3. DECA-211: Certain PNG/JPG files create colour inverted PDF

Thanks.

- Jonathan.

Thomas Breuel

Feb 13, 2012, 11:43:08 PM

to Jonathan Hung, dec...@googlegroups.com

Hi,

Sorry for not responding earlier. I'm still training the new line recognizer. The error rate is cut in half relative to the old recognizer when measured on UW3 using preliminary models, and it may get a bit better yet. The final recognizer will take a few more weeks of training and testing (it's largely an automatic process). To get to this point, I wrote an entirely new classifier, plus a new testing infrastructure, and did a lot of data wrangling. The display, editing, and data interchange are now based on HDF5 (much faster than sqlite). Plus there has been a lot of refactoring, bug fixing, etc. The language modeling and alignment code have also been rewritten.

I'm still not sure exactly what form I'm going to push it out in; right now, it's separate from the ocropy package, and I may leave it that way, or I may integrate it. There has also been a lot of refactoring in other parts of OCRopus that affects installation. However, the command lines have generally remained the same.

Here is how that may (or may not) affect these bugs:

1. DECA-238: Type 2 PDFs have poor OCR results with reasonably captured documents -- This is probably due to resolution issues. It may or may not be fixed by the new recognizer (the new recognizer is more robust to scale changes than the old one).

2. DECA-58: Export to PDF skips over pages that do not have detected characters -- This is part of the page segmentation and would need to be addressed in ocropus-binarize. The binarizer is now a standalone command line Python program that should be easier to modify.

3. DECA-211: Certain PNG/JPG files create colour inverted PDF -- There was some automatic logic in the old binarizer for detecting inverted pages. It should probably just be disabled.

Tom

Jonathan Hung

Feb 23, 2012, 10:27:31 AM

to Thomas Breuel, dec...@googlegroups.com

Hi Tom,

Thanks for responding. I'm glad to hear about the significant advancements in the upcoming release. Any idea when a release can be expected?

We're excited to get our hands on it to test and integrate it into Decapod.

- Jonathan.
