Hi,
sorry for not responding earlier. I'm still training the new line recognizer. The error rate is cut in half relative to the old recognizer when measured on UW3 using preliminary models, and it may get a bit better yet. The final recognizer will take a few more weeks of training and testing (it's largely an automatic process). To get to this point, I wrote an entirely new classifier and a new testing infrastructure, and did a lot of data wrangling. The display, editing, and data interchange are now based on HDF5 (much faster than sqlite). Plus there has been a lot of refactoring, bug fixing, etc. The language modeling and alignment code has also been rewritten.
I'm still not sure exactly what form I'm going to push it out in; right now, it's separate from the ocropy package, and I may leave it that way, or I may integrate it. There has also been a lot of refactoring in other parts of OCRopus that affects installation. However, the command lines have generally remained the same.
Here is how that may (or may not) affect these bugs:
1. DECA-238: Type 2 PDFs have poor OCR results with reasonably captured documents -- This is probably due to resolution issues. It may or may not be fixed by the new recognizer (the new recognizer is more robust to scale changes than the old one).
2. DECA-58: Export to PDF skips over pages that do not have detected characters -- This is part of the page segmentation and would need to be addressed in ocropus-binarize. The binarizer is now a standalone command line Python program that should be easier to modify.
3. DECA-211: Certain PNG/JPG files create colour inverted PDF -- There was some automatic logic in the old binarizer for detecting inverted pages. It should probably just be disabled.
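For DECA-211, here is a rough sketch of what automatic inversion logic of this kind tends to look like, and what disabling it would mean. This is not the actual binarizer code: the function name `maybe_invert`, the `auto_invert` switch, and the mean-brightness cutoff are all made up for illustration.

```python
def maybe_invert(gray, auto_invert=False, threshold=0.5):
    """Return the page, inverted only if the heuristic is on and fires.

    gray        -- list of rows of floats in [0, 1]; 0 is black, 1 is white
    auto_invert -- hypothetical switch; False disables the heuristic entirely
    threshold   -- hypothetical cutoff on mean brightness (an assumption)
    """
    if not auto_invert:
        # Heuristic disabled: always pass the page through unchanged.
        return gray

    pixels = [p for row in gray for p in row]
    if not pixels:
        return gray

    # A mostly dark page is guessed to be white-on-black and gets flipped.
    mean = sum(pixels) / len(pixels)
    if mean < threshold:
        return [[1.0 - p for p in row] for row in gray]
    return gray
```

A guess like this can misfire on pages that are dark for other reasons (photos, large dark regions), which is one way a normal page could come out inverted. With the switch off, the page is never flipped.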
Tom