OCRopus 0.7 released


Tom

Apr 10, 2013, 1:23:39 AM
to ocr...@googlegroups.com
OCRopus 0.7 has been released. Please see the ocropus.org home page for more information.

Features:

- a new text line recognizer based on recurrent neural networks (which does not require language modeling)

- models for both Latin script and Fraktur

- new tools for ground truth labeling

Text line recognition error rates are much lower than with previous systems, in many cases beating commercial systems in benchmarks on standard databases. Training for new scripts has also been simplified greatly.

Tom

Sriranga(78yrsold)

Apr 11, 2013, 5:06:44 AM
to ocr...@googlegroups.com
Kindly confirm whether the new release, OCRopus 0.7, will support the Kannada language (UTF-8). For this purpose I also installed Ubuntu 12.10, as suggested for the new release, because the last version, OCRopus 0.6, did not work well for Kannada when tested.
-sriranga(79yrs)



Tom

Apr 11, 2013, 11:36:57 PM
to ocr...@googlegroups.com
You should be able to train it on Kannada in principle; all the string handling is in terms of Unicode/UTF-8 (and is very simple anyway in the new version).

However, given that Kannada is a fairly complex script, you may run into some script-related issues (e.g., with how diacritics are encoded). You may have to modify the default Unicode representation slightly, break up ligatures, etc. The only way to know is to give it a try.
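One concrete instance of the encoding question Tom raises: Kannada's vowel sign II (U+0CC0) is a single codepoint in NFC but decomposes into two combining marks under NFD, so which normalization form the ground truth uses changes the codepoint sequence the recognizer sees. A standard-library sketch (my own illustration, not code from the thread):

```python
import unicodedata

# "kI": KANNADA LETTER KA (U+0C95) + KANNADA VOWEL SIGN II (U+0CC0)
text = "\u0c95\u0cc0"

# NFD splits the vowel sign II into two combining marks,
# U+0CBF + U+0CD5; NFC puts them back together.
nfd = unicodedata.normalize("NFD", text)
nfc = unicodedata.normalize("NFC", nfd)

print(len(text), len(nfd))  # same glyphs, different codepoint counts
```

Training and recognition text would need to agree on one of these forms, or the model sees (and emits) inconsistent label sequences.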

Tom

79yrsold

Apr 22, 2013, 9:10:54 AM
to ocr...@googlegroups.com
Please see my inline comments below.
With regards, sriranga(79yrs)


On Friday, April 12, 2013 9:06:57 AM UTC+5:30, Tom wrote:
You should be able to train it on Kannada in principle; all the string handling is in terms of Unicode/UTF-8 (and is very simple anyway in the new version).
So far I have not been able to generate Kannada output, even for the simple test script ./1run-test as an example. I find that none of the .py programs in ocropy declare a UTF-8 encoding.
Kindly elaborate a little on how it "is very simple anyway in the new version", to enable me to test further.

However, given that Kannada is a fairly complex script, you may run into some script-related issues (e.g., with how diacritics are encoded). You may have to modify the default Unicode representation slightly, break up ligatures, etc. The only way to know is to give it a try.
I agree with your logic. But first I should at least be able to view the output in Kannada script, irrespective of modified representations, broken-up ligatures, etc.
Tom

79yrsold

Apr 22, 2013, 10:00:02 AM
to ocr...@googlegroups.com

I searched for the Latin-script and Fraktur models but could not find them; neither was installed on my computer. I remember installing and testing the Fraktur model with OCRopus 0.6. Where are the Latin and Fraktur models?

79yrsold

Apr 26, 2013, 7:52:46 AM
to ocr...@googlegroups.com

I am still searching for the Latin and Fraktur models. Please guide me to where they are.
I have just completed the upgrade from Ubuntu 12.04 to Ubuntu 13.04, and it is working fine. I also tested by running the script ./run-test; the output was fine. I hope Ubuntu 13.04 will support OCRopus 0.7 as well.
Thanks for the help.

ar

Apr 26, 2013, 10:09:56 AM
to ocr...@googlegroups.com
Hello Tom,

I checked the new OCRopus version, and the recognition results are fine; see http://art1pirat.blogspot.de/2013/04/ocropus-07-erste-tests.html (in German). But I have two questions.

First, if I remember correctly, previous versions of OCRopus were able to use Tesseract 3.02 training files. Is it possible to train OCRopus 0.7 with these files, too?

Second, the Fraktur example does not support the long s ('ſ'); therefore word pairs like 'Wachstube' vs. 'Wachſtube' could be problematic in historical texts.

In my personal project I am digitizing a 20th-century book set in Fraktur, so I could send you some fully corrected word lists and pages.

Please send me a note if this could be helpful.

With best regards


Andreas


Brad Hards

Apr 26, 2013, 8:40:19 PM
to ocr...@googlegroups.com, 79yrsold
On Friday 26 April 2013 21:52:46 79yrsold wrote:
> I am searching for the Latin and Fraktur models. Please guide me to
> where they are.
There are special instructions for downloading the models in the install
instructions (at http://code.google.com/p/ocropus/).

Note the step that says:
python setup.py download_models

I think you may already have this if the runtest step worked OK for you, but
if not, be prepared for a fairly long download (it took about an hour for me
IIRC).

Hope this helps

Brad

Tom

May 1, 2013, 9:01:47 PM
to ocr...@googlegroups.com
First, if I remember correctly, previous versions of OCRopus were able to use Tesseract 3.02 training files. Is it possible to train OCRopus 0.7 with these files, too?

OCRopus 0.7 doesn't need to be trained with individual characters, so you don't really need the Tesseract training files. But you should be able to use the scans that those files were derived from easily.
 
Second, the Fraktur example does not support the long s ('ſ'); therefore word pairs like 'Wachstube' vs. 'Wachſtube' could be problematic in historical texts.

It should support long-s, but it doesn't encode it separately in the output.
 

Because in my personal project I digitize a book from 20th century with fraktur I could send you some full corrected wordlists and pages.

I would recommend just recognizing it with the default Fraktur model and choosing long/short s based on context; there are very few cases where the choice can't be made programmatically, and you should be able to find those with a simple script.
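The "simple script" Tom mentions might look like the sketch below (my own illustration, not code from the thread). It applies the classic convention that round 's' appears only at the end of a word, and long 'ſ' everywhere else. It deliberately ignores syllable boundaries inside compounds, so it turns every 'Wachstube' into 'Wachſtube' (correct for Wach-ſtube, "guard room", wrong for Wachs-tube, "tube of wax"); those rare ambiguous compounds are exactly the cases a wordlist would have to decide.

```python
import re

def restore_long_s(text):
    """Naive Fraktur rule: round 's' only at the end of a word;
    any lowercase 's' followed by another word character becomes 'ſ'.
    Ignores syllable boundaries inside compounds (see lead-in)."""
    return re.sub(r"s(?=\w)", "ſ", text)

print(restore_long_s("das Wasser"))  # das Waſſer
print(restore_long_s("Wachstube"))   # Wachſtube
```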
 
Tom

Andreas Romeyke

May 2, 2013, 7:05:15 AM
to ocr...@googlegroups.com
Hello Tom,

Thanks for your answer.

OCRopus 0.7 doesn't need to be trained with individual characters, so you don't really need the Tesseract training files. But you should be able to use the scans that those files were derived from easily.

Hmm, not really, because my Tesseract training pages are not split up into single lines. Or could I train OCRopus with a whole page and its corresponding text? The thing is, I would like to use one set of training pages, without specific modifications, for both Tesseract and OCRopus.
 
Second, the Fraktur example does not support the long s ('ſ'); therefore word pairs like 'Wachstube' vs. 'Wachſtube' could be problematic in historical texts.

It should support long-s, but it doesn't encode it separately in the output.

That is a problem. I need the long s encoded correctly: I want to preserve the character 'ſ' in the output, not have it substituted with 's'. The same goes for »«, „“ and so on. But that should not be a problem if I train my own models, right?
 
 

Tom

May 15, 2013, 5:39:55 AM
to ocr...@googlegroups.com, art1...@googlemail.com
On Thursday, May 2, 2013 1:05:15 PM UTC+2, Andreas Romeyke wrote:
Hello Tom,

Thanks for your answer.

OCRopus 0.7 doesn't need to be trained with individual characters, so you don't really need the Tesseract training files. But you should be able to use the scans that those files were derived from easily.

Hmm, not really, because my Tesseract training pages are not split up into single lines. Or could I train OCRopus with a whole page and its corresponding text? The thing is, I would like to use one set of training pages, without specific modifications, for both Tesseract and OCRopus.

The basic training for OCRopus is text lines and corresponding transcriptions. 
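That line-plus-transcription pairing can be sketched concretely. The helper below assumes the ocropy directory convention of binarized line images (book/<page>/<line>.bin.png) with a sibling <line>.gt.txt transcription; the exact filenames are an assumption to check against your version's documentation.

```python
import glob
import os
import tempfile

def collect_training_pairs(book_dir):
    """Pair each binarized line image with its transcription file,
    assuming the ocropy layout of <line>.bin.png next to <line>.gt.txt
    (an assumption; check your ocropy version's docs).  Returns
    (pairs, missing) so untranscribed lines are easy to spot."""
    pairs, missing = [], []
    for img in sorted(glob.glob(os.path.join(book_dir, "*", "*.bin.png"))):
        gt = img[: -len(".bin.png")] + ".gt.txt"
        (pairs if os.path.exists(gt) else missing).append((img, gt))
    return pairs, missing

# tiny demo: one transcribed line, one still missing its .gt.txt
book = tempfile.mkdtemp()
os.makedirs(os.path.join(book, "0001"))
for name in ("010001.bin.png", "010001.gt.txt", "010002.bin.png"):
    open(os.path.join(book, "0001", name), "w").close()

pairs, missing = collect_training_pairs(book)
print(len(pairs), len(missing))
```

So whole Tesseract training pages would first have to be segmented into lines (and the page-level text split to match) before they could feed this kind of training.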

 
  It should support long-s, but it doesn't encode it separately in the output.

That is a problem. I need the long s encoded correctly: I want to preserve the character 'ſ' in the output, not have it substituted with 's'. The same goes for »«, „“ and so on. But that should not be a problem if I train my own models, right?

Yes, you can train your own models, but you need to generate ground truth containing that information. We don't usually do that because different sources treat these cases differently, so to maximize the usable training data we just use the lowest-common-denominator text normalization.
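Such a lowest-common-denominator normalization might look like the sketch below. The mapping is my own assumed example of collapsing the variants mentioned in this thread (long s, guillemets, German quotes), not the project's actual normalization table.

```python
# Assumed example mapping: collapse typographic variants that different
# transcription sources disagree on into plain ASCII equivalents.
NORMALIZE = str.maketrans({
    "ſ": "s",                        # long s -> round s
    "»": '"', "«": '"',              # guillemets -> ASCII double quote
    "„": '"', "“": '"', "”": '"',    # German/curly quotes -> ASCII
    "’": "'", "‘": "'",
})

def normalize_gt(line):
    """Apply the mapping to one line of ground-truth text."""
    return line.translate(NORMALIZE)

print(normalize_gt("Wachſtube"))  # Wachstube
print(normalize_gt("»ja«"))       # "ja"
```

Keeping 'ſ' in the output then simply means training on ground truth that skips the first entry of such a table.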

Tom 