German language models

Christian Pietsch

Nov 15, 2012, 4:08:53 AM
to ocr...@googlegroups.com
I would like to OCR scans of documents written in German. Surely someone has already created Ocropus models for German, but during installation, only English models are downloaded. Please let me know about models that are free to use at a German university.

Alternatively, is it still possible to use Ocropus as a preprocessor for Tesseract? The --tesslanguage option seems to have gone.

Thank you!
Christian

Tom

Nov 20, 2012, 2:06:05 AM
to ocr...@googlegroups.com
OCRopus 0.6 doesn't have any models for German.  But the upcoming version of OCRopus (0.7) has good support for German, as well as German Fraktur.

If you want to use Tesseract, the easiest way may be to download pytess from http://code.google.com/p/pytess

It includes a drop-in replacement for the regular OCRopus line recognizers; this uses the Tesseract line recognition mode:

$ tess-lines 'book/*/*.bin.png' -q
$ ocropus-econf 'book/*/*.gt.txt'
errors         322
missing          0
total        23542
err          1.368 %
errnomiss    1.368 %
0.0136776824399
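
For reference, the final number in the output above appears to be simply errors divided by total (322 / 23542). A minimal Python sketch of that arithmetic, assuming this is how the percentage is derived; the helper name error_rate is made up for illustration and is not part of OCRopus:

```python
def error_rate(errors: int, total: int) -> float:
    """Return errors / total as a fraction (hypothetical helper, not an OCRopus API)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return errors / total


# Figures taken from the ocropus-econf output quoted above.
rate = error_rate(322, 23542)
print("err %.3f %%" % (100 * rate))
```

With the figures above this reproduces the reported 1.368 %.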

Tom

Sriranga(78yrsold)

Nov 20, 2012, 2:37:47 AM
to ocr...@googlegroups.com
Tom,
I am curious to know whether "pytess" supports the Kannada script (UTF-8).
If so, I would like to test it and report back to you.
With Warmest Regards,
-sriranga(79yrs)

--
You received this message because you are subscribed to the Google Groups "ocropus" group.
To post to this group, send email to ocr...@googlegroups.com.
To unsubscribe from this group, send email to ocropus+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msg/ocropus/-/tnUZSnoKaEgJ.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Sriranga(78yrsold)

Nov 20, 2012, 2:50:25 AM
to ocr...@googlegroups.com
Tom,
I trust the upcoming version of OCRopus (0.7) will also support the Kannada script (UTF-8)?

I am curious to know when the next version, OCRopus 0.7, is likely to be uploaded.
With warmest Regards,
-sriranga(79yrs)


Sriranga(78yrsold)

Nov 20, 2012, 8:08:10 AM
to ocr...@googlegroups.com
Tom,
dell64@ubuntu:~/pytess$ ./tess-lines ocropus/kannada-boxes/page*/*.bin.png -q
Traceback (most recent call last):
  File "./tess-lines", line 5, in <module>
    import tess
ImportError: No module named tess
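
The ImportError above means Python could not find a module named tess on its search path. One possible cause, which is only an assumption here, is that the pytess extension was not built or installed. A small sketch for checking this, written for a current Python 3 rather than the Python 2 in use at the time; module_available is an illustrative name, not part of pytess:

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if a top-level module `name` can be found on the import path."""
    return importlib.util.find_spec(name) is not None


if not module_available("tess"):
    # Show where Python looked, to help spot a missing install directory.
    print("tess is not importable; directories searched:")
    for entry in sys.path:
        print("  ", entry)
```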

Regarding ocropus-econf: this command is not available in the ocropus/ocropy folder. I am bringing this to your notice; please correct me if I have made a mistake.
With regards,
-sriranga(79yrs)



Tom

Nov 20, 2012, 9:02:34 AM
to ocr...@googlegroups.com
It's in the repository tip, not the release.

Tom

Sriranga(78yrsold)

Nov 20, 2012, 11:31:57 AM
to ocr...@googlegroups.com
Tom,
I could not follow. Kindly guide me.
with regards,
-sriranga(79yrs)


Brad Hards

Nov 20, 2012, 3:17:19 PM
to ocr...@googlegroups.com, Sriranga(78yrsold)
On Wednesday 21 November 2012 03:31:57 Sriranga(78yrsold) wrote:
> Tom,
> i could not follow. Kindly guide me
> with regards,
> -sriranga(79yrs)
>
> On Tue, Nov 20, 2012 at 7:32 PM, Tom <tmb...@gmail.com> wrote:
> > It's in the repository tip, not the release.
I think Tom is trying to say that the functionality you are trying to use
is not in a release version, and that you would need to use the development
version (from the source code repository, not a zip file) if you'd like it
before the next release.

The tool to fetch from the repository is called "Mercurial". There are
step-by-step instructions at http://code.google.com/p/ocropus/, although it
looks like they are a little out of date (since that checkout should probably
now be ocropus-0.7 instead of ocropus-0.6). That shouldn't actually affect the
code, though (it's just a name).

Brad

Brad Hards

Nov 21, 2012, 3:00:26 PM
to Sriranga(78yrsold), ocr...@googlegroups.com
On Thursday 22 November 2012 00:04:36 Sriranga(78yrsold) wrote:
> Brad,
> ** *** ** **
> page/0100cd.bin.png =RAW= ಪಂಚಿ೦ಂಗಚ4್. ಇವ ್ ಎ:್ರ ನಂ0ಿ್ಯ ವ್ೋೂಎ ಝೋಲಂಘ್4ಸ.
> ಖ೭ೋ'; "ಧ೦ವ ್ ಂ೭1ಔ 4್ಡಘ0ಚಚ 0೦ಿ೧ಯತವಎ ಟ/ಹಧಣರಎ ಈ ಬಿ! ಅವರನ
> page/0100ce.bin.png =RAW= ವ್ಧಧಕಎ೦ಡಚ4್.
> page/0100cf.bin.png =RAW= ಇ೦ತಹ ಅಪರ೦ಝಯ ಷ0೦ಝ ನಂನಎ ಒಧೃ ಶಂ೩ಡಹಂ6ಯೋ? ಇ೦ಿ!೦ಬ'
> ಚಿಿಂಣದ ಸ್';ಪೋ ಇ:್ಡದಿಗಓ?ೋ ನಂನ ್ ಕಎೋ';ತಚ "ಧ9ಉ ಕ್4್ಷಣಡಚ.
> page/0100d0.bin.png =RAW= ಆಈ೭ೋಗಿದ ದ್್ುಯೋ2 ಅದರ ಅವಶಿಕಝ ಇ9 ನ9 ಝ೦ಡ"ಧಿ" ಝೋಚ1ತ‘
> ಬ೦ಡಚ.
> page/0100d1.bin.png =RAW= ಂಲ4) ವಷೄ ನ9 ರೋಧಂತನಂಗಚ ಧಂಝೋಮೋಧರ ಈಶರ೨ೄ, ಟಂಗಲಕಎೋಾಯೋಧ
> ನ೦ಡೋಮರರ ಶಷಿನಂಗದ್ದನ ರ್ರ್;ಧಿಧಕಎ೦ಘ್, ಅವರ ಸ೦ಸಔ ್ತ ಧಂೃ1ಯ
> page/0100d2.bin.png =RAW= ಇ೦ಧ್್ನ ಷಣದ6ಗಳನ ಆಿಬಗ ಕ0ದಂಹ6ಸ್ಚ0ಚ4್.
> page/0100d3.bin.png =RAW= ಸದಂೄೆ ' ಒ:ಎೋ4್ಗಳ೦ಔ ? ಇಫಿ೭ ಅಧೄ ಸತಿ, ಅಧೄ ಂ್ಥಿ ಆಗರಉ
> ಸಂಧಿ. ಆದಝ ಇ೦ತಹ ಜ:ಎೋ4್ಗಳನ ಂೋ'; ಆನ೦ಡಸದವಝೋ ಇ:್. ಅವರೋಧ ನಂನಎ ಒಧೃ.
> page/0100d4.bin.png =RAW= ಒವಧಧಿಾ4್ಗಳ ಮರ್ಯ ಟಂಗಲನ ಎರಷ ್ ಷ್ಎ4 ್ ಟಂ6 ತೃಧದರಎ
> ಒಳಗ೨೦ದ ೇ೦ದಪ ಟಂರಡದ್ದ6೦ದ, ಗೃಧಿಂಗ ಮಖಧವಧಧಟಂಗಉ ತೃಧೆಸಿೆ
> page/0100d5.bin.png =RAW= ಎ೦ವ ್ ವಎೀಡಯೋ2 ಈಶರ೨ೄ ಕಎಗದರ೦ಝ. 9:ು ಸಮಯದ ನ೦ತರ ಒಳಗ೨೦ದ
> 'ಪೋಎು. ಪೋಎು. ಬಿಂಿಮಧ ಪಂಿ೦ಝ೦ಿ", ಬ ಓಪ೨ೄ ಷಿ/ಣ, ಎ೦ಬ
> page/0100d6.bin.png =RAW= ಡ ಬ೦ತ೦ಝ.
> page/0100d7.bin.png =RAW= ಸದಂ ಕೀಪ೦ಂ ಧ್ಷಣಉಂನೋಧಧ್ಚಚ ಫಿ)ಝಸೆ ಒಮಧವಧಧಘನದರಎ ಪಂಿ೦ಟ'
> ಹಂ&ಖ೭ಳಧ4)ದನ ಇ೦ಿ!೦ಡನೋಧ4್ವಂಗಓ?ೋ ಕ0"ಧಚಧ ್ ಎ೦ಒ೨ವ ್ ಈಶವರ೨ೄ
> page/0100d8.bin.png =RAW= ವಂದ. ಅ೦ದಝ ಪಧಮ ಸ೦ಸಔ ್"ಧಯ೭ಡಪ ಬಹಳವೋ.್ಡಚರಎ ೇ:ುವಂದರಎ
> ಖ೭೦ದಂ;್ಂ ಷಣಡಕಎ೦ಡಚ4 ್ ಎ೦ಒ೨ದಝಔ ಇವ ್ ಕ0ದಂಹರಧ್ಘಣೇಿಸ.
> page/0100d9.bin.png =RAW= ಇವ ್ ಏರ್ೋ ಇರ0, ಲ೦ಡ೨ೄ ಂಸಂದಂಿಲಯದ ;ಧಎಬ' .ಡ.ಯ೭ಡರ್
> ಬ೦ಡಚ ನ೦ಡೋಮರ4 ್ ಇ೦ಧಎೄ ಏಂಪಯ:ಪಧ ಬಝಯಬ:್ವ೦ಂಗಚಝ೦ಒ೨ದಝಔ ಅವ್ಹಂಿ೦ಬ'
> page/0100da.bin.png =RAW= ಒ೨ಔ' ಆ~ಂೋರಝಿಂಜ೦ ' ಸಂ್ಿಂೇಿಸ. ಇವ ್ ಅವರ ಮಹಂ ಚಿಬ೦ಧದ
> ಒ೦ವ ್ ವಎೀ ಮಣಚ4್.
> page/0100db.bin.png =RAW= ಇವರನ ೧ೃುಝ, ಇ೦ಿ!೦ಡನೋಧ ಅಏಂಿಸ ಷಣಡದವರೋಧ ಎದ್ ಕಂಣಚವ
> ಕ0ಕರ ಕನಂೄಟಕದ ಇಧೃಧ ್ ವಿ5ಗ~೦ದಝ, ಆಸ ಎೇ(ೄೋಧ ಓಡದ ಂ.ಂ. ಿಣೋಕಂಕ4 ್ ಮಸ
> page/0100dc.bin.png =RAW= ಂೋ೦~9(ೋಧ ಓಡದ ಡ.ಿಧ ಪಂವಾಯವ4್.
> page/0100dd.bin.png =RAW= ಇವ6ಧೃರ ಕಂಓ?ೋಉ ಡನಗಳ ಬಿ! "ಧ';ದವರಂವಎ ನನಿ್
> ಪ6ಚಯವಂಗರ0:್. ಕಂರಣ ಆ ಬಿ! ನನಿ" ಝಷಿ ಿಣಚ:್. ಓ?ೋಖಕ೦ಂಗ ಿಣೋಕಂಔ',
> page/0100de.bin.png =RAW= ಆಡ';ತಿಬರರಂಗ ಪಂವಾ ಝೋಗಚಝ೦ಒ೨ದನ ನಎೋಷ್ವ ಅವಕಂಶ ಷಣಚ ಿ
> ನನಿ್ ವಎರ&ಸ. ಿಣೋಕಂಕರ ಶೀವಗೄ ಅಫಂರ. ಪಂವಾಯವರ ಅಭಷಣ೨ ವಗೄ
> page/0100df.bin.png =RAW= ಅದ&ಔ೦ತ ಝಧಧನವ್.
> page/0100e0.bin.png =RAW= ಾಫಎೋಏ;ನೋಧ ಇವ6ಧೃವಎ ವಎದಲ ದಒ?ೄಯೋಧ ಕ0ಚೋಣೄರಂಗಚ4್. ೇ:ು
> ಷ0ು ವಂತಂವರಣದೋಧಯೋ ಡನಗಳನ ಕ~ಡಚ ಇವ6ಿ್ ಸ೦~ದಂಯಝಔ
> page/0100e1.bin.png =RAW= ಅ೦ಝಕಎ೦ಝೋ ಝೋವನ ಕ~ಘ0ವ ಅವಶಿಕಝ ಇರ0:್ಪೋರಣೋ. ಆದಝ ನಂನ ್
> ಕ೦ಡಂಗ ಇವ6ಧೃವಎ ಮರ'; ಸ೦ೀಿದಂಯಝಝಿ್ ವಂಉ4ಚ4್.
> page/0100e2.bin.png =RAW= ಪಂವಾಯವ4 ್ ಆಿ೧ಗ ಂಭಎ"ಧ ಧ6ಿಧಯೋ ಸಒ?ಗ';ಿ್ ಬ4್ಚಚ4್.
> ಿಣೋಕಂಕ4 ್ ಪಂಎ್ಟಂಟಂ ಆೀಮ ಪೋ694್. ಪಂವಾಯವರ ಬಿ! ನಮಿ್ ಆಗ "ಧ';ಡಚ
> page/0100e3.bin.png =RAW= ಆದಝ ಅಥೄವಂಗದ ಂಝೋಷಝ ಎ೦ದಝ ಅವರ ಝಸ6ನಎಡರ್ ಸಂಷಣನಿವಂಗ
> ಪೋ6ಸಲಂಿಾಚ9 'ಿಂೄ೦ಿ:ೆ ' ಪದ.
> page/0100e4.bin.png =RAW= ಅ೦ಡನ ಕ0ಚರ ಕನಂೄಟಕದ ಸಷಣಜಝಔ ಈ ಪದವಣ ಪ6ಚಎ್ಿಧಚೋ ಅವ4್.
> ಇಝೋವ ್ ೧4್ವಎೋ, ಡಿಯ೭ೋ, ಎ೦ಒ೨ವ ್ ಬಷ್ಝೋಕ6ಿ್ ಿಣ4ರ0:್, ಈಗೂಎ
> page/0100e5.bin.png =RAW= ಿಣ4:್. ಅದನ ೇತ್ ಪಂವಾಯವಝೋ ಬಳಿಧದ ಬಿ! ನನಿ್ ರ್ನ;ಧ:್.
> page/0100e6.bin.png =RAW= (ಷ0೦ವ್ವಝಘ04)ವ್)
> page/0100e7.bin.png =RAW= *** FAILED (no bestpath) ***
> + set +x
>
> ================================================================
> === You now have a simple Fraktur model, boxdata.cmodel.
> ===
> === This is only an initial model. It isn't using any baseline
> === information. The next training step consists of retraining
> === the model by aligning text lines with ground truth (see the
> === example in uw3-500).
> ===
> === In addition, you probably should construct a language model.
> === You can do that with ocropus-ngraphs.
> ================================================================
> dell64@ubuntu:~/ocropus/kan-boxes$
>
> How should I proceed with ocropus-ngraphs? Kindly let me know which command
> line to use. (I am not a programmer/developer.)
I do not know. Perhaps this discussion would be better posted to the mailing
list? I have included it on the address list.

Brad

zdenko podobny

Nov 20, 2012, 2:38:33 PM
to ocr...@googlegroups.com
On Tue, Nov 20, 2012 at 8:06 AM, Tom <tmb...@gmail.com> wrote:
> OCRopus 0.6 doesn't have any models for German. But the upcoming version of OCRopus (0.7) has good support for German, as well as German Fraktur.
>
> If you want to use Tesseract, the easiest way may be to download pytess from http://code.google.com/p/pytess

Another Python SWIG-based wrapper ;-)?
Please find attached an improvement to the build/installation process: the Tesseract header files could be located in /usr/local/include... IMO this is a more universal approach...

--
Zdenko

Tom

Nov 25, 2012, 8:30:15 AM
to ocr...@googlegroups.com
> > If you want to use Tesseract, the easiest way may be to download pytess from http://code.google.com/p/pytess
>
> Another Python SWIG-based wrapper ;-)?

Yeah, the existing ones didn't work and/or were very complex.

> Please find attached an improvement to the build/installation process: the Tesseract header files could be located in /usr/local/include... IMO this is a more universal approach...

I didn't see anything attached; maybe you can clone, patch, and let me know.

Tom

zdenko podobny

Nov 25, 2012, 8:43:26 AM
to ocr...@googlegroups.com
It looks like I forgot to attach it. Here it is. Basically, using #include <tesseract/baseapi.h> helps if you do not want to care where Tesseract is installed (/usr, /usr/local, /opt...).
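
To see which of those prefixes actually holds the header on a given machine, a quick check along these lines may help. This is only a sketch: the prefix list is taken from the message above, find_tesseract_headers is an illustrative name, and it is not part of pytess or the attached patch.

```python
from pathlib import Path

# Candidate install prefixes mentioned in the message above; adjust for your system.
PREFIXES = ["/usr", "/usr/local", "/opt"]


def find_tesseract_headers(prefixes=PREFIXES):
    """Return the include directories that contain tesseract/baseapi.h."""
    found = []
    for prefix in prefixes:
        include_dir = Path(prefix) / "include"
        if (include_dir / "tesseract" / "baseapi.h").is_file():
            found.append(str(include_dir))
    return found


print(find_tesseract_headers())
```

Any directory it reports is one the compiler would need on its include path for #include <tesseract/baseapi.h> to resolve.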

--
Zdenko
improve_build.patch

Tom Morris

Nov 25, 2012, 10:34:24 AM
to ocr...@googlegroups.com
This sounds very similar to the bug report & patch that I submitted 5 days ago.

There's another bug report about the built library not having the execute bit set.

Tom