Latin language


Guido Milanese

Nov 20, 2014, 6:40:39 PM
to tesser...@googlegroups.com
I am a regular user of tesseract, and it's an essential tool for my daily work, so thank you before anything else. The support for Ancient Greek is simply superb; it works like a charm. I could not find any support for Latin, though -- I mean the Latin language, not the Latin alphabet. Is there any project for this?

Thank you very much for your kind attention.
guido, italy

Helmut Wollmersdorfer

Nov 21, 2014, 6:44:46 AM
to tesser...@googlegroups.com
This would be useful for me too, because of old scientific (zoological, botanical) texts, which mostly contain Latin and Greek alongside the native language.

Do you have a good Latin dictionary for training?

Helmut Wollmersdorfer

Ryan Baumann

Nov 21, 2014, 4:12:17 PM
to tesser...@googlegroups.com
Coincidentally, I recently began looking into this for my own use. I decided the easiest course would probably be to adapt the excellent, open work done by Nick White for Ancient Greek. Unfortunately I'm not very far along yet, as one of the first steps is making sure I can correctly replicate the existing process for Ancient Greek on my own machine (the mftraining step in the grc repository seems to be taking quite some time).

You can find my work-in-progress here: https://github.com/ryanfb/latinocr-lattraining

Right now it should just build the following files for you (from the same Perseus sources):

- training_text.txt
- lat.word.txt
- lat.freq.txt
- lat.unicharambigs
- lat.wordlist

Note that this is very preliminary: so far I've only made trivial alterations so that I can start figuring out what I need to clean up in the input/processing.

Note also that there's a modification here in tools/wordlistfromperseus.sh to strip <foreign> tags instead of skipping files with foreign words altogether. I think this would help Ancient Greek as well (though how much it will improve or alter overall accuracy I don't know). For Greek, this change results in the wordlist being 7202347 lines for me instead of 5605967, or a 28% increase in the size of the corpus. I originally did this with Saxon/XSLT, but the processing was slow, so I switched to using Perl so I could apply a non-greedy regex substitution instead (which is much faster): https://github.com/ryanfb/ancientgreekocr-grctraining/commit/069648af2e2b45e41fd7e4ff4390343b45765f77
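A minimal sketch of the kind of non-greedy substitution described above, written in Python rather than Perl. It is not the code from the linked commit. It assumes that "stripping" means dropping each <foreign> element together with its contents, and the function name and sample text are made up for illustration.

```python
import re

# Non-greedy ".*?" stops at the first closing tag, so two <foreign> spans
# in the same text are removed separately and the words between them survive.
# A greedy ".*" would swallow everything from the first opening tag to the
# last closing tag.
FOREIGN = re.compile(r"<foreign\b[^>]*>.*?</foreign>", re.DOTALL)


def strip_foreign(xml_text: str) -> str:
    """Remove every <foreign>...</foreign> element (assumed behaviour)."""
    return FOREIGN.sub("", xml_text)


# Hypothetical sample, not taken from the Perseus sources.
sample = (
    'arma <foreign lang="greek">μῆνιν</foreign> virumque '
    '<foreign lang="greek">ἄειδε</foreign> cano'
)

# Collapse the leftover double spaces before printing.
print(" ".join(strip_foreign(sample).split()))  # arma virumque cano
```

If the intent is instead to keep the foreign words and drop only the markup, the pattern would match just the opening and closing tags.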

-Ryan 

Guido Milanese

Nov 22, 2014, 4:15:12 AM
to tesser...@googlegroups.com
Thank you for your very promising answer. Would you please tell me/us how to co-operate in your project?

Best wishes,
guido milanese

Ryan Baumann

Nov 24, 2014, 11:16:01 AM
to tesser...@googlegroups.com
Pull requests or patches are more than welcome, as I'm just getting familiar with the Tesseract training process myself. I've just pushed a few changes to get possibly-better output for the training_text and word/frequency files, but incorporating Latin-specific changes for unicharambigs may be something where someone with more domain-specific knowledge of both Latin and Tesseract will be able to do a better job than me. Due to the upcoming US holidays, I probably won't be able to do much more work on it this week.

Best,
-Ryan

Ryan Baumann

Dec 16, 2014, 4:47:41 PM
to tesser...@googlegroups.com
I've resumed working on this a bit this week, but the bottleneck of the mftraining process really makes the feedback loop of tweak/train/test/repeat quite slow.


I've incorporated the Latin from Bruce Robertson's Greek/Latin spellcheck dictionary in his "rigaudon" OCR repository (https://github.com/brobertson/rigaudon/), a process that might also be portable back to Ancient Greek (though not for frequency, as the Greek lacks frequency data). Right now I'm tweaking the process to try to see what works and what doesn't for various ligatures and the notoriously tricky long s. I've also updated the repos with conditional runtime code for running on a Mac, so that I won't have to spend as much time doing complicated branch management.

Also, if there are any particular (open/free) fonts that you think would be helpful with training for texts typically printed in Latin, I would love to hear about them so I can incorporate them into the training process. I've added Cardo (a free Bembo-style font with wide coverage) and some Fell fonts I came across (http://iginomarini.com/fell/the-revival-fonts/), as well as retaining some of the GFS fonts. Right now I'm not training on bold/italic variants until I'm pretty confident I've ironed out any other issues with the training process. I've also pulled macrons out of allchars.txt for the same reason, figuring I can add them back in later while leaving them on the tessedit_char_blacklist.
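For readers unfamiliar with tessedit_char_blacklist, here is a rough sketch of how such a blacklist could be passed at recognition time. It only builds the command line and does not run it. It assumes a Tesseract build that accepts `-c configvar=value`; the language code "lat", the file names, and the particular set of macron vowels are illustrative assumptions, not taken from the repository.

```python
# Assumption: an illustrative set of macron vowels; the real list would be
# whatever was pulled out of allchars.txt.
MACRON_VOWELS = "āēīōūĀĒĪŌŪ"


def tesseract_args(image, out_base, lang="lat", blacklist=MACRON_VOWELS):
    """Build an argument list for the tesseract CLI with a character blacklist.

    "lat" is a hypothetical name for the Latin traineddata discussed here.
    """
    return [
        "tesseract", image, out_base,
        "-l", lang,
        "-c", f"tessedit_char_blacklist={blacklist}",
    ]


# Hypothetical file names; pass the list to subprocess.run() to execute it.
print(tesseract_args("page.png", "page"))
```

Keeping the characters in the training set but on the blacklist means they can be re-enabled later by changing only the config variable, without retraining.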

-Ryan