German pre-1900 wordlist needed?

Andreas Romeyke

Nov 16, 2014, 5:32:54 AM
to tesser...@googlegroups.com
Dear tesseract developers,

From my project "Bunte Bilder aus dem Sachsenlande" (see the German blog articles at http://art1pirat.blogspot.de/search/label/tesseract), I have created a fully corrected wordlist for tesseract. The wordlist includes the words encoded with long s and should be used to train deu-frak. My wordlist is also part of the "pre1900" list of the German TeX project "Trennmuster" (http://projekte.dante.de/Trennmuster); its source could be used to produce very large wordlists for all German epochs from 1830 until now. See http://projekte.dante.de/Trennmuster/RepoHaupt for details.

I have also trained the OCR with long s, so please let me know if you want my training material.

Last but not least, thank you for your great work.

With best regards

Andreas

Janusz S. Bien

Nov 16, 2014, 8:37:40 AM
to tesser...@googlegroups.com
Quote/Cytat - Andreas Romeyke <art1...@googlemail.com> (Sun 16 Nov
2014 11:32:54 AM CET):
I am definitely interested, as I have just started to experiment with
recognizing German Fraktur passages in the 19th-century dictionary by
Linde (cf. http://wbl.klf.uw.edu.pl/75/ or the scans in the Bayerische
Staatsbibliothek).

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra
Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

Helmut Wollmersdorfer

Dec 8, 2014, 6:55:41 AM
to tesser...@googlegroups.com


On Sunday, 16 November 2014 at 11:32:54 UTC+1, Andreas Romeyke wrote:
Dear tesseract developers,

From my project "Bunte Bilder aus dem Sachsenlande" (see the German blog articles at http://art1pirat.blogspot.de/search/label/tesseract), I have created a fully corrected wordlist for tesseract. The wordlist includes the words encoded with long s and should be used to train deu-frak. My wordlist is also part of the "pre1900" list of the German TeX project "Trennmuster" (http://projekte.dante.de/Trennmuster); its source could be used to produce very large wordlists for all German epochs from 1830 until now. See http://projekte.dante.de/Trennmuster/RepoHaupt for details.

Yes, that would be interesting.

Please flag your wordlists appropriately with the classification of the orthographic "milestones", which should follow the history


For earlier texts it is hard to classify the orthography; each book can have its own.

