tesseract-ocr-ell, tesseract-ocr-grc: improvements

Dimitr
Jul 31, 2017, 12:28:25 PM
to tesseract-ocr
At http://www.elspell.gr/myspell there is the OpenOffice Greek Dictionary v0.9, with 800,000 Greek words encoded in windows-1253, under the MPL 1.1/GPL 2.0/LGPL 2.1 licenses.
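
To use such a dictionary as a Tesseract wordlist it would first have to be converted to UTF-8. Here is a minimal sketch of that step. It assumes the file follows the usual MySpell .dic layout (a word count on the first line, then one word per line with optional flags after a "/"); I haven't checked that against this particular dictionary, and the function name is only for illustration.

```python
def words_from_dic(raw: bytes) -> list[str]:
    """Decode windows-1253 bytes of a MySpell-style .dic file into plain words."""
    # cp1253 is Python's name for the windows-1253 code page.
    lines = raw.decode("cp1253").splitlines()

    # Assumption: the first line is the word count, as in MySpell .dic files.
    if lines and lines[0].strip().isdigit():
        lines = lines[1:]

    words = []
    for line in lines:
        # Drop affix flags such as "/AB" that may follow the word.
        word = line.split("/", 1)[0].strip()
        if word:
            words.append(word)
    return words


if __name__ == "__main__":
    sample = "2\nκαλημέρα/AB\nκόσμος\n".encode("cp1253")
    # Print one word per line, the way a plain wordlist would look.
    print("\n".join(words_from_dic(sample)))
```

Writing the result out with encoding="utf-8" then gives a plain UTF-8 wordlist.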

Polytonic characters have not been used since 1982, and we don't have wordlists for them.

Only sources like the Bible contain polytonic words, but those don't belong to Modern Greek.

The maintainer of tesseract-ocr-grc uses a wordlist based on Ancient Greek polytonic texts.

The Greek polytonic Unicode characters U+1F00 to U+1FFC aren't useful in the tesseract-ocr-ell package, and they may confuse OCR recognition.

Conversely, tesseract-ocr-grc must include the polytonic characters and not the monotonic Greek characters U+0386 to U+03CE.
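
As a rough illustration of the split described above, a wordlist could be separated by checking each word for characters in the U+1F00 to U+1FFC range. This is only a sketch: the function names are made up for the example, and it assumes the words use precomposed characters (combining diacritics would not be detected).

```python
# Polytonic range as given above: U+1F00 to U+1FFC inclusive.
POLYTONIC = range(0x1F00, 0x1FFC + 1)


def has_polytonic(word: str) -> bool:
    """True if any character of the word lies in the polytonic range."""
    return any(ord(ch) in POLYTONIC for ch in word)


def split_wordlist(words):
    """Return (monotonic, polytonic) lists, preserving input order."""
    monotonic, polytonic = [], []
    for word in words:
        (polytonic if has_polytonic(word) else monotonic).append(word)
    return monotonic, polytonic


if __name__ == "__main__":
    # "\u1f04" is alpha with psili and oxia, a polytonic-only character.
    words = ["λόγος", "\u1f04νθρωπος", "κόσμος"]
    mono, poly = split_wordlist(words)
    print("monotonic:", mono)
    print("polytonic:", poly)
```

The monotonic list would be a candidate for tesseract-ocr-ell and the polytonic one for tesseract-ocr-grc.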


