Tom--
To view this discussion on the web visit https://groups.google.com/d/msg/ocropus/-/y1uIZQTwfbQJ.
You should be able to train it on Kannada in principle; all the string handling is in terms of Unicode/UTF-8 (and is very simple anyway in the new version).
However, given that Kannada is a fairly complex script, you may run into some script-related issues (e.g., with how diacritics are encoded). You may have to modify the default Unicode representation slightly, break up ligatures, etc. The only way to know is to give it a try.
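Before training, it can help to look at how a Kannada transcription is actually encoded, since vowel signs and the virama are separate code points. The following is only a minimal sketch using Python's standard `unicodedata` module; the helper names `describe` and `same_after_normalization` are made up for illustration and are not part of OCRopus.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
    ]

def same_after_normalization(a, b, form="NFC"):
    """True if two strings are equal once both are normalized to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

if __name__ == "__main__":
    # The word "Kannada" written in Kannada script, given as escapes
    # so the example does not depend on the source file encoding.
    word = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"
    for row in describe(word):
        print(*row)
    # Composed and decomposed spellings of the same text compare equal
    # only after normalizing both sides to one form.
    decomposed = unicodedata.normalize("NFD", word)
    print(same_after_normalization(word, decomposed))
```

Normalizing all ground truth to a single form before training is one way to avoid the same glyph sequence appearing under two different encodings; whether NFC or NFD works better for a given script is something to try out.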
I agree with your logic. First, I should at least be able to view the output in Kannada script, irrespective of ...
Please send me a note if this would be helpful.
With best regards
Andreas
First, if I remember correctly, previous versions of OCRopus were able to use Tesseract 3.02 training files. Is it possible to train OCRopus 0.7 with these files, too?
Second, the Fraktur example does not support 'long-s', so words like 'Wachstube' vs. 'Wachſtube' could be problematic in historical texts.
Because I am digitizing a 20th-century book set in Fraktur in a personal project, I could send you some fully corrected word lists and pages.
Second, the Fraktur example does not support 'long-s', so words like 'Wachstube' vs. 'Wachſtube' could be problematic in historical texts.
It should support long-s, but it doesn't encode it separately in the output.
Hello Tom,
Thanks for your answer.
OCRopus 0.7 doesn't need to be trained with individual characters, so you don't really need the Tesseract training files. But you should be able to use the scans that those files were derived from easily.
Hmm, not really, because my Tesseract training pages are not split into single lines. Or could I train OCRopus with a whole page and its corresponding text? The thing is, I would like to use one set of training pages for both Tesseract and OCRopus, without tool-specific modifications.
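If the pages are first cut into line images, a page-level transcription can be matched to them line by line. The following is only a rough sketch of that pairing step: it assumes the line images sort into reading order by filename and that each transcription line sits next to its image with a `.gt.txt` suffix, which is the naming I understand OCRopus to expect. The paths and the helper names `gt_path_for`, `pair_lines` and `write_ground_truth` are hypothetical.

```python
from pathlib import Path

def gt_path_for(image_path):
    """Map e.g. 'book/0001/010001.bin.png' to 'book/0001/010001.gt.txt' (assumed naming)."""
    p = Path(image_path)
    stem = p.name.split(".", 1)[0]  # drop every suffix, e.g. '.bin.png'
    return p.with_name(stem + ".gt.txt")

def pair_lines(image_paths, page_text):
    """Pair sorted line images with the non-empty lines of a page transcription."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    images = sorted(image_paths)
    if len(lines) != len(images):
        # A mismatch usually means the segmentation merged or split a line.
        raise ValueError(f"{len(images)} line images but {len(lines)} text lines")
    return [(gt_path_for(img), text) for img, text in zip(images, lines)]

def write_ground_truth(pairs):
    """Write each transcription line to its ground-truth file as UTF-8."""
    for path, text in pairs:
        path.write_text(text + "\n", encoding="utf-8")
```

The count check matters in practice: a page whose segmentation does not match the transcription has to be fixed by hand before its lines are usable for training.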
It should support long-s, but it doesn't encode it separately in the output.
That is a problem. I need the correct encoding of long-s. I want to preserve the character 'ſ' in the output; it should not be replaced with 's'. The same goes for »«, „“ and so on. But that should not be a problem if I train my own models, right?
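One thing to watch when preparing your own ground truth is that any text clean-up step keeps 'ſ' intact. Unicode compatibility normalization (NFKC) folds the long s (U+017F) into a plain 's', while NFC leaves it alone. The sketch below uses only Python's standard library; `normalize_ground_truth` and `lost_characters` are illustrative names, not OCRopus functions.

```python
import unicodedata
from collections import Counter

# Characters the transcription must keep: long s and the quotation marks » « „ “
WATCHED = "\u017f\u00bb\u00ab\u201e\u201c"

def normalize_ground_truth(text):
    """Normalize with NFC, which does not replace long s with 's'."""
    return unicodedata.normalize("NFC", text)

def lost_characters(original, processed, watched=WATCHED):
    """Return {character: number of occurrences lost} for the watched characters."""
    before, after = Counter(original), Counter(processed)
    return {ch: before[ch] - after[ch] for ch in watched if before[ch] > after[ch]}

if __name__ == "__main__":
    line = "\u201eWach\u017ftube\u201c"  # the word Wachſtube in German quotation marks
    # NFC keeps every watched character.
    print(lost_characters(line, normalize_ground_truth(line)))
    # NFKC turns the long s into 's', so it is reported as lost.
    print(lost_characters(line, unicodedata.normalize("NFKC", line)))
```

Running the same check on the recognizer output against the ground truth would show whether a trained model really emits 'ſ' or silently substitutes 's'.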