Hello guys.
I want to add a new language script to Tesseract OCR and am researching how to prepare the training data.
I would like to know the following things.
No tool I can think of. What I would do is edit the file in a large-file text editor (such as EmEditor) to remove duplicate words. You could do this by replacing all spaces with newlines, then sorting and removing duplicates. After that you can randomize the unique list of words, add an appropriate distribution of punctuation characters, and re-edit to create a block of text wrapped at, say, 100 characters. There are online tools to do the randomizing and wrapping.

Having said this, I don't know how valuable it is to have training text containing specific words. I have been struggling myself to train on specific word lists without much success. I think training text is mainly about having a representative distribution of characters.

Please let me know if you have any insights on the wordlists in langdata, as I'm a bit hazy there.

Thanks
James
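For anyone who would rather script the steps above than do them in an editor, here is a rough Python sketch of the same idea: split on whitespace, drop duplicates, shuffle, sprinkle in punctuation, and wrap at 100 characters. The function name `build_training_text`, the punctuation set, and the 10% punctuation rate are my own placeholders, not anything Tesseract prescribes, so adjust them to suit your script.

```python
import random
import textwrap


def build_training_text(raw_text, punctuation=".,;:!?", punct_rate=0.1,
                        width=100, seed=0):
    """Deduplicate, shuffle, punctuate, and wrap a word list.

    punctuation and punct_rate are illustrative assumptions; pick values
    that match how punctuation is distributed in your target language.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable

    # Splitting on whitespace is the scripted form of "replace spaces
    # with newlines"; set() removes duplicates, sorted() gives a stable
    # starting order before shuffling.
    words = sorted(set(raw_text.split()))
    rng.shuffle(words)

    # Append a random punctuation mark to roughly punct_rate of the words.
    decorated = [
        word + rng.choice(punctuation) if rng.random() < punct_rate else word
        for word in words
    ]

    # Wrap the result into lines of at most `width` characters.
    return textwrap.fill(" ".join(decorated), width=width)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog the fox"
    print(build_training_text(sample))
```

For a real corpus you would read the text from a file (with the right encoding, usually UTF-8) and write the result back out. Note that splitting on whitespace only makes sense for scripts that separate words with spaces.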