Hi,
I downloaded the latest LSTM langdata from the Tesseract repository and found that it contains a lot of incorrect data for Sinhala. I'm trying to train Tesseract for Sinhala. According to the Tesseract wiki guidelines, we need to create the lang data before creating training data with the tesstrain.sh script. I'm referring to the wiki page below:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
I couldn't find proper wiki guidelines on creating lang data. When I inspected the 'sin' folder in langdata-lstm, I found the files below:
- desired_characters
- okfonts.txt
- sin.numbers
- sin.punc
- sin.singles_text
- sin.training_text
- sin.unicharset
- sin.wordlist
Please let me know if there is proper documentation that I can follow to create these files on my own from scratch. Based on my observations, this is my understanding of these files. If there is no proper documentation for them, please correct anything I have got wrong here:
desired_characters
This file contains all the unique characters found in the language, one per line. My question: Sinhala has many vowel characters that combine with Sinhala consonants to form compound characters. Unlike in English, once a vowel character is attached to a consonant, it usually forms a single compound character, which I can erase with a single backspace. Please refer to the examples below:
Example 1:
Consonant : ද
Vowel character : ො
Compound character : දො
Example 2:
Consonant : බ
Vowel character : ්
Compound character : බ්
So each consonant, combined with the different vowel characters, produces a large number of compound characters. Should I enter all of those compound character combinations in this file?
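To illustrate what I mean, here is a small Python sketch (my own, not taken from any Tesseract documentation) that prints the code points behind the two examples above. It only shows that each compound character is stored as a consonant followed by a combining sign; it does not tell me what Tesseract expects in desired_characters. The helper name `describe` is hypothetical.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode general category) for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]

# Written with escapes so the exact code points are unambiguous.
examples = {
    "Example 1": "\u0DAF\u0DDC",  # consonant followed by a vowel sign
    "Example 2": "\u0DB6\u0DCA",  # consonant followed by the al-lakuna sign
}

for label, cluster in examples.items():
    # Each cluster is two code points: a letter (Lo) and a mark (Mc or Mn).
    print(label, cluster, describe(cluster))
```

So although each example behaves like one character on screen, it is two code points in the file, which is why I am unsure whether the combinations need to be listed separately.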
okfonts.txt
This file lists the fonts I use for my training_text, one font name per line. Can I include non-Unicode fonts in this file?
sin.numbers
This file includes all the number characters used in Sinhala, one per line. Normally it contains only 10 characters.
sin.punc
This file contains all the punctuation characters that can be used in Sinhala text, one per line. In the langdata, however, it contains punctuation combinations. Please explain why.
sin.singles_text
Similar to the wordlist file. Contains unique words, one per line.
sin.training_text
Training text to be used when creating training data. It should contain around 40000 lines of text, and each line can have any number of characters. It's better if this document contains text in the multiple fonts that we have defined in okfonts.txt. (These fonts can also be passed as a command-line argument.)
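As a sanity check on my own files, I am thinking of something like the Python sketch below, which reports characters that appear in the training text but are missing from desired_characters. This is only my assumption about how the two files should relate, not something the wiki states, and the function name `uncovered_characters` is hypothetical. It works on individual code points, so a compound character counts as its separate parts.

```python
def uncovered_characters(training_lines, desired_chars):
    """Return sorted non-whitespace characters used in the training lines
    that are not present in desired_chars."""
    seen = set()
    for line in training_lines:
        seen.update(ch for ch in line if not ch.isspace())
    return sorted(seen - set(desired_chars))

# Tiny in-memory example; real use would read the two files as UTF-8.
lines = ["ab c", "d"]
print(uncovered_characters(lines, "abc"))
```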
sin.unicharset
This file is generated when the training data is created.
sin.wordlist
Contains unique words, one per line.
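If my understanding of the wordlist format is right, I could build it from my own text with a sketch like the one below. It is a minimal Python example under my own assumptions: words are split on whitespace only, so punctuation stays attached to a word unless it is removed beforehand, and the function name `unique_words` is hypothetical.

```python
def unique_words(lines):
    """Return the distinct whitespace-separated words in first-seen order."""
    seen = set()
    words = []
    for line in lines:
        for word in line.split():
            if word not in seen:
                seen.add(word)
                words.append(word)
    return words

# Tiny in-memory example; real use would read the training text as UTF-8
# and write the result back out with one word per line.
sample = ["a b a", "c b"]
print("\n".join(unique_words(sample)))
```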
I would appreciate your response on this. Thanks.
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/9e8be0bf-b0d5-4408-98b7-283913ccf642%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.