The steps described in the [Tesseract 3.00-3.02 training guide](https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02) are not clear to me, and I could not find any clearer documentation on the subject.
The guide says that the following data files are required:
tessdata/eng.config
tessdata/eng.unicharset
tessdata/eng.unicharambigs
tessdata/eng.inttemp
tessdata/eng.pffmtable
tessdata/eng.normproto
tessdata/eng.punc-dawg
tessdata/eng.word-dawg
tessdata/eng.number-dawg
tessdata/eng.freq-dawg
But it does not explain what format these files use or what they actually are.
As far as I understand, the language I am working on is not covered by UTF-8 but is covered by UTF-16, although it has its own official Unicode code-point range.
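A quick way to test that encoding assumption in Python might look like the sketch below. The function name `utf8_roundtrip_ok` is made up for illustration, and the range U+0980 to U+09FF is only a placeholder; the real code-point range of the language would go in its place.

```python
def utf8_roundtrip_ok(first, last):
    """Return True if every code point in [first, last] survives a
    UTF-8 encode/decode round trip unchanged."""
    for cp in range(first, last + 1):
        ch = chr(cp)
        try:
            if ch.encode("utf-8").decode("utf-8") != ch:
                return False
        except UnicodeError:
            # Raised for code points that are not valid scalar values,
            # e.g. the surrogate range U+D800 to U+DFFF.
            return False
    return True


# Placeholder range (an assumption); replace with the language's own block.
print(utf8_roundtrip_ok(0x0980, 0x09FF))
```

If this prints `True` for the language's block, the text files for training can be saved as UTF-8, since UTF-8 and UTF-16 are both encodings of the same Unicode code points.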
From what I have understood so far:
eng.word-dawg: I need to create a text file mylang.txt with one word per line. The words, and their letters, will be in the language I am working on. I then convert it to a dawg file. I assume the command for that is wordlist2dawg mylang.txt mylang.word-dawg
eng.number-dawg: Create a text file mylangnum.txt with the numerical characters (0 to 9), one per line. Then convert it to mylang.number-dawg
eng.freq-dawg: Same steps as for the eng.word-dawg file, but with only the most frequent words (these could be obtained, for example, by processing a dataset such as a newspaper corpus). The most frequent word goes on the first line (the frequency itself is not needed), the next most frequent word on the second line, and so on.
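The word-list steps above could be sketched in Python roughly as follows. This is only a minimal sketch under my own assumptions: the helper names `build_word_lists` and `write_list`, the punctuation set `PUNCT`, and whitespace-based tokenization are all illustrative choices, not something the Tesseract guide prescribes.

```python
from collections import Counter

# Punctuation stripped from the ends of tokens (an assumption; extend it
# with the punctuation marks actually used by the target language).
PUNCT = ".,;:!?()[]{}\"'"


def build_word_lists(corpus_text, top_n=1000):
    """Return (all_words, frequent_words) from raw corpus text.

    all_words is every distinct word, sorted; frequent_words holds the
    top_n words ordered from most to least frequent, without counts.
    """
    tokens = [t.strip(PUNCT) for t in corpus_text.split()]
    # Drop empty tokens and pure numbers (numbers belong in the number list).
    tokens = [t for t in tokens if t and not t.isdigit()]
    counts = Counter(tokens)
    all_words = sorted(counts)
    frequent_words = [word for word, _ in counts.most_common(top_n)]
    return all_words, frequent_words


def write_list(path, words):
    """Write one word per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        for word in words:
            handle.write(word + "\n")
```

For example, `all_words, frequent = build_word_lists(text)` followed by `write_list("mylang.txt", all_words)` would produce the mylang.txt file described above, and a second call to `write_list` with `frequent` and another file name would produce the input for the frequent-word dawg. Both files would still need to be converted with wordlist2dawg.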
I don't know anything about the remaining 7 files.
Could someone please direct me to a better tutorial for adding a new language to Tesseract?
Or verify my assumptions above, tell me about the remaining 7 files, and explain how to proceed once I have all 10 files? That last part is still a bit confusing to me.
I am working with Python on Ubuntu 16.04 LTS, with Tesseract version 3.04.01 (installed with sudo apt install tesseract-ocr, and working perfectly for English).
I am new to this field, so sorry if I have made any mistakes.
If I need to upgrade Tesseract to version 4 first, do I need to uninstall the previous version, or can I override it with some update command? (Will the alex-tesseract 4 PPA work for overriding the installed version?)
Thank you.