Training Tesseract from scratch for a new language that does not exist in Tesseract

haru...@gmail.com

Mar 28, 2019, 2:32:30 PM
to tesseract-ocr
The steps described here for [Tesseract 3.00-3.02][ https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02 ] are not clear, nor could I find any clear documentation about them:

It says that the following data files are required:

    tessdata/eng.config
    tessdata/eng.unicharset
    tessdata/eng.unicharambigs
    tessdata/eng.inttemp
    tessdata/eng.pffmtable
    tessdata/eng.normproto
    tessdata/eng.punc-dawg
    tessdata/eng.word-dawg
    tessdata/eng.number-dawg
    tessdata/eng.freq-dawg


But it does not explain what these files actually are or what formats they use.

The language I am working on is, as far as I can tell, not included in UTF-8 but is in UTF-16, although it has its own official Unicode code-point range.
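A quick check in Python suggests that a character from any official Unicode range can be written in both UTF-8 and UTF-16, so the encoding itself should not be the obstacle. This is only a sketch: the code points below (U+0041, U+ABC0, U+1E900) are arbitrary examples chosen for illustration, not necessarily the script discussed in this thread, and `encoding_lengths` is a hypothetical helper name.

```python
def encoding_lengths(ch):
    """Round-trip one character through UTF-8 and UTF-16 and
    return the number of bytes each encoding needs."""
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")  # little-endian, no byte-order mark
    # Both encodings must decode back to the same character.
    assert utf8.decode("utf-8") == ch
    assert utf16.decode("utf-16-le") == ch
    return len(utf8), len(utf16)


# Arbitrary sample code points (assumptions for illustration only).
for code_point in (0x0041, 0xABC0, 0x1E900):
    n8, n16 = encoding_lengths(chr(code_point))
    print("U+%04X: %d bytes in UTF-8, %d bytes in UTF-16" % (code_point, n8, n16))
```

If the training text is saved as UTF-16, re-saving it as UTF-8 is probably the safer choice for the Tesseract training tools.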

From what I understood so far,

eng.word-dawg : I need to create a text file mylang.txt with one word on each line. The words and their letters will be in the language I am working on. Then I convert it to a dawg file. I assume the command for that is

    wordlist2dawg mylang.txt mylang.word-dawg
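The word list itself could be produced along these lines. This is a minimal sketch, assuming the source text is already available as a Python string and that words are separated by whitespace; `strip_punct`, `build_wordlist` and `write_wordlist` are hypothetical helper names, and whether the dawg tool expects exactly this file is part of what I am asking.

```python
import unicodedata
from pathlib import Path


def strip_punct(token):
    """Drop leading and trailing characters whose Unicode category is punctuation."""
    start, end = 0, len(token)
    while start < end and unicodedata.category(token[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(token[end - 1]).startswith("P"):
        end -= 1
    return token[start:end]


def build_wordlist(text):
    """Return the unique whitespace-separated words of `text`, sorted."""
    words = {strip_punct(tok) for tok in text.split()}
    words.discard("")  # tokens that were punctuation only
    return sorted(words)


def write_wordlist(words, path):
    """Write one word per line, encoded as UTF-8."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")
```

Splitting on whitespace rather than on a `\w+` regex is deliberate here: combining marks are not matched by `\w` in Python, so a regex could cut words apart in scripts that use them.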

eng.number-dawg : Create a text file mylangnum.txt with the numerical characters, one on each line (0 to 9). Then convert it to mylang.number-dawg


eng.freq-dawg : Same steps as for the eng.word-dawg file, but with the most frequent words only (frequent words could be retrieved, for example, by processing a dataset such as a newspaper corpus). The most frequent word goes on the first line (no need to include the frequency), the next most frequent word on the second line, and so on.
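The frequency ordering I have in mind could be sketched like this. It assumes the corpus is one Python string with whitespace-separated words and leaves out punctuation handling; `frequent_words` and the `top_n` cut-off are hypothetical names chosen for illustration.

```python
from collections import Counter


def frequent_words(text, top_n=1000):
    """Return up to `top_n` words of `text`, most frequent first,
    without the counts themselves."""
    counts = Counter(text.split())
    return [word for word, _count in counts.most_common(top_n)]


corpus = "a b a c a b"  # stand-in for a real newspaper corpus
print("\n".join(frequent_words(corpus, top_n=2)))
```

Each returned word would then be written on its own line of the text file that gets converted to mylang.freq-dawg.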

I don't know about the remaining 7 files.

Could someone please direct me to a better tutorial on adding a new language to Tesseract?

Or verify my assumptions above, tell me about the remaining 7 files, and explain how to proceed once I have all 10 files.

The steps in [Tesseract 3.00-3.02][ https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02 ]:

Generate Training Images

Make Box Files

Bootstrapping a new character set

Tif/Box pairs provided


are still a bit confusing to me.

I am working with Python on Ubuntu 16.04 LTS, with Tesseract version 3.04.01 (installed with sudo apt install tesseract-ocr, and working perfectly for English).
I am new to this field, so sorry if I have made any mistakes.

If I need to upgrade Tesseract to version 4 first, do I need to uninstall the previous version, or can I override it with some update command? (Will the PPA of alex-tesseract 4 work for overriding the version?)
Thank you.



Shree Devi Kumar

Mar 29, 2019, 12:52:22 AM
to tesser...@googlegroups.com
For Tesseract 3, and for training a language similar to vie, take a look at VietOCR and jTessBoxEditor.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6c54f502-0c92-424f-87ca-77fe58694d53%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

haru...@gmail.com

Mar 30, 2019, 3:44:38 AM
to tesseract-ocr
Hi, you might have confused this with my other question. I am actually working on two languages, neither of which is currently present in Tesseract. One of them has a script/letters somewhat similar to vie. This one has no connection with, and is totally different from, any language currently available in Tesseract. Also, neither language has any connection with vie. Thanks.

Shree Devi Kumar

Mar 30, 2019, 3:58:19 AM
to tesser...@googlegroups.com
jTessBoxEditor offers Tesseract training for version 3.0x; that's why I mentioned it.

For Tesseract 4, the training steps are very different.




haru...@gmail.com

Mar 31, 2019, 12:25:23 AM
to tesseract-ocr

Ok. thanks.

Could you guide me on how to train with Tesseract 4?



shree

Mar 31, 2019, 4:26:14 AM
to tesseract-ocr

If you want to use the automated tesstrain.sh method, you will have to add the new language code to tesseract/src/training/language_specific.sh and create langdata (script unicharset, training_text, wordlist, etc.) for it.
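As a rough sketch of the langdata part only, a folder for a new language could be laid out like this. The `<lang>/<lang>.training_text` and `<lang>/<lang>.wordlist` names follow the naming I understand the langdata repository to use, so treat them as an assumption to check against an existing language there; the code "xyz" and the helper `make_langdata_skeleton` are hypothetical. The script unicharset and the language_specific.sh change are not covered.

```python
from pathlib import Path


def make_langdata_skeleton(root, lang, training_text, words):
    """Create <root>/<lang>/ with a training_text file and a wordlist file.

    File names are an assumption based on the langdata naming pattern;
    `lang` is a placeholder code, not a registered Tesseract language.
    """
    lang_dir = Path(root) / lang
    lang_dir.mkdir(parents=True, exist_ok=True)
    # Sample sentences in the new language, used to render training images.
    (lang_dir / (lang + ".training_text")).write_text(training_text, encoding="utf-8")
    # One word per line.
    (lang_dir / (lang + ".wordlist")).write_text("\n".join(words) + "\n", encoding="utf-8")
    return lang_dir
```

For example, `make_langdata_skeleton("langdata", "xyz", text, words)` would create `langdata/xyz/xyz.training_text` and `langdata/xyz/xyz.wordlist`, both saved as UTF-8.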

haru...@gmail.com

Mar 31, 2019, 5:56:33 AM
to tesseract-ocr
I will have a look into this. Thanks.