@Shree: I have a question about making a fine-tuned traineddata


이경준

Mar 4, 2018, 11:10:57 AM
to tesseract-ocr
I'm using Tesseract 4.0.

Hi @Shree. You are always kind to me. Thank you for all your advice, suggestions, and teaching.

I am really very grateful.

I have gone through your advice and the Tesseract 4.0 training wiki on GitHub.

From them I have worked out the training steps I need for a fine-tuned Korean traineddata.

I'm using the bash script and log.txt you gave me.

From the wiki (Training Tesseract 4.00): https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00


Overview of Training Process


  1. Prepare training text.
  2. Render text to image + box file. (Or create hand-made box files for existing image data.)
  3. Make unicharset file. (Can be partially specified, ie created manually).
  4. Make a starter traineddata from the unicharset and optional dictionary data.
  5. Run tesseract to process image + box file to make training data set.
  6. Run training on training data set.
  7. Combine data files.


I think that steps 3, 4, 5, 6, and 7 don't matter to me,

because I just want to fine-tune "tessdata_fast-kor.traineddata". So I use your bash script (attached).

Am I right or wrong?

After running your bash script, do I still have to do steps 3, 4, 5, 6, and 7 myself?

I think that this process (running your bash script) produces a new fine-tuned kor.traineddata,

so I will use that data for my business.

If I'm wrong, please give me some advice. Thank you very much in advance.

#section 1. (Additionally) I have questions about the bash script you gave me


In the bash script:

1. What is the criterion for extracting 100-120 lines? I have no idea.

2. The number of iterations is 300. Why? Is it possible to change this number?

3. Why are you using only one font? Is it possible to use more fonts, both in number and in kind (e.g. Baekmuk.Dotum..)?



----------------------------------------------------------------------------------------------------------------------------------------
#section 2. Related to this page: "https://github.com/tesseract-ocr/tesseract/issues/1172"



4. Are "language-specific.sh" in the tesseract/training folder and the bash script you gave me unrelated?

I think that they share the font settings. So I think that I have to change "language-specific.sh" in order to use the bash script you gave me.

Am I right or wrong?

5. Someone made a *.lstmf file this way (attached).

Is it the same as using "tesstrain.sh"?

Is it correct for Tesseract 4.0?



-------------------------------------------------------------------------------------------------------------------------------------------------



6. In my situation, I want to fine-tune the existing kor.traineddata made by Google.

I'm not concerned about the "word list". Does it not matter for me?

Am I right or wrong?

I look forward to your reply. Thank you very much in advance.


tesstrain_pluschars.sh
kor-tesstrain_pluschars-log.txt

ShreeDevi Kumar

Mar 4, 2018, 11:24:25 AM
to tesser...@googlegroups.com

#section 1. (Additionally) I have questions about the bash script you gave me

In the bash script:

1. What is the criterion for extracting 100-120 lines? I have no idea.


Only 3 pages are processed by tesstrain.sh for making box/tiff files, so it will be about 120 lines of text.
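
As a minimal sketch of trimming a training text to roughly that size: the file names below are hypothetical, a stand-in corpus is generated so the snippet runs on its own, and the attached script may do this differently.

```shell
# Hypothetical temporary files; replace full_text with your real training text.
full_text=$(mktemp)
small_text=$(mktemp)

# Stand-in corpus of 500 lines so the sketch is self-contained.
seq 1 500 | sed 's/^/sample line /' > "$full_text"

# Keep only the first 120 lines, about what the three processed pages hold.
head -n 120 "$full_text" > "$small_text"

# Report how many lines ended up in the trimmed text.
wc -l < "$small_text"
```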





2. The number of iterations is 300. Why? Is it possible to change this number?

Yes, you can change it. This is the recommended number; see the fine-tuning details in the training wiki.


3. Why are you using only one font? Is it possible to use more fonts, both in number and in kind (e.g. Baekmuk.Dotum..)?

I was only testing. You can use lots of fonts and experiment.



----------------------------------------------------------------------------------------------------------------------------------------
#section 2. Related to this page: "https://github.com/tesseract-ocr/tesseract/issues/1172"



4. Are "language-specific.sh" in the tesseract/training folder and the bash script you gave me unrelated?

I think that they share the font settings. So I think that I have to change "language-specific.sh" in order to use the bash script you gave me.

Am I right or wrong?

You can specify the fonts on the command line; then language-specific.sh does not need to be changed.
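
A hedged sketch of passing fonts on the command line, using the `--fontlist` option that tesstrain.sh accepts. Every path and both font names are placeholders, not values from the attached script, and the command is only printed (one argument per line) rather than run.

```shell
# Dry run: build the tesstrain.sh argument list and print it instead of executing it.
# All paths and font names below are assumptions for illustration.
set -- training/tesstrain.sh \
  --lang kor --linedata_only --noextract_font_properties \
  --fonts_dir /usr/share/fonts \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --training_text ./kor.small.training_text \
  --output_dir ./kor_finetune \
  --fontlist "Baekmuk Dotum" "Baekmuk Batang"

# One argument per line, so multi-word font names stay visibly intact.
cmd=$(printf '%s\n' "$@")
printf '%s\n' "$cmd"
```

Font names have to match what is installed; `text2image --list_available_fonts --fonts_dir /usr/share/fonts` lists the names that can be used.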

5. Someone made a *.lstmf file this way (attached).


I don't know.


Is it the same as using "tesstrain.sh"?

Is it correct for Tesseract 4.0?



-------------------------------------------------------------------------------------------------------------------------------------------------



6. In my situation, I want to fine-tune the existing kor.traineddata made by Google.

I'm not concerned about the "word list". Does it not matter for me?

Then you can leave the word list out of the command.


Am I right or wrong?



If you run that script with one font as an experiment, then you will know how it works.




 



ShreeDevi Kumar

Mar 4, 2018, 11:31:02 AM
to tesser...@googlegroups.com
Once you make a small training text, choose the fonts to use, and modify the bash script to point to the correct directories in your setup, it will perform all the training steps for fine-tuning.
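
For orientation, a rough sketch of the commands a fine-tuning script of this kind typically runs once tesstrain.sh has produced the .lstmf files, following the command pattern in the training wiki. This is not the attached tesstrain_pluschars.sh: all paths are placeholders, and each step is only printed (dry run). The wiki's fine-tuning examples start from a tessdata_best model, so the sketch assumes one too.

```shell
# Dry run: "run" prints each command instead of executing it.
run() { printf '%s\n' "$*"; }

# Placeholder locations (assumptions, adjust to your setup).
work=./kor_finetune
base=./tessdata_best/kor.traineddata

plan=$(
  # 1. Extract the LSTM model from the existing traineddata.
  run combine_tessdata -e "$base" "$work/kor.lstm"

  # 2. Continue training from that model on the generated .lstmf files.
  run lstmtraining \
    --continue_from "$work/kor.lstm" \
    --traineddata "$base" \
    --train_listfile "$work/kor.training_files.txt" \
    --model_output "$work/tuned" \
    --max_iterations 300

  # 3. Convert the final checkpoint into a usable traineddata file.
  run lstmtraining --stop_training \
    --continue_from "$work/tuned_checkpoint" \
    --traineddata "$base" \
    --model_output "$work/kor.traineddata"
)
printf '%s\n' "$plan"
```

Removing the `run` prefix would execute the steps for real, assuming the Tesseract 4 training tools are installed and the placeholder paths exist.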


이경준

Mar 4, 2018, 10:34:23 PM
to tesseract-ocr
Thank you very much.
(Plud)
First, about the question in my post:
do I not need to carry out any of steps 3, 4, 5, 6, and 7 myself?

Are they already included in the script (for making the fine-tuned traineddata)?
Thank you.

이경준

Mar 4, 2018, 10:35:09 PM
to tesseract-ocr
"Plud" was meant to be "plus", i.e. additionally.