I'm reading the "Using tesstrain" (Tesseract 4.0) wiki passage - I have a question

이경준

Feb 28, 2018, 2:21:17 AM
to tesseract-ocr

Hi, I'm studying this passage, but I cannot understand what the flag "--noextract_font_properties" means, so I looked at the file /tesseract/training/tesstrain.sh.

But I cannot find "--noextract_font_properties" there.

Here is the usage:

# USAGE:
#
# tesstrain.sh
#    --fontlist FONTS           # A list of fontnames to train on.
#    --fonts_dir FONTS_PATH     # Path to font files.
#    --lang LANG_CODE           # ISO 639 code.
#    --langdata_dir DATADIR     # Path to tesseract/training/langdata directory.
#    --output_dir OUTPUTDIR     # Location of output traineddata file.
#    --overwrite                # Safe to overwrite files in output_dir.
#    --linedata_only            # Only generate training data for lstmtraining.
#    --run_shape_clustering     # Run shape clustering (use for Indic langs).
#    --exposures EXPOSURES      # A list of exposure levels to use (e.g. "-1 0 1").
#
# OPTIONAL flags for input data. If unspecified we will look for them in
# the langdata_dir directory.
#    --training_text TEXTFILE   # Text to render and use for training.
#    --wordlist WORDFILE        # Word list for the language ordered by
#                               # decreasing frequency.
#
# OPTIONAL flag to specify location of existing traineddata files, required
# during feature extraction. If unspecified will use TESSDATA_PREFIX defined in
# the current environment.
#    --tessdata_dir TESSDATADIR     # Path to tesseract/tessdata directory.
#
# NOTE:
# The font names specified in --fontlist need to be recognizable by Pango using
# fontconfig. An easy way to list the canonical names of all fonts available on
# your system is to run text2image with --list_available_fonts and the
# appropriate --fonts_dir path.






Using tesstrain

The setup for running tesstrain.sh is the same as for base Tesseract. Use the --linedata_only option for LSTM training. Note that it is beneficial to have more training text and make more pages though, as neural nets don't generalize as well and need to train on something similar to what they will be running on. If the target domain is severely limited, then all the dire warnings about needing a lot of training data may not apply, but the network specification may need to be changed.

Training data is created using tesstrain.sh as follows: Note that your fonts location may vary.

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain


Thank you very much. I would appreciate a reply from anybody.

ShreeDevi Kumar

Feb 28, 2018, 2:32:46 AM
to tesser...@googlegroups.com
training/tesstrain.sh \
  --fonts_dir /usr/share/fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --output_dir ~/tesstutorial/engtrain

You should try to follow the above tutorial for training eng.

You need to make sure the correct paths are given for the various directories.

You should know that tesseract will recognise Korean without training, using existing traineddata.
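
On the original question of where --noextract_font_properties comes from: one way to look for a flag that is missing from the usage header of tesstrain.sh is to search every script in the training directory, since tesstrain.sh may source helper scripts that do the flag parsing. The sketch below is only an illustration. The function name find_flag_files and the ./training path are assumptions, not part of Tesseract.

```shell
#!/bin/sh
# find_flag_files is a hypothetical helper, not a Tesseract tool.
# It prints the name of every *.sh file in directory $2 that mentions flag $1.
find_flag_files() {
    grep -l -- "$1" "$2"/*.sh
}

# Assumed location of the training scripts; change it to match your checkout.
TRAINING_DIR=./training
if [ -d "$TRAINING_DIR" ]; then
    find_flag_files noextract_font_properties "$TRAINING_DIR" \
        || echo "flag not found in $TRAINING_DIR"
fi
```

In the 4.0 source tree the match is expected to be in tesstrain_utils.sh, which tesstrain.sh sources, but treat that as something to verify against your own checkout.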

이경준

Feb 28, 2018, 3:53:00 AM
to tesseract-ocr
Sorry, but I have an issue with Korean.

The answer you gave applies to English, but it doesn't work for Korean.

The logs show a font error. I looked at /training/language-specific.sh:

vi language-specific.sh

Font list: kor _NeoLatin

So I installed Korean fonts there and rebooted, but I get the same result. Is it possible to solve the Korean font issue?

ShreeDevi Kumar

Feb 28, 2018, 4:18:41 AM
to tesser...@googlegroups.com
Try the following. Make sure that you change all the options ending in _dir to match your setup.

tesstrain.sh \
 --lang kor \
 --noextract_font_properties \
 --linedata_only \
 --langdata_dir ../langdata \
 --tessdata_dir ../tessdata \
 --fonts_dir /mnt/c/Windows/Fonts \
 --fontlist \
  "Arial Unicode MS" \
 --output_dir ../tesstutorial/kor

The fontlist you specify in the command will override the list in language-specific.sh.
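
The /mnt/c/Windows/Fonts path above is where Windows fonts appear under Windows Subsystem for Linux; on a native Ubuntu system the fonts directory will be different. A minimal sketch for picking a fonts directory and checking which font names are usable follows. The candidate directories and the helper name first_existing_dir are assumptions for illustration. text2image --list_available_fonts is the check recommended in the tesstrain.sh usage notes, and fc-list is the fontconfig listing tool.

```shell
#!/bin/sh
# first_existing_dir is a hypothetical helper: it prints the first argument
# that is an existing directory and returns 1 if none of them exists.
first_existing_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Candidate font locations (assumptions; adjust for your system).
FONTS_DIR=$(first_existing_dir /mnt/c/Windows/Fonts /usr/share/fonts "$HOME/.fonts") || FONTS_DIR=

if [ -n "$FONTS_DIR" ]; then
    echo "Using fonts from: $FONTS_DIR"
    # Font families that fontconfig reports as covering Korean.
    if command -v fc-list >/dev/null 2>&1; then
        fc-list :lang=ko family || true
    fi
    # Canonical names that text2image (and therefore --fontlist) accepts.
    if command -v text2image >/dev/null 2>&1; then
        text2image --list_available_fonts --fonts_dir="$FONTS_DIR" || true
    fi
fi
```

A name printed by text2image can then be passed to --fontlist, together with the same directory in --fonts_dir.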



이경준

Feb 28, 2018, 10:51:45 PM
to tesseract-ocr
Thank you for replying to my question.

But my system runs Ubuntu 16.04.3 LTS.

I think that path will not work on my system. Am I wrong?

ShreeDevi Kumar

Feb 28, 2018, 10:59:31 PM
to tesser...@googlegroups.com
Tesseract 4.00alpha gives good results for Korean recognition. Have you tried that? You may not need to do training.

If you want to do training for 4.00, you need files from langdata and tessdata_best.





이경준

Mar 1, 2018, 12:16:18 AM
to tesseract-ocr
Yes, I tried kor.traineddata from tessdata, but it is not good enough. Sorry ㅜㅜ I cannot use the Tesseract 4.0 kor.traineddata from tessdata in business.

So I must train kor for 4.00. Thank you for the advice.

ShreeDevi Kumar

Mar 1, 2018, 12:30:11 AM
to tesser...@googlegroups.com
> my system runs Ubuntu 16.04.3 LTS

> Yes, I tried kor.traineddata from tessdata, but it is not good enough. Sorry ㅜㅜ I cannot use the Tesseract 4.0 kor.traineddata from tessdata in business.

I suggest that you uninstall your old Tesseract version (3.0x):


sudo apt-get remove tesseract-ocr


and then install Tesseract 4.00 from the PPA provided by AlexanderP:

sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-kor


이경준

Mar 1, 2018, 1:00:36 AM
to tesseract-ocr
Thank you for the advice.

I have never installed Tesseract 3.0x.

I have a question: your last command line installs the language pack, kor.traineddata, into the tessdata directory. Am I wrong?

What I want to say is that I already use it that way, but the recognition rate on my test images is not high enough for business use.

I think we don't understand each other.

Thank you.

ShreeDevi Kumar

Mar 1, 2018, 2:17:41 AM
to tesser...@googlegroups.com
>we don't understand each other

Sorry about that.

Please run the following commands and let me know the result.

tesseract -v

tesseract --list-langs

combine_tessdata -u kor.traineddata

I do not know Korean, but feedback from other users has been that Tesseract 4 and the latest traineddata give good results.


>your last command line installs the language pack, kor.traineddata, into the tessdata directory
>I already use it that way, but the recognition rate on my test images is not high enough for business use


There are three sets of traineddata files, in tessdata, tessdata_best and tessdata_fast repositories on github. 

I was suggesting that you use the ones from tessdata_fast, which are packaged by AlexanderP for older versions of Linux along with the latest version of the programs.

I am attaching a test image and the results that I get using Tesseract. To me the accuracy looks good, whether that is acceptable for use in business is a decision you have to make.





ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f1de33d0-e0c4-4d65-88b4-57c92562ea8a%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

kor-wikipedia.txt
kor-wikipedia.png

이경준

Mar 1, 2018, 7:52:05 AM
to tesseract-ocr
No, I'm really sorry for complaining about Tesseract 4.0.

I mean that Tesseract is great, but it is not perfect (100%).

I think that Tesseract is fairly good.

But I want to work out how to customize and use Tesseract 4.0 properly.

Thank you.

First, I know that there are three types of traineddata:

tessdata / tessdata_fast / tessdata_best

So I used kor.traineddata from tessdata to test Tesseract 4.0,

and you gave me the right answer about using Tesseract (in reply to the other questions I wrote to you).

But you suggest using tessdata_fast? I don't understand why ㅜㅜㅜ

Some time ago I tried all three types, tessdata_best, tessdata_fast and tessdata, one after another,

and the most accurate one was tessdata.

The description on the GitHub page says that tessdata_best has the best accuracy of the three types,

but in practice that is not the case ㅜㅜ The recognition rate of tessdata is fairly different from that of tessdata_best,

so I use tessdata. What could cause that result?

Also, in your last answer you suggested typing "combine_tessdata -u kor.traineddata" on the command line.

What does that mean? Could you explain it to me?

(In short)

1. I want to use Tesseract 4.0 properly in my business,

so I plan to use traineddata.

Case #1:
If the three types of traineddata already published on the GitHub page (tessdata / tessdata_best / tessdata_fast) are not good enough for Korean in my business,

I would like to make a new, customized traineddata by training.

But how can I make that decision? Can I set a threshold to help me decide?

Please help me ㅜㅜ

이경준

Mar 1, 2018, 7:55:35 AM
to tesseract-ocr
An additional question:

combine_tessdata -u kor.traineddata

What does the "-u" mean?

I cannot find that option (flag) on the GitHub wiki page.

Could you give me an explanation?

ShreeDevi Kumar

Mar 1, 2018, 8:06:57 AM
to tesser...@googlegroups.com
> combine_tessdata -u kor.traineddata What does that mean? Could you explain it to me?

That command will show and unpack the components of your traineddata file.

e.g., from tessdata_fast:

combine_tessdata -u ./tessdata_fast/kor.traineddata ./tessdata_fast/kor.
Extracting tessdata components from ./tessdata_fast/kor.traineddata
Wrote ./tessdata_fast/kor.config
Wrote ./tessdata_fast/kor.lstm
Wrote ./tessdata_fast/kor.lstm-punc-dawg
Wrote ./tessdata_fast/kor.lstm-word-dawg
Wrote ./tessdata_fast/kor.lstm-number-dawg
Wrote ./tessdata_fast/kor.lstm-unicharset
Wrote ./tessdata_fast/kor.lstm-recoder
Wrote ./tessdata_fast/kor.version
Version string:4.00.00alpha:kor:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx384O1c1]
0:config:size=90, offset=192
17:lstm:size=973837, offset=282
18:lstm-punc-dawg:size=2602, offset=974119
19:lstm-word-dawg:size=605274, offset=976721
20:lstm-number-dawg:size=74, offset=1581995
21:lstm-unicharset:size=76228, offset=1582069
22:lstm-recoder:size=19034, offset=1658297
23:version:size=80, offset=1677331

ShreeDevi Kumar

Mar 1, 2018, 8:10:18 AM
to tesser...@googlegroups.com
> I would like to make a new, customized traineddata by training

OK. But training from scratch takes a lot of time. I assume that you want to finetune.

Please note that the traineddata files in tessdata and tessdata_best and tessdata_fast are NOT compatible. So, it depends on what version of tesseract program you are using.

I have already  sent you the bash script that you can modify for training.  


이경준

Mar 1, 2018, 8:21:00 AM
to tesseract-ocr
Oh, I see ㅜㅜㅜ Thank you ㅜㅜㅜㅜ I am really impressed by you.

OK. Thank you very much.

One last question: I cannot understand the traineddata types.

Do you mean that in Tesseract 4.0, tessdata_best is better than tessdata? ㅜㅜㅜ

And what is tessdata_fast? ㅜㅜㅜㅜㅜㅜ "Fast integer versions of trained models"?

ㅜㅜ Sorry ㅜㅜㅜ Please help me ㅜㅜ

ShreeDevi Kumar

Mar 1, 2018, 9:25:41 AM
to tesser...@googlegroups.com
Tesseract 4.00 alpha has two OCR engines. One is the legacy Tesseract engine, which was used in 3.0x, and the other is the neural-net-based LSTM engine available in 4.00alpha (the master branch on GitHub).

The traineddata files in tesseract-ocr/tessdata have language models compatible with both engines. If you unpack the traineddata files with combine_tessdata -u, you will see that the files from tesseract-ocr/tessdata contain more components.
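
To see that difference concretely, the component names can be pulled out of the listing that combine_tessdata -u prints (lines such as "17:lstm:size=973837, offset=282" in the earlier output). This is a rough sketch: component_names is a hypothetical helper, and the three directory names are assumed to be local checkouts of the corresponding repositories.

```shell
#!/bin/sh
# component_names is a hypothetical helper. It reads combine_tessdata output
# on stdin and prints the name from each "<index>:<name>:size=..." line.
component_names() {
    sed -n 's/^[0-9][0-9]*:\([^:]*\):size=.*$/\1/p'
}

# Assumed local checkouts of the three traineddata repositories.
for repo in ./tessdata ./tessdata_best ./tessdata_fast; do
    file="$repo/kor.traineddata"
    if [ -f "$file" ] && command -v combine_tessdata >/dev/null 2>&1; then
        out_dir=$(mktemp -d)
        echo "== $repo"
        # The listing may go to stderr, so both streams are captured.
        combine_tessdata -u "$file" "$out_dir/kor." 2>&1 | component_names
        rm -rf "$out_dir"
    fi
done
```

Unpacking into a temporary directory keeps the extracted component files out of the repository checkouts.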

While most languages are supposed to have better accuracy with the newer LSTM based engine and models, there are certain cases in which legacy tesseract is better. Hence it is still being supported.

tessdata_best files are accurate and can be used as the base for further finetune training. These are only for the LSTM based engine.

tessdata_fast files are accurate and faster in processing, so it is recommended to use them for OCR.  These are only for the LSTM based engine.

The best way for you to compare these is to use a set of test images, OCR them using the different traineddatas and compare their accuracy using OCR evaluation software such as
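
As a rough stand-in for dedicated evaluation software, a word-level comparison can be scripted with standard tools. The sketch below is an illustration under stated assumptions, not an established metric: word_accuracy is a hypothetical helper that reports the percentage of ground-truth words that diff can match, in order, in the OCR output, and test.png, test.gt.txt and the three directories are placeholder names. The --tessdata-dir and -l options are regular tesseract command-line options.

```shell
#!/bin/sh
# word_accuracy GROUND_TRUTH OCR_OUTPUT
# Hypothetical helper: prints the integer percentage of ground-truth words
# that diff matches, in order, in the OCR output.
word_accuracy() {
    gt_words=$(mktemp)
    ocr_words=$(mktemp)
    # Put one word per line so that diff aligns the texts word by word.
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' > "$gt_words"
    tr -s '[:space:]' '\n' < "$2" | sed '/^$/d' > "$ocr_words"
    total=$(grep -c '' "$gt_words" || true)
    # In diff output, lines starting with "< " are unmatched ground-truth words.
    missed=$(diff "$gt_words" "$ocr_words" | grep -c '^< ' || true)
    rm -f "$gt_words" "$ocr_words"
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $(( (total - missed) * 100 / total ))
    fi
}

# Placeholder file names (assumptions): a test image and its correct text.
IMAGE=test.png
TRUTH=test.gt.txt
if command -v tesseract >/dev/null 2>&1 && [ -f "$IMAGE" ] && [ -f "$TRUTH" ]; then
    for dir in ./tessdata ./tessdata_best ./tessdata_fast; do
        base="out_$(basename "$dir")"
        tesseract "$IMAGE" "$base" --tessdata-dir "$dir" -l kor
        echo "$dir: $(word_accuracy "$TRUTH" "$base.txt")% of words matched"
    done
fi
```

Running this over a set of representative images, rather than a single one, gives a fairer basis for deciding whether the published traineddata is good enough or whether fine-tuning is worth the effort.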

