Compute CTC targets failed while training

Zohreh Khosrobeygi

Sep 25, 2018, 4:20:27 PM
to tesseract-ocr
Hi, I am using this version:
tesseract 4.0.0-beta.4
 leptonica-1.74.4
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8

 Found AVX2
 Found AVX
 Found SSE
I have trained on about 18,000 lines of Persian text. I used this command:

bash -x tesstrain.sh --fonts_dir /usr/share/fonts --lang fas    --training_text   /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/fas.training_text.txt --wordlist /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/fas.wordlist.txt  --linedata_only \
  --noextract_font_properties --langdata_dir /home/zohreh/Desktop/tesseract-master/src/training/langdata \
  --tessdata_dir /home/zohreh/Desktop/tesseract-master/tessdata \
  --fontlist "Arial" --output_dir /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Phase2
and then ran this:
sudo /home/zohreh/Desktop/tesseract-master/src/training/lstmtraining   \
  --traineddata /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Phase2/fas/fas.traineddata   --net_spec '[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1]' \
  --model_output /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Out/base --learning_rate 0.001 \
  --train_listfile /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Phase2/fas.training_files.txt \
  --eval_listfile /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/v/fas.training_files.txt \
  --max_iterations 5000 &>/home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Out/basetrain.log
but it always shows "Compute CTC targets failed" and the model does not perform well at all.
I normalized my text, and each line of the text has at most 20 tokens.
Could you please help me?
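
One generic reason a CTC target cannot be built is that the network produces fewer output time steps than the label sequence needs. Whether that is the cause here is an assumption, but it can be checked per line. The sketch below uses hypothetical helper names, treats one character as one label (the real unicharset encoding may differ), and assumes the `Mp3,3` layer in the net_spec shrinks the image width by a factor of 3.

```python
def ctc_min_steps(labels):
    """Smallest number of output time steps CTC needs to emit `labels`."""
    # One step per label, plus one blank between each pair of identical neighbours.
    repeats = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return len(labels) + repeats


def line_fits(labels, image_width_px, horizontal_downscale=3):
    """True if a line image of this width leaves enough time steps for its labels.

    horizontal_downscale=3 is an assumption taken from the Mp3,3 pooling layer.
    """
    steps = image_width_px // horizontal_downscale
    return steps >= ctc_min_steps(labels)
```

A line that fails `line_fits` is too narrow for its own transcription, which usually points at a rendering or segmentation problem rather than at the network.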
 

Shree Devi Kumar

Sep 25, 2018, 5:23:26 PM
to tesser...@googlegroups.com
> --fontlist "Arial"

Does that have good coverage for Farsi?
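
A rough way to answer the coverage question is to compare the characters in the training text with the code points the font maps. The sketch below is only an illustration: `uncovered_chars` and `describe` are hypothetical helpers, and the set of font code points is assumed to come from elsewhere (for example, the keys of fontTools' `TTFont(path).getBestCmap()`). Having a glyph for a code point does not guarantee correct Arabic-script shaping, so this is a necessary check, not a sufficient one.

```python
import unicodedata


def uncovered_chars(text, font_codepoints):
    """Return the distinct non-whitespace characters of `text` the font lacks."""
    needed = {ch for ch in text if not ch.isspace()}
    return sorted(ch for ch in needed if ord(ch) not in font_codepoints)


def describe(chars):
    """Format characters as 'U+XXXX NAME' so missing glyphs are easy to read."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in chars]
```

An empty result for the whole training text means every character at least has a glyph in the chosen font.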


> --max_iterations 5000

You are trying to train from scratch with 18000 lines of text and only 5000 iterations. That will not work.

Ray has trained on hundreds of thousands of lines of text and millions of iterations.
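
To put the numbers in perspective, here is a small calculation. It assumes that one lstmtraining iteration consumes one training line; `passes_over_data` is a hypothetical helper name.

```python
def passes_over_data(max_iterations, num_lines):
    """Approximate number of passes over the training set (epochs)."""
    return max_iterations / num_lines


# Figures from this thread: 5000 iterations over roughly 18000 lines.
print(f"{passes_over_data(5000, 18000):.2f}")  # 0.28
```

Under that assumption the network stops before it has seen even a third of the lines once.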

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/04872dc6-7d92-4f95-9f65-8bb0cbf87c8c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Khosrobeigy.zohreh

Sep 26, 2018, 4:01:24 AM
to tesser...@googlegroups.com
I know; I am actually well versed in LSTMs. I want to resolve all the errors first and then train on a large text.
With the alpha version, I trained on about 1,000 lines and the result was not so bad. But with beta 4 I get many errors.
In alpha:
# Use LSTM
tessedit_ocr_engine_mode 1
tessedit_pageseg_mode 6

# Arabic page layout variables
segment_nonalphabetic_script 1

# Avoid dropping rows
textord_noise_rowratio 20.0
textord_noise_syfract 0.6

textord_min_linesize 2.5

# Avoid over-estimating intra-word spacing at both row and
# block levels when using old to method
tosp_old_to_method T
tosp_old_to_constrain_sp_kn T
tosp_old_sp_kn_th_factor 4.0

tosp_only_small_gaps_for_kern T
tosp_use_pre_chopping T
I used all of these, but now my model doesn't learn.
Has anything changed in beta 4, for example in text2image?
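
When comparing setups between two versions, it can help to diff the config variables mechanically. The sketch below assumes the `name value` format with `#` comments shown above; `parse_config` and `diff_configs` are hypothetical helpers, not part of Tesseract.

```python
def parse_config(text):
    """Parse 'name value' lines into a dict, ignoring blanks and # comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        params[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return params


def diff_configs(old, new):
    """Return {name: (old_value, new_value)} for variables that differ."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}
```

Running `diff_configs` on the alpha and beta config files shows at a glance which variables were added, dropped, or changed.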


--
Zohreh Khosrobeygi
University of Tehran, 2016

Shree Devi Kumar

Sep 26, 2018, 5:04:25 AM
to tesser...@googlegroups.com

>By version alpha, I trained about 1000 line and it is not so bad

You must have only fine-tuned an existing model then, and now you are trying to train from scratch.

Khosrobeigy.zohreh

Sep 26, 2018, 5:20:32 AM
to tesser...@googlegroups.com
No, I always train from scratch.
The best and fast traineddata files do not recognize English and Persian, and the accuracy is too low for some fonts.
I want to solve this problem.
Can fine-tuning use a different unicharset? As I read in the Tesseract wiki, the unicharset size is the number of output classes of the LSTM. So if Mr. Smith trained with, for example, a unicharset of 120 characters, can I have 160 in fine-tuning?
As far as I know, the number of classes in an LSTM cannot change.
All the characters in English and Persian, plus punctuation, come to around 164 characters.
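
A figure like 164 can be checked by counting the distinct characters in the combined training text. This is only a sketch with a hypothetical helper name; the unicharset that Tesseract builds may differ, since it can include special entries and apply its own normalization.

```python
def distinct_chars(*texts):
    """Return the sorted distinct non-whitespace characters across all texts."""
    return sorted({ch for text in texts for ch in text if not ch.isspace()})


# Example use: len(distinct_chars(english_text, persian_text)) estimates the
# number of output classes needed for a combined English + Persian model.
```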

