Compute CTC targets failed while training

Zohreh Khosrobeygi

Sep 25, 2018, 4:20:27 PM

to tesseract-ocr

Hi, I am using this setup:

tesseract 4.0.0-beta.4

leptonica-1.74.4

libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8

Found AVX2

Found AVX

Found SSE

I am training on about 18,000 lines of Persian text. I generate the training data with this command:

bash -x tesstrain.sh --fonts_dir /usr/share/fonts --lang fas \
  --training_text /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/fas.training_text.txt \
  --wordlist /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/fas.wordlist.txt \
  --linedata_only --noextract_font_properties \
  --langdata_dir /home/zohreh/Desktop/tesseract-master/src/training/langdata \
  --tessdata_dir /home/zohreh/Desktop/tesseract-master/tessdata \
  --fontlist "Arial" \
  --output_dir /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Phase2

and then I run the training itself:

sudo /home/zohreh/Desktop/tesseract-master/src/training/lstmtraining \
  --traineddata /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Phase2/fas/fas.traineddata \
  --net_spec '[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1]' \
  --model_output /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Out/base \
  --learning_rate 0.001 \
  --train_listfile /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Phase2/fas.training_files.txt \
  --eval_listfile /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/v/fas.training_files.txt \
  --max_iterations 5000 &>/home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Out/basetrain.log

but the log keeps showing "Compute CTC targets failed", and the resulting model performs very poorly.

I normalized my text, and each line of the text has at most 20 tokens.
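Two quick checks can make these observations concrete: how often the failure message appears in the training log, and whether any line of the training text really stays within 20 tokens. The following is a minimal sketch, not part of Tesseract; the function names `count_ctc_failures` and `max_tokens_per_line` are invented for this illustration, and "token" is assumed to mean a whitespace-separated word.

```shell
#!/usr/bin/env bash

# Count the log lines that contain the failure message.
# grep -c prints the number of matching lines; it exits non-zero when
# nothing matches, so "|| true" keeps the function safe under "set -e".
count_ctc_failures() {
  grep -c 'Compute CTC targets failed' "$1" || true
}

# Print the largest number of whitespace-separated tokens found on any line.
# "max + 0" forces a numeric 0 for an empty file.
max_tokens_per_line() {
  awk 'NF > max { max = NF } END { print max + 0 }' "$1"
}

# Example calls (paths taken from the commands above):
#   count_ctc_failures /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/Out/basetrain.log
#   max_tokens_per_line /home/zohreh/Desktop/tesseract-master/src/training/langdata/fas/fas.training_text.txt
```

Comparing the failure count with the number of iterations shows whether only a few lines are affected or most of them.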

Could you please help me?

Shree Devi Kumar

Sep 25, 2018, 5:23:26 PM

to tesser...@googlegroups.com

>--fontlist "Arial"

Does that have good coverage for Farsi?

>--max_iterations 5000

You are trying to train from scratch with 18000 lines of text and only 5000 iterations. That will not work.

Ray has trained on hundreds of thousands of lines of text, for millions of iterations.
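To put the two numbers side by side: assuming one lstmtraining iteration processes one line image (an assumption, not something stated in this thread), 5000 iterations cover only part of an 18000-line training set. A small sketch of the arithmetic, with a helper name `passes_over_data` invented for this illustration:

```shell
#!/usr/bin/env bash

# Print how many passes over the training set a given iteration count
# amounts to, ASSUMING one iteration consumes exactly one line.
passes_over_data() {
  local iterations=$1
  local lines=$2
  awk -v i="$iterations" -v n="$lines" 'BEGIN { printf "%.2f\n", i / n }'
}

# Numbers from the original post: 5000 iterations, about 18000 lines.
passes_over_data 5000 18000
```

Under that assumption the result is below 1, meaning most lines would not be seen even once before training stops.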

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/04872dc6-7d92-4f95-9f65-8bb0cbf87c8c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Khosrobeigy.zohreh

Sep 26, 2018, 4:01:24 AM

to tesser...@googlegroups.com

I know; I am actually well versed in LSTMs. I want to resolve all the errors first and then train on a large text.

With the alpha version, I trained on about 1,000 lines and the result was not bad. But with beta 4 I get many errors.

In alpha, I had these settings:

# Use LSTM
tessedit_ocr_engine_mode 1
tessedit_pageseg_mode 6

# Arabic page layout variables
segment_nonalphabetic_script 1

# Avoid dropping rows
textord_noise_rowratio 20.0
textord_noise_syfract 0.6
textord_min_linesize 2.5

# Avoid over-estimating intra-word spacing at both row and
# block levels when using old to method
tosp_old_to_method T
tosp_old_to_constrain_sp_kn T
tosp_old_sp_kn_th_factor 4.0
tosp_only_small_gaps_for_kern T
tosp_use_pre_chopping T

I used all of these settings, but now my model does not learn.

Has anything changed in beta 4, for example in text2image?

--

Zohreh Khosrobeygi

University of Tehran, 2016

Tel: +989196042887

khosrobe...@ut.ac.ir

Shree Devi Kumar

Sep 26, 2018, 5:04:25 AM

to tesser...@googlegroups.com

>By version alpha, I trained about 1000 line and it is not so bad

You must have only fine-tuned an existing model then, whereas now you are trying to train from scratch.
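For comparison, fine-tuning starts from the LSTM component of an existing model instead of a fresh --net_spec. The sketch below only prints the two commands for inspection and does not run them. `combine_tessdata -e` and the lstmtraining flags --continue_from, --traineddata, --train_listfile, --model_output and --max_iterations are taken from the Tesseract training documentation as I recall it; the helper name `print_finetune_cmd`, the `_start.lstm` suffix and all paths in the example call are hypothetical. As far as I know, the starting model has to be a float model (tessdata_best), not an integer "fast" one.

```shell
#!/usr/bin/env bash

# Print (do not run) a fine-tuning command sequence.
# Arguments: existing traineddata, training list file, output prefix, iterations.
print_finetune_cmd() {
  local base_traineddata=$1
  local listfile=$2
  local out_prefix=$3
  local iterations=$4
  # Hypothetical name for the extracted LSTM component.
  local lstm_model="${out_prefix}_start.lstm"

  # Step 1: extract the LSTM component from the existing model.
  echo "combine_tessdata -e $base_traineddata $lstm_model"
  # Step 2: continue training from that component; no --net_spec is given.
  echo "lstmtraining --continue_from $lstm_model --traineddata $base_traineddata --train_listfile $listfile --model_output $out_prefix --max_iterations $iterations"
}

# Example call with placeholder paths:
#   print_finetune_cmd tessdata_best/fas.traineddata fas.training_files.txt Out/finetune 400
```

Printing the commands first makes it easy to check the paths before starting a long training run.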

Khosrobeigy.zohreh

Sep 26, 2018, 5:20:32 AM

to tesser...@googlegroups.com

No, I always train from scratch.

The best and fast .traineddata files do not recognize English and Persian together, and their accuracy is too low with some fonts.

I want to solve this problem.

Can fine-tuning use a different unicharset? As I read in the Tesseract wiki, the unicharset size is the number of output classes of the LSTM. So if Mr. Smith trained with, for example, a 120-entry unicharset, can I use a 160-entry unicharset when fine-tuning?

As far as I know, the number of classes in an LSTM cannot change.

All the characters in English and Persian, plus punctuation, come to around 164 characters.
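One way to see whether two models actually differ in the number of classes is to compare their unicharset files, whose first line holds the number of entries. The sketch below assumes both unicharset files have already been extracted to disk; the helper name `compare_unicharset_sizes` and the file names in the example call are hypothetical. As far as I recall, the Tesseract training documentation covers fine-tuning with a changed character set by passing the original model through --old_traineddata alongside the new --traineddata, so treat that as something to verify against the documentation rather than as a confirmed answer.

```shell
#!/usr/bin/env bash

# Compare the entry counts of two unicharset files.
# The count is read from the first line of each file.
compare_unicharset_sizes() {
  local old_n new_n changed
  old_n=$(head -n 1 "$1")
  new_n=$(head -n 1 "$2")
  if [ "$old_n" -ne "$new_n" ]; then
    changed=yes
  else
    changed=no
  fi
  echo "old=$old_n new=$new_n changed=$changed"
}

# Example call with placeholder file names:
#   compare_unicharset_sizes old/fas.lstm-unicharset new/fas.unicharset
```

If the sizes differ, a plain continuation from the old model would not match the new output layer, which is the situation the question above is about.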
