"Can't encode transcript" error when using "lstmtraining" command with Tess4.0


roberty...@gmail.com

Jul 25, 2017, 3:16:47 AM
to tesseract-ocr
Hello,

I ran the following command to train my own traineddata:
lstmtraining --model_output ~/tesstutorial/chituned_from_chisim/chituned \
  --continue_from ~/tesstutorial/chituned_from_chisim/chi_sim.lstm \
  --train_listfile ~/tesstutorial/chitest/chi.training_files.txt \
  --eval_listfile ~/tesstutorial/chitest/chi.training_files.txt \
  --target_error_rate 0.01 

Tesseract 4.0 then reports the error shown in the attached image: "Can't encode transcript", for text such as "化简(-x2)3的结果是...".
Why does this happen? Can you help me?

ShreeDevi Kumar

Jul 25, 2017, 3:23:08 AM
to tesser...@googlegroups.com
That error is because some characters in your training text are not part of the unicharset of chi_sim.

You are attempting fine-tuning, which fails with this error when the training text contains such characters. Training with the top layer replaced will work.

I suggest that you wait 2-3 weeks for Ray to upload new traineddata for all languages. 

Let us know if any specific characters are missing from the existing traineddata.
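To find out which characters are missing, something like the following may help. It is only a minimal sketch under stated assumptions: it assumes the unicharset is a plain-text file whose first line is the entry count and whose remaining lines each start with the unichar followed by a space, and that the space character is stored as a `NULL` entry. The helper names `load_unicharset` and `missing_chars` and the sample data are hypothetical, not part of Tesseract. It compares single code points only, so multi-character unichars and any text normalization that Tesseract applies are not handled.

```python
def load_unicharset(lines):
    """Collect the unichars from unicharset-style lines.

    Assumption: line 0 is the entry count; every later line starts with
    the unichar, followed by a space and its properties.
    """
    chars = set()
    for line in lines[1:]:
        fields = line.split()
        # Assumption: the "NULL" entry stands for the space character.
        if fields and fields[0] != "NULL":
            chars.add(fields[0])
    return chars


def missing_chars(text, charset):
    """Return the sorted non-whitespace characters of text not in charset."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in charset})


# Hypothetical, heavily abbreviated sample; real entries carry more properties.
SAMPLE = ["4", "NULL 0 NULL 0", "化 0", "简 0", "的 0"]

if __name__ == "__main__":
    print(missing_chars("化简 ∠ 的", load_unicharset(SAMPLE)))
```

With a real file, the lines could be read with `open(path, encoding="utf-8").read().splitlines()`, where `path` points at the unicharset extracted from the traineddata (the exact file name depends on your setup).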

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e2e1d749-a55d-4355-b128-5d0fe2181e19%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

roberty...@gmail.com

Jul 25, 2017, 3:34:21 AM
to tesseract-ocr
Thanks for your help.

I will fine-tune with the new traineddata once it is uploaded in 2-3 weeks, and report back on which specific characters are missing.


roberty...@gmail.com

Aug 1, 2017, 4:15:32 AM
to tesseract-ocr
Hello Shree,

Sorry to ask again, but can I use more than one unicharset, such as chi_sim and eng, for fine-tuning?
Some of the special characters may be in other unicharsets. If I find them there, I could train my traineddata with those unicharsets as well, so that the special characters can be encoded.

Thanks, I look forward to your reply.



ShreeDevi Kumar

Aug 1, 2017, 4:45:07 AM
to tesser...@googlegroups.com
Ray has uploaded new traineddata files in https://github.com/tesseract-ocr/tessdata/tree/master/best

Why don't you first try recognition with those?

ShreeDevi

roberty...@gmail.com

Aug 1, 2017, 5:36:40 AM
to tesseract-ocr
OK, I will give it a try. Thanks.


roberty...@gmail.com

Aug 1, 2017, 6:03:13 AM
to tesseract-ocr
When I use the new traineddata, I get an error saying that chi_sim.traineddata cannot be found. Does the new traineddata only support the Tesseract 4.0 alpha release? I am using the newest code release.



roberty...@gmail.com

Aug 4, 2017, 2:51:44 AM
to tesseract-ocr
Hi Shree,

I have also tried the new traineddata for recognizing Simplified Chinese on Linux (Ubuntu), and it works. However, the new traineddata does not seem to be supported on Windows.

Even with the new traineddata on Ubuntu, some special symbols are still not recognized, for example '∠', '△', '≌', and '≥'.

I would like to improve recognition of these special symbols, but I have not found a good way to do it yet. Can you give me some advice?

Thanks.

