START_MODEL gives Segmentation Failure Error


Madhav Pandey

Jun 23, 2023, 1:58:38 AM
to tesseract-ocr
Hey Guys,

I hope you're all doing well. We have been working on training a model using a handwritten font in Tesseract OCR, and we have encountered an issue related to the START_MODEL flag.

Currently, we are using the following command for training:

make tesstrain START_MODEL=hin TESSDATA=tessdata_best

However, whenever we include the START_MODEL flag, we consistently encounter the "Segmentation Failure" error. Strangely, when we omit the START_MODEL flag and only specify the TESSDATA path, the training process runs without any failures.
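
A quick pre-flight check can rule out a missing or misplaced model file before training starts. The sketch below assumes that tesstrain looks for the start model at TESSDATA/START_MODEL.traineddata; the helper name check_start_model is hypothetical and not part of tesstrain.

```shell
# Hypothetical helper: succeed only if <tessdata_dir>/<model>.traineddata
# exists and is not empty. Assumes that is where tesstrain expects the file.
check_start_model() {
  tessdata_dir=$1
  model=$2
  model_file="${tessdata_dir}/${model}.traineddata"
  if [ -s "$model_file" ]; then
    echo "found: $model_file"
  else
    echo "missing or empty: $model_file" >&2
    return 1
  fi
}

# Example with the values from the command above; "|| true" keeps a failed
# check from aborting a script that runs with "set -e".
check_start_model tessdata_best hin || true
```

A passing check only shows that the file is present; it says nothing about whether the model is suitable as a starting point for training.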

I have a couple of questions regarding this issue:
1. When we omit the START_MODEL flag and solely provide the TESSDATA path, which base model does Tesseract use for training?
2. Is there any specific reason why we are encountering the "Segmentation Failure" error when using the START_MODEL flag?

I would appreciate any insights or guidance you can provide regarding these questions. Thank you all for your support.

Zdenko Podobny

Jun 24, 2023, 10:05:26 AM
to tesser...@googlegroups.com
Hello,

If you are really looking for help, you need to provide full details (e.g. the whole training log, how you installed tesseract, which version of tesseract you use, how you installed the model (especially the hin model), an example of training data that helps to replicate the "Segmentation Failure", etc.).


Zdenko


On Fri, Jun 23, 2023 at 7:58, Madhav Pandey <mad.dev...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/40c683aa-3e34-40b5-bcc1-199227c52ab0n%40googlegroups.com.

Madhav Pandey

Jun 27, 2023, 1:35:01 AM
to tesseract-ocr
Hi,

What is a good way to capture the logs for model training?

Thanks!

Madhav Pandey

Jun 28, 2023, 10:48:36 AM
to tesseract-ocr
Hi,

Please find the detail below:

1. Dataset: It is available as a zip archive of Marathi handwritten text here: https://github.com/codeatpanorama/training-data/blob/main/marathi_handwritten_text.zip
3. Command Used: nohup make training MODEL_NAME=mar_hw START_MODEL=mar TESSDATA=tessdata_best MAX_ITERATIONS=10000 LANG_TYPE=Indic > plot/TESSTRAIN.LOG &
4. We installed mar.traineddata using this command: wget https://github.com/tesseract-ocr/tessdata/raw/main/mar.traineddata -P tessdata_best
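
The command in item 3 redirects standard output to plot/TESSTRAIN.LOG. Adding an explicit 2>&1 makes sure that anything written to standard error ends up in the same file, however the command is launched. The sketch below shows the redirection with a stand-in function (fake_training, a hypothetical name) so that it runs without Tesseract installed.

```shell
# Stand-in for the real training command, so the sketch runs anywhere.
# It writes one line to stdout and one line to stderr.
fake_training() {
  echo "iteration 100"
  echo "Segmentation fault" >&2
}

LOG_FILE=$(mktemp)

# "> file 2>&1" sends stdout to the file, then points stderr at the same place.
fake_training > "$LOG_FILE" 2>&1

# With the real command from this thread, the same redirection would read:
#   nohup make training MODEL_NAME=mar_hw START_MODEL=mar TESSDATA=tessdata_best \
#     MAX_ITERATIONS=10000 LANG_TYPE=Indic > plot/TESSTRAIN.LOG 2>&1 &
cat "$LOG_FILE"
```

The order matters: writing 2>&1 before the file redirection would leave standard error pointing at the terminal.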

This is the output of tesseract --version:

tesseract 4.1.1
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4

Please help us get unblocked here.

Also, when we omit the START_MODEL flag and provide only the TESSDATA path, which base model does Tesseract use for training?

Thanks!

Madhav Pandey

Jun 29, 2023, 8:50:01 PM
to tesseract-ocr
Hi Zdenop,

Can you please provide some input here on why we might be getting this error?

Thanks!

Madhav Pandey

Jul 6, 2023, 1:50:03 PM
to tesseract-ocr
Hi Zdenop, I have provided all the information that you asked for in this thread. Can you please help us here?

If anything is missing, please let me know and I will provide whatever you need.

Thanks!

Zdenko Podobny

Jul 8, 2023, 11:17:09 AM
to tesser...@googlegroups.com
Let's start with the basics:

The current leptonica version is 1.83.1 https://github.com/DanBloomberg/leptonica/releases
The current tesseract version is 5.3.1 https://github.com/tesseract-ocr/tesseract/releases

Use the latest version if there is a problem. Nobody wants to waste time with (probably) already fixed issues.
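
One way to compare an installed version against a minimum before reporting a problem is sketched below. Both helper names are hypothetical, the sample string comes from the version output posted earlier in this thread, and the comparison relies on sort -V (version sort), which is assumed to be available.

```shell
# Hypothetical helper: print the version number from the first line of
# "tesseract --version" output, e.g. "tesseract 4.1.1" -> "4.1.1".
parse_tesseract_version() {
  printf '%s\n' "$1" | head -n 1 | awk '{print $2}'
}

# Hypothetical helper: succeed if version $1 is at least version $2.
# Relies on "sort -V" (version sort), which is assumed to be available.
version_at_least() {
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  [ "$lowest" = "$2" ]
}

# Sample taken from the version output posted earlier in this thread.
# On a real system:
#   installed=$(parse_tesseract_version "$(tesseract --version 2>&1)")
installed=$(parse_tesseract_version "tesseract 4.1.1")

if version_at_least "$installed" "5.3.1"; then
  echo "tesseract $installed is at least 5.3.1"
else
  echo "tesseract $installed is older than 5.3.1"
fi
```

Some builds print a version string in a different shape, so the parsing step may need adjusting.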

"4. We installed mar.traineddata ... -P tessdata_best"  Why you pretend that tessdata is tessdata_best???
If you have a problem, stick to the manual/documentation.
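
To make the point concrete: the wget command quoted from item 4 fetched mar.traineddata from the tesseract-ocr/tessdata repository and saved it in a directory named tessdata_best. A sketch of fetching the file from the tesseract-ocr/tessdata_best repository instead follows. The URL layout is assumed to mirror the one used in that command and should be verified, and the helper name best_model_url is hypothetical.

```shell
# Hypothetical helper: build the download URL for a model in the
# tesseract-ocr/tessdata_best repository (assumed to mirror the URL layout
# of the tessdata repository used earlier in the thread).
best_model_url() {
  printf 'https://github.com/tesseract-ocr/tessdata_best/raw/main/%s.traineddata\n' "$1"
}

MODEL=mar
TESSDATA_DIR=tessdata_best

# Print the command instead of running it, so the sketch needs no network.
# Drop the "echo" to actually download the file.
echo wget "$(best_model_url "$MODEL")" -P "$TESSDATA_DIR"
```

If an older mar.traineddata is already present in the target directory, it may need to be removed first so that the new download is not saved under a different name.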

Zdenko


On Thu, Jul 6, 2023 at 19:50, Madhav Pandey <mad.dev...@gmail.com> wrote:

Madhav Pandey

Jul 9, 2023, 3:34:41 PM
to tesseract-ocr
Thanks for responding.

I will update the version and see if it fixes things for me. I will keep you posted. 
