Error: Deserialize header failed while fine-tuning Tesseract


Pranav Budhwant

Aug 30, 2019, 9:48:28 AM
to tesseract-ocr
Tesseract Version:

tesseract 5.0.0-alpha
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE


Platform:
Linux pranav-vm 5.0.0-25-generic #26~18.04.1-Ubuntu SMP Thu Aug 1 13:51:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux


Current Behavior:
I am trying to fine-tune the Tesseract LSTM model. I have done the following:

1. Downloaded and extracted the current trained model for eng:
cd tesseract/src/training/
mkdir extracted
combine_tessdata -e /usr/local/share/tessdata/eng.traineddata extracted/eng.lstm


2. Generated the *.lstmf files from the *.tif and *.box files using:
for file in *.tif; do
  echo "$file"
  base=$(basename "$file" .tif)
  tesseract "$file" "$base" --psm 7 nobatch lstm.train
done


3. Generated the all-lstmf, list.train, and list.eval files using:

ls -1 *.lstmf | sort -R > all-lstmf
head -n 500 all-lstmf > list.eval
tail -n +500 all-lstmf > list.train

While generating the *.lstmf files, Tesseract threw the following warning:
Warning. Invalid resolution 0 dpi. Using 70 instead.
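
A side note on the split in step 3: `head -n 500` ends at line 500 and `tail -n +500` starts at line 500, so that one entry ends up in both list.eval and list.train. This is unrelated to the error reported below, but a disjoint split is easy to get. The following is a minimal sketch; the function name `split_lstmf_list` and its arguments are hypothetical and not part of Tesseract.

```shell
# Hypothetical helper (not from the thread): split a shuffled list of .lstmf
# paths into disjoint eval and train lists.
# Usage: split_lstmf_list <all-list> <eval-count> <eval-out> <train-out>
split_lstmf_list() {
  sll_all="$1"
  sll_count="$2"
  sll_eval="$3"
  sll_train="$4"
  # The first <eval-count> lines go to the eval list.
  head -n "$sll_count" "$sll_all" > "$sll_eval"
  # Everything from line <eval-count>+1 onward goes to the train list,
  # so no line lands in both files.
  tail -n "+$((sll_count + 1))" "$sll_all" > "$sll_train"
}
```

For the lists above this would be called as `split_lstmf_list all-lstmf 500 list.eval list.train`.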

4. Training the model using:

lstmtraining \
--model_output ~/icr/train_output/ \
--continue_from /home/pranav/tesseract/src/training/extracted/eng.lstm \
--traineddata /usr/local/share/tessdata/eng.traineddata \
--train_listfile tune/list.train \
--eval_listfile tune/list.eval


This, however, throws the following error:

Loaded file /home/pranav/tesseract/src/training/extracted/eng.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from /home/pranav/tesseract/src/training/extracted/eng.lstm
Deserialize header failed: ~/icr/train/a01-014-03.lstmf
Deserialize header failed: ~/icr/train/n04-107-01.lstmf
Deserialize header failed: ~/icr/train/g06-037f-02.lstmf
Deserialize header failed: ~/icr/train/r03-090-03.lstmf
Deserialize header failed: ~/icr/train/r03-084-09.lstmf
Deserialize header failed: ~/icr/train/g06-037e-02.lstmf
Load of page 0 failed!
Load of images failed!!
Deserialize header failed: ~/icr/train/j01-066-09.lstmf
Deserialize header failed: ~/icr/train/k04-075-02.lstmf
Deserialize header failed: ~/icr/train/n02-127-00.lstmf


I generated the *.box files on Windows, following the guidelines for Tesseract 4.0, and converted their line endings to Unix format using the dos2unix converter.
I have attached a sample .box file and the all-lstmf file for reference.





Attachments: all-lstmf, a01-000u-00.box

Pranav Budhwant

Sep 3, 2019, 7:40:28 AM
to tesseract-ocr
I tried the same with Tesseract 4.1, and this time I generated all the files on Ubuntu instead of creating them on Windows and then converting them to Unix format. It still gives the same error. Can anyone please help me out here? I don't know what I'm doing wrong.

Shree Devi Kumar

Sep 3, 2019, 12:21:58 PM
to tesseract-ocr
Test with 5-10 files to figure out the correct process. The files are probably not in the correct location or format.
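
One way to check the "location" half of this before running lstmtraining is to walk a list file and confirm that each entry points at a real, non-empty file. The sketch below is only an illustration; `check_listfile` is a hypothetical name, not a Tesseract tool. It also flags entries that begin with a literal `~`, since a path read from a file is not passed through the shell and so would not get tilde expansion. The log above prints paths of that form, but the thread does not establish whether that played a part in the error.

```shell
# Hypothetical helper: report entries in a list file that cannot be used as-is.
# Prints one line per problem and returns non-zero if any were found.
# Relative paths are checked against the current working directory.
check_listfile() {
  clf_list="$1"
  clf_bad=0
  while IFS= read -r clf_path || [ -n "$clf_path" ]; do
    # Skip blank lines.
    [ -z "$clf_path" ] && continue
    case "$clf_path" in
      "~"*)
        # Tilde expansion is done by the shell, not by a program reading a file.
        echo "literal tilde (not expanded): $clf_path"
        clf_bad=1
        continue
        ;;
    esac
    if [ ! -f "$clf_path" ]; then
      echo "missing: $clf_path"
      clf_bad=1
    elif [ ! -s "$clf_path" ]; then
      echo "empty: $clf_path"
      clf_bad=1
    fi
  done < "$clf_list"
  return "$clf_bad"
}
```

Running `check_listfile tune/list.train` from the directory where lstmtraining is started, on a list of 5-10 files, should print nothing if every entry resolves.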


shree

Sep 3, 2019, 1:06:34 PM
to tesseract-ocr
Your box file shows Windows CRLF line endings rather than Unix LF. Try opening it in Notepad++ to check.
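
On Linux the same check can be done from the command line without an editor. The following is a minimal sketch; `list_crlf_files` is a hypothetical helper name. It simply looks for a carriage-return byte, which is the CR in a CRLF line ending.

```shell
# Hypothetical helper: print the names of the given files that contain a
# carriage return (the CR of a Windows CRLF line ending).
list_crlf_files() {
  # With no arguments grep would wait on stdin, so return early.
  [ "$#" -gt 0 ] || return 0
  # Build a literal CR character in a portable way.
  lcf_cr=$(printf '\r')
  # -l prints only the names of files with at least one match.
  grep -l "$lcf_cr" -- "$@"
}
```

For example, `list_crlf_files *.box` lists the box files that still need converting, for instance with dos2unix as mentioned earlier in the thread.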

Pranav Budhwant

Sep 3, 2019, 1:22:34 PM
to tesseract-ocr
@Shree, thanks for the help! There were actually two things wrong with what I was doing: I had forgotten to add a TAB at the end to mark the end of line, and I have now also generated the box files on Ubuntu. It works now!
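
For anyone hitting the same problem, a quick way to spot box files that lack the end-of-line marker is sketched below. It assumes the marker is a line that begins with a TAB character followed by the line's coordinates, which is one reading of the description above; the thread does not show the exact line, so adjust the pattern if your box files use a different layout. The name `list_boxes_without_tab_marker` is hypothetical.

```shell
# Hypothetical helper: print the names of box files that contain no line
# starting with a TAB character.
# Assumption: the end-of-line marker is a line that begins with a TAB.
list_boxes_without_tab_marker() {
  # Build a literal TAB character in a portable way.
  lbt_tab=$(printf '\t')
  for lbt_file in "$@"; do
    # -q: only the exit status matters; no match means the marker is absent.
    if ! grep -q "^$lbt_tab" -- "$lbt_file"; then
      echo "$lbt_file"
    fi
  done
}
```

Running `list_boxes_without_tab_marker *.box` should print nothing once every box file has at least one such line.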
