Tesseract Version:
tesseract 5.0.0-alpha leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
Platform:
Linux pranav-vm 5.0.0-25-generic #26~18.04.1-Ubuntu SMP Thu Aug 1 13:51:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Current Behavior:
I am trying to fine-tune the Tesseract LSTM model. I have done the following:
1. Downloaded the current trained model for eng and extracted its LSTM component:
cd tesseract/src/training/
mkdir extracted
combine_tessdata -e /usr/local/share/tessdata/eng.traineddata extracted/eng.lstm
2. Generated the *.lstmf files from *.tif and *.box files using:
for file in *.tif; do
    echo "$file"
    base=$(basename "$file" .tif)
    tesseract "$file" "$base" --psm 7 nobatch lstm.train
done
3. Generated all-lstmf and list.train, list.eval files using:
ls -1 *.lstmf | sort -R > all-lstmf
head -n 500 all-lstmf > list.eval
tail -n +500 all-lstmf > list.train
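As a side note on the split above: `head -n 500` keeps lines 1 to 500 and `tail -n +500` starts at line 500, so line 500 ends up in both lists. A minimal sketch of a split without that overlap follows; `split_list` is a hypothetical helper name, not something shipped with Tesseract, and it writes `list.eval` and `list.train` into the current directory just as the commands above do.

```shell
# Hypothetical helper: split a shuffled list file into eval and train lists
# so that no line appears in both.
#   $1 = list file (e.g. all-lstmf), $2 = number of lines for list.eval
split_list() {
    list=$1
    n_eval=$2
    # First n_eval lines go to the evaluation list.
    head -n "$n_eval" "$list" > list.eval
    # Everything from line n_eval + 1 onward goes to the training list.
    tail -n +"$((n_eval + 1))" "$list" > list.train
}
```

With the file names used above, this would be called as `split_list all-lstmf 500`.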
While generating the *.lstmf files (step 2), Tesseract printed the following warning:
Warning. Invalid resolution 0 dpi. Using 70 instead.
4. Started training the model using:
lstmtraining \
--model_output ~/icr/train_output/ \
--continue_from /home/pranav/tesseract/src/training/extracted/eng.lstm \
--traineddata /usr/local/share/tessdata/eng.traineddata \
--train_listfile tune/list.train \
--eval_listfile tune/list.eval
This, however, fails with the following errors:
Loaded file /home/pranav/tesseract/src/training/extracted/eng.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from /home/pranav/tesseract/src/training/extracted/eng.lstm
Deserialize header failed: ~/icr/train/a01-014-03.lstmf
Deserialize header failed: ~/icr/train/n04-107-01.lstmf
Deserialize header failed: ~/icr/train/g06-037f-02.lstmf
Deserialize header failed: ~/icr/train/r03-090-03.lstmf
Deserialize header failed: ~/icr/train/r03-084-09.lstmf
Deserialize header failed: ~/icr/train/g06-037e-02.lstmf
Load of page 0 failed!
Load of images failed!!
Deserialize header failed: ~/icr/train/j01-066-09.lstmf
Deserialize header failed: ~/icr/train/k04-075-02.lstmf
Deserialize header failed: ~/icr/train/n02-127-00.lstmf
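The paths in the log above start with a literal `~`. Tilde expansion is done by the shell, so a `~` stored inside a list file is not expanded when the file is read line by line. Whether that is the cause of the failure here is an assumption I have not confirmed. The sketch below is one way to sanity-check a list file before training; `check_list` is a hypothetical helper name, and it only tests that each entry is a non-empty file and does not start with `~`.

```shell
# Hypothetical helper: report list entries that cannot be opened as written.
#   $1 = list file (e.g. list.train)
# Returns success only if every entry is an existing, non-empty file.
check_list() {
    bad=0
    while IFS= read -r path; do
        case $path in
            "~"*)
                # A literal tilde inside a file is never expanded by the shell.
                echo "unexpanded tilde: $path"
                bad=$((bad + 1))
                ;;
            *)
                if [ ! -s "$path" ]; then
                    echo "missing or empty: $path"
                    bad=$((bad + 1))
                fi
                ;;
        esac
    done < "$1"
    [ "$bad" -eq 0 ]
}
```

For example, `check_list tune/list.train && check_list tune/list.eval` would print any suspicious entries before `lstmtraining` is started.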
I generated the *.box files on Windows, following the guidelines for Tesseract 4.0, and converted their line endings to Unix format with dos2unix.
I have attached a sample .box file and the all-lstmf file for reference.