Getting Error: No such file or directory: 'data/foo/all-lstmf'


Madhav Pandey

Apr 25, 2023, 6:22:01 AM
to tesseract-ocr
Hi Everyone,

I am relatively new to Tesseract and to OCR as a whole.

I have been trying to set up model training locally using the guide https://github.com/tesseract-ocr/tesstrain/blob/main/README.md

I have copied the sample training data into the `data/foo` directory, but when I run `make training`, I always end up with this error:

```
Failed to read data from: data/foo/all-gt
Wrote unicharset file data/foo/unicharset
python3 shuffle.py 0 "data/foo/all-lstmf"
Traceback (most recent call last):
  File "shuffle.py", line 24, in <module>
    fd0 = open(sys.argv[2], 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'data/foo/all-lstmf'
make: *** [data/foo/all-lstmf] Error 1
```

Can someone please help resolve this error?

Thank you!

Zdenko Podobny

Apr 25, 2023, 6:27:28 AM
to tesser...@googlegroups.com
Did you install all the necessary dependencies?
Did you check and fix all errors (before this one) in the training output?
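
A quick way to answer the dependency question is to check that the command-line tools are on PATH. This is only a minimal sketch: the tool list below is an assumption on my part, not an official list, so compare it with the tesstrain README before relying on it.

```python
import shutil

# Assumed list of tools used during training; this is NOT an official list.
# Adjust it to match the tesstrain README for your version.
ASSUMED_TOOLS = [
    "make",
    "python3",
    "tesseract",
    "unicharset_extractor",
    "combine_lang_model",
    "lstmtraining",
    "combine_tessdata",
]


def missing_tools(tools=ASSUMED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All assumed tools were found on PATH.")
```

If anything is reported as missing, install it (or fix PATH) before running `make training` again.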

Zdenko



Madhav Pandey

Apr 26, 2023, 12:07:06 AM
to tesseract-ocr
@zdenop 

This is the entire training output:

```
make training TESSDATA=./usr/local/share/tessdata
unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"

Failed to read data from: data/foo/all-gt
Wrote unicharset file data/foo/unicharset
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.box"
set -x; \
        tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif" data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
+ tesseract data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.box"
set -x; \
        tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
+ tesseract data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.box"
set -x; \
        tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif" data/foo-ground-truth/alexis_ruhe01_1852_0035_019 --psm 13 lstm.train
+ tesseract data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif data/foo-ground-truth/alexis_ruhe01_1852_0035_019 --psm 13 lstm.train

python3 shuffle.py 0 "data/foo/all-lstmf"
Traceback (most recent call last):
  File "/Users/m/Code/git/tesstrain/shuffle.py", line 24, in <module>

    fd0 = open(sys.argv[2], 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'data/foo/all-lstmf'
make: *** [data/foo/all-lstmf] Error 1
```

For this run, I have just 3 text and tif files.

I followed the macOS installation section on this page: https://tesseract-ocr.github.io/tessdoc/Compiling.html#macos and installed everything mentioned there.

Do I have to install anything else before running the training? 

Zdenko Podobny

Apr 26, 2023, 3:47:55 PM
to tesser...@googlegroups.com
make training TESSDATA=./usr/local/share/tessdata
unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"
Failed to read data from: data/foo/all-gt....

This indicates you have already run a training that failed.
Clean your training and start it once again. Pay attention to why "data/foo/all-gt" is not created (there will be an error message).
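
To make the earliest problem easier to spot, the saved `make` output can be scanned for the first suspicious line instead of reading only the final traceback. This is a rough sketch; the patterns are hypothetical and will not catch every kind of failure.

```python
import re

# Hypothetical patterns for lines worth a closer look; extend as needed.
FAILURE_PATTERN = re.compile(
    r"failed to|no such file|no rule to make target|error",
    re.IGNORECASE,
)


def first_failure(log_text):
    """Return (line_number, line) for the earliest suspicious line, or None."""
    for number, line in enumerate(log_text.splitlines(), start=1):
        if FAILURE_PATTERN.search(line):
            return number, line.strip()
    return None


if __name__ == "__main__":
    # Lines taken from the output posted earlier in this thread.
    log = "\n".join([
        'unicharset_extractor --output_unicharset "data/foo/unicharset" '
        '--norm_mode 2 "data/foo/all-gt"',
        "Failed to read data from: data/foo/all-gt",
        "Wrote unicharset file data/foo/unicharset",
        "FileNotFoundError: [Errno 2] No such file or directory: "
        "'data/foo/all-lstmf'",
    ])
    print(first_failure(log))
```

On the output posted above, this points at the `Failed to read data from: data/foo/all-gt` line rather than at the `shuffle.py` traceback, which is only a consequence of it.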

Zdenko



Madhav Pandey

Jun 2, 2023, 5:39:01 AM
to tesseract-ocr
Hi Zdenko,

At what step in the Makefile is the all-gt file created? I am still unable to move forward with the custom model training.

Any help would be greatly appreciated. Thanks!

Madhav Pandey

Jun 5, 2023, 2:22:46 AM
to tesseract-ocr
Hi Zdenop,

Apologies. I got your name wrong in the thread. 

Can you please help me resolve this issue? Because the `make training` command was not creating the all-gt file, I created it manually and placed it in the MODEL_NAME directory.

I created it by copying all the single lines from the text files into the all-gt file. I am not sure if this is the right approach. Please correct me if I am wrong.

After doing this, I am getting this error:

python3 shuffle.py 0 "data/Apex/all-lstmf"

Traceback (most recent call last):
  File "/Users/madpande/Code/git/tesseract_tutorial/tesstrain/shuffle.py", line 24, in <module>

    fd0 = open(sys.argv[2], 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'data/Apex/all-lstmf'


I am pretty sure I am missing something here. Please help!

Thanks!

Zdenko Podobny

Jun 6, 2023, 8:03:17 AM
to tesser...@googlegroups.com
Do not create files manually.
If "make training" does not work, it means:
  1. you are missing some dependency or the input data are wrong
  2. you also missed the error message for 1.
I strongly suggest that you start training from the beginning (including cloning tesstrain) and pay attention to all messages:

cd tesstrain
make tesseract-langdata
mkdir tessdata_best
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata -P tessdata_best
unzip ocrd-testset.zip -d data/ocrd-ground-truth
make training MODEL_NAME=ocrd TESSDATA=tessdata_best MAX_ITERATIONS=10000


Zdenko



Madhav Pandey

Jun 7, 2023, 10:29:22 PM
to tesseract-ocr
Thank you so much for this. It works on the default dataset provided.

However, when I try to work with Hindi text, I get the following error:

unicharset_extractor --output_unicharset "data/ocr/unicharset" --norm_mode 2 "data/ocr/all-gt"
Bad box coordinates in boxfile string! उसकी गाड़ी बड़ी थी और मेरी दाढ़ी
Extracting unicharset from plain text file data/ocr/all-gt
Wrote unicharset file data/ocr/unicharset
make: *** No rule to make target 'data/ocr-ground-truth/1.lstmf', needed by 'data/ocr/all-lstmf'.  Stop.

Going through some of your responses on similar issues, you mentioned checking the data format. Can you please specify the requirements for the ground-truth data?
I have the text file and the TIFF file for each single-line text image. What are the other requirements?

Thanks!

Madhav Pandey

Jun 8, 2023, 1:31:24 AM
to tesseract-ocr
We were able to fix this issue.

Our training set contained files with the extension .tiff, but .tif was expected.
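
In case it helps others, a check along these lines would have caught it early. It is only a sketch and assumes that every `NAME.gt.txt` needs a `NAME.tif` next to it, which is what we ran into. Other image extensions may also be accepted, and the function name is just illustrative.

```python
from pathlib import Path


def check_ground_truth(directory):
    """Report .tiff images and transcriptions without a matching .tif image.

    Assumes each NAME.gt.txt should sit next to a NAME.tif file.
    """
    root = Path(directory)
    # Images that use the longer extension and may be ignored by the Makefile.
    wrong_extension = sorted(p.name for p in root.glob("*.tiff"))
    # Transcriptions for which no NAME.tif exists.
    suffix = ".gt.txt"
    unpaired = sorted(
        p.name
        for p in root.glob("*" + suffix)
        if not (root / (p.name[: -len(suffix)] + ".tif")).exists()
    )
    return wrong_extension, unpaired


if __name__ == "__main__":
    # The directory name follows the MODEL_NAME-ground-truth layout seen above.
    tiff_files, unpaired_texts = check_ground_truth("data/ocr-ground-truth")
    print("Files ending in .tiff:", tiff_files)
    print("Transcriptions without a .tif image:", unpaired_texts)
```

Renaming the reported `.tiff` files to `.tif` was enough in our case.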

But now we are seeing this error:

Compute CTC targets failed!

Do you have any knowledge on what might be happening here?

Thanks!

tesseract-ocr

Sep 19, 2023, 12:25:54 PM
to tesseract-ocr
>Compute CTC targets failed!

That error is due to empty image files. 
The text2image script is riddled with bugs. It creates null boxes as well as null images.
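
A simple way to look for such samples is sketched below. It only flags zero-byte image files and blank transcriptions, and assumes `.tif` images with `.gt.txt` transcriptions. An image that is a valid file but contains no text pixels would need an image library to detect and is not covered here.

```python
from pathlib import Path


def find_empty_samples(directory, image_suffix=".tif"):
    """Return (zero-byte images, blank .gt.txt files) found in `directory`."""
    root = Path(directory)
    # Image files with a size of 0 bytes.
    empty_images = sorted(
        p.name for p in root.glob("*" + image_suffix) if p.stat().st_size == 0
    )
    # Transcriptions that contain nothing but whitespace.
    blank_texts = sorted(
        p.name
        for p in root.glob("*.gt.txt")
        if not p.read_text(encoding="utf-8").strip()
    )
    return empty_images, blank_texts


if __name__ == "__main__":
    # The directory name is an example; use your own ground-truth directory.
    images, texts = find_empty_samples("data/ocr-ground-truth")
    print("Zero-byte images:", images)
    print("Blank transcriptions:", texts)
```

Removing or regenerating the reported pairs before training is worth trying.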

Dev Solution

Oct 27, 2023, 3:32:38 PM
to tesseract-ocr

I just tried to run all of these commands, but I got an error: https://prnt.sc/lLHeR27J2U65

Zdenko Podobny

Oct 28, 2023, 9:58:10 AM
to tesser...@googlegroups.com
It does not work on Windows (directly), but it works on Linux => use WSL if you really need training.
Or wait until somebody finds a fix for Windows (or send the fix - this is an open source project, so everybody should contribute ;-) )

Zdenko



Dev Solution

Oct 28, 2023, 3:37:01 PM
to tesseract-ocr
Can I train on my custom images? I'm going to build a scanner for French receipts, so I need to train on all of these to increase accuracy. What do you suggest, Zdenop?

Zdenko Podobny

Mar 27, 2024, 4:54:41 PM
to tesser...@googlegroups.com
You can try custom images - see the example ocrd-testset.zip and follow the example from https://github.com/tesseract-ocr/tesstrain/blob/main/README.md :
unzip ocrd-testset.zip -d data/ocrd-ground-truth
make training MODEL_NAME=ocrd START_MODEL=deu_latf TESSDATA=~/tessdata_best MAX_ITERATIONS=10000

Zdenko

