Getting Error: No such file or directory: 'data/foo/all-lstmf'

Madhav Pandey

Apr 25, 2023, 6:22:01 AM

to tesseract-ocr

Hi Everyone,

I am relatively new to Tesseract and to OCR as a whole.

I have been trying to set up model training locally by following the guide https://github.com/tesseract-ocr/tesstrain/blob/main/README.md

I have copied the sample training data into the `data/foo` directory, but when I run `make training`, I always end up with this error:

```
Failed to read data from: data/foo/all-gt
Wrote unicharset file data/foo/unicharset
python3 shuffle.py 0 "data/foo/all-lstmf"
Traceback (most recent call last):
  File "shuffle.py", line 24, in <module>
    fd0 = open(sys.argv[2], 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'data/foo/all-lstmf'
make: *** [data/foo/all-lstmf] Error 1
```

Can someone please help resolve this error?

Thank you!

Zdenko Podobny

Apr 25, 2023, 6:27:28 AM

to tesser...@googlegroups.com

Did you install all the necessary dependencies?

Did you check and fix all the errors that appear before this one in the training output?

Zdenko

On Tue, Apr 25, 2023 at 8:21, Madhav Pandey <mad.dev...@gmail.com> wrote:

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/249216fc-70e5-4e40-a630-d4202fd24a36n%40googlegroups.com.

Madhav Pandey

Apr 26, 2023, 12:07:06 AM

to tesseract-ocr

@zdenop

This is the entire training output:

```
make training TESSDATA=./usr/local/share/tessdata
unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"

Failed to read data from: data/foo/all-gt
Wrote unicharset file data/foo/unicharset

PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.box"
set -x; \
tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif" data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
+ tesseract data/foo-ground-truth/alexis_ruhe01_1852_0087_027.tif data/foo-ground-truth/alexis_ruhe01_1852_0087_027 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.box"
set -x; \
tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif" data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
+ tesseract data/foo-ground-truth/alexis_ruhe01_1852_0018_022.tif data/foo-ground-truth/alexis_ruhe01_1852_0018_022 --psm 13 lstm.train
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif" -t "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.gt.txt" > "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.box"
set -x; \
tesseract "data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif" data/foo-ground-truth/alexis_ruhe01_1852_0035_019 --psm 13 lstm.train
+ tesseract data/foo-ground-truth/alexis_ruhe01_1852_0035_019.tif data/foo-ground-truth/alexis_ruhe01_1852_0035_019 --psm 13 lstm.train

python3 shuffle.py 0 "data/foo/all-lstmf"
Traceback (most recent call last):
  File "/Users/m/Code/git/tesstrain/shuffle.py", line 24, in <module>
    fd0 = open(sys.argv[2], 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'data/foo/all-lstmf'
make: *** [data/foo/all-lstmf] Error 1
```

For this run, I have just 3 pairs of text and .tif files.

I followed the macOS installation section of this page: https://tesseract-ocr.github.io/tessdoc/Compiling.html#macos and installed everything mentioned there.

Do I have to install anything else before running the training?

Zdenko Podobny

Apr 26, 2023, 3:47:55 PM

to tesser...@googlegroups.com

> make training TESSDATA=./usr/local/share/tessdata
> unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 2 "data/foo/all-gt"
>
> Failed to read data from: data/foo/all-gt....

This indicates that you have already run a training that failed.

Clean up your training output and start again. Pay attention to why "data/foo/all-gt" is not created (there will be an error message).
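One way to see where a failed run stopped, as a minimal Python sketch: the log in this thread shows `tesseract ... lstm.train` being run once per line image, and a later error message in this thread names a per-image `.lstmf` file in the ground-truth directory as a prerequisite of `all-lstmf`. The helper name `missing_lstmf` is hypothetical and not part of tesstrain; it only assumes that every `NAME.gt.txt` should end up with a `NAME.lstmf` next to it.

```python
from pathlib import Path


def missing_lstmf(ground_truth_dir):
    """Return base names of transcriptions that have no matching .lstmf file.

    Assumption (not taken from tesstrain itself): each NAME.gt.txt in the
    ground-truth directory should have a NAME.lstmf beside it after a
    successful run.
    """
    gt_dir = Path(ground_truth_dir)
    missing = []
    for gt_file in sorted(gt_dir.glob("*.gt.txt")):
        # Strip the double extension ".gt.txt" to get the base name.
        base = gt_file.name[: -len(".gt.txt")]
        if not (gt_dir / (base + ".lstmf")).is_file():
            missing.append(base)
    return missing
```

Calling `missing_lstmf("data/foo-ground-truth")` after a failed run would list the line images whose messages are worth reading first in the training output.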

Zdenko

On Wed, Apr 26, 2023 at 2:07, Madhav Pandey <mad.dev...@gmail.com> wrote:

Madhav Pandey

Jun 2, 2023, 5:39:01 AM

to tesseract-ocr

Hi Zdenko,

At what step in the Makefile is the all-gt file created? I am still unable to move forward with the custom model training.

Any help would be greatly appreciated. Thanks!

Madhav Pandey

Jun 5, 2023, 2:22:46 AM

to tesseract-ocr

Hi Zdenop,

Apologies. I got your name wrong in the thread.

Can you please help me resolve this issue? Because the `make training` command was not creating the all-gt file, I created it manually and placed it in the MODEL_NAME directory.

I created it by copying every single line from the text files into the all-gt file. I am not sure if this is the right approach. Please correct me if I am wrong here.

After doing this, I am now getting this error:

python3 shuffle.py 0 "data/Apex/all-lstmf"
Traceback (most recent call last):
  File "/Users/madpande/Code/git/tesseract_tutorial/tesstrain/shuffle.py", line 24, in <module>
    fd0 = open(sys.argv[2], 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'data/Apex/all-lstmf'

I am pretty sure I am missing something here. Please help!

Thanks!

Zdenko Podobny

Jun 6, 2023, 8:03:17 AM

to tesser...@googlegroups.com

Do not create files manually.

If "make training" does not work, it means:

1. you are missing some dependency, or your input data is wrong;
2. you also missed the error message for 1.

I strongly suggest that you start the training from the beginning (including cloning tesstrain) and pay attention to all messages:

git clone --depth 1 https://github.com/tesseract-ocr/tesstrain.git

cd tesstrain
make tesseract-langdata
mkdir tessdata_best
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata -P tessdata_best
unzip ocrd-testset.zip -d data/ocrd-ground-truth

make training MODEL_NAME=ocrd TESSDATA=tessdata_best MAX_ITERATIONS=10000
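Before re-running the steps above, a quick check for missing dependencies can help. The following is a minimal Python sketch, not part of tesstrain: the helper name `missing_tools` is hypothetical, and its default list contains only commands that appear in the logs of this thread, so a complete training run may well need more tools than these.

```python
import shutil


def missing_tools(tools=("make", "python3", "tesseract", "unicharset_extractor")):
    """Return the names from `tools` that cannot be found on PATH.

    The default list is an assumption based only on the commands visible
    in this thread; it is not an exhaustive dependency list.
    """
    return [name for name in tools if shutil.which(name) is None]
```

An empty result from `missing_tools()` only says that these commands are installed; it does not prove that the versions are suitable for training.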

Zdenko

On Mon, Jun 5, 2023 at 4:22, Madhav Pandey <mad.dev...@gmail.com> wrote:

Madhav Pandey

Jun 7, 2023, 10:29:22 PM

to tesseract-ocr

Thank you so much for this. It works on the default dataset provided.

However, when I try to work on Hindi text, I get the following error:

unicharset_extractor --output_unicharset "data/ocr/unicharset" --norm_mode 2 "data/ocr/all-gt"
Bad box coordinates in boxfile string! उसकी गाड़ी बड़ी थी और मेरी दाढ़ी
Extracting unicharset from plain text file data/ocr/all-gt
Wrote unicharset file data/ocr/unicharset
make: *** No rule to make target 'data/ocr-ground-truth/1.lstmf', needed by 'data/ocr/all-lstmf'. Stop.

Going through some of your responses on similar issues, I saw that you suggested checking the data format. Can you please specify the requirements for the ground-truth data?

I have a text file and a tiff file for each single-line text image. What are the other requirements?

Thanks!

Madhav Pandey

Jun 8, 2023, 1:31:24 AM

to tesseract-ocr

We were able to fix this issue.

Our training set contained files with the extension .tiff, but .tif is expected.
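For anyone hitting the same problem, the rename can be scripted. This is a minimal Python sketch under the assumption that all line images sit directly in one ground-truth directory; the helper name `rename_tiff_to_tif` is hypothetical, and it defaults to a dry run so nothing changes on disk until asked.

```python
from pathlib import Path


def rename_tiff_to_tif(ground_truth_dir, dry_run=True):
    """Rename *.tiff files to *.tif and return the (old, new) name pairs.

    With dry_run=True nothing is changed on disk. A file is skipped when
    its .tif target already exists, so no image is ever overwritten.
    """
    gt_dir = Path(ground_truth_dir)
    renamed = []
    for old in sorted(gt_dir.glob("*.tiff")):
        new = old.with_suffix(".tif")
        if new.exists():
            continue
        if not dry_run:
            old.rename(new)
        renamed.append((old.name, new.name))
    return renamed
```

Running `rename_tiff_to_tif("data/ocr-ground-truth")` first shows what would be renamed; passing `dry_run=False` performs the rename.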

But now we are seeing this error:

Compute CTC targets failed!

Do you know what might be happening here?

Thanks!

tesseract-ocr

Sep 19, 2023, 12:25:54 PM

to tesseract-ocr

>Compute CTC targets failed!

That error is due to empty image files.

The text2image script is riddled with bugs; it creates null boxes as well as null images.
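A rough way to look for such files, sketched in Python: this assumes that "empty" or "null" means a zero-byte file, which is only one possible reading. An image that contains pixels but no visible text would not be caught and would need an image library to detect. The helper name `find_empty_files` and the default suffix list are hypothetical.

```python
from pathlib import Path


def find_empty_files(ground_truth_dir, suffixes=(".tif", ".box", ".gt.txt")):
    """Return sorted names of zero-byte files ending with one of `suffixes`.

    Assumption: an "empty" image, box, or transcription file is one whose
    size on disk is exactly 0 bytes.
    """
    gt_dir = Path(ground_truth_dir)
    return sorted(
        path.name
        for path in gt_dir.iterdir()
        if path.is_file()
        and path.name.endswith(tuple(suffixes))
        and path.stat().st_size == 0
    )
```

Any file reported by `find_empty_files("data/ocr-ground-truth")` is a candidate to regenerate or to drop from the training set, together with its paired files.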

Dev Solution

Oct 27, 2023, 3:32:38 PM

to tesseract-ocr

I just tried to run all of these commands, but I got an error: https://prnt.sc/lLHeR27J2U65

Zdenko Podobny

Oct 28, 2023, 9:58:10 AM

to tesser...@googlegroups.com

It does not work on Windows (directly), but it works on Linux, so use WSL if you really need training.

Or wait until somebody finds a fix for Windows (or send the fix yourself; this is an open source project, so everybody should contribute ;-) )

Zdenko

On Fri, Oct 27, 2023 at 17:32, Dev Solution <develop...@gmail.com> wrote:

Dev Solution

Oct 28, 2023, 3:37:01 PM

to tesseract-ocr

Can I train on my custom images? I'm going to build a scanner for French receipts, so I need to train on all of these to increase accuracy. What do you suggest, Zdenop?

Zdenko Podobny

Mar 27, 2024, 4:54:41 PM

to tesser...@googlegroups.com

You can try custom images: see the example ocrd-testset.zip and follow the example from https://github.com/tesseract-ocr/tesstrain/blob/main/README.md :

unzip ocrd-testset.zip -d data/ocrd-ground-truth
make training MODEL_NAME=ocrd START_MODEL=deu_latf TESSDATA=~/tessdata_best MAX_ITERATIONS=10000

Zdenko

On Sat, Oct 28, 2023 at 17:37, Dev Solution <develop...@gmail.com> wrote:
