LSTM-based training produces .box files with the same coordinates

TRAN TRONG KHANH [Student] (Graduate School, Department of Computer Engineering)

Nov 1, 2023, 5:15:24 AM
to tesseract-ocr
Hi all,

I tried to run an example of LSTM training and used the following command:

for f in *.tif; do
    tesseract "$f" "${f%.*}" -l deu lstmbox
done


The resulting box files seem to use a single line-level box instead of character-level boxes: every character shares the same coordinates, width, and height. Is this a feature of Tesseract LSTM training? Thanks.

[Attachment: Untitled.png]

Zdenko Podobny

Nov 1, 2023, 7:21:37 AM
to tesser...@googlegroups.com
Are you following official tutorials? 
Did you read the documentation?
Have you tried to check the official training repository and provided examples?

Zdenko



Keith Smith

Nov 1, 2023, 7:42:56 AM
to tesseract-ocr
fyi, I asked the same question in https://groups.google.com/g/tesseract-ocr/c/9myrnSD0HKM

Dellu Bw

Nov 1, 2023, 7:57:46 AM
to tesser...@googlegroups.com
Are you trying to generate box files from the images (tif files)?

Des Bw

Nov 1, 2023, 8:02:28 AM
to tesseract-ocr

I don't know what you are trying to do, and I am not familiar with this method of box generation. But I think the command you are running is supposed to generate boxes with the same coordinates. Look at the example here: https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html

Des Bw

Nov 1, 2023, 8:21:36 AM
to tesseract-ocr
"Please note that box files generated using makebox config file are OK for training legacy models but not for LSTM training." makebox is the config file included with Tesseract for generating box files. It looks like it was meant for the legacy model. For the current model, text2image is the way to do it.

TRAN TRONG KHANH [Student] (Graduate School, Department of Computer Engineering)

Nov 1, 2023, 8:57:48 AM
to tesseract-ocr
Thank you for your responses. To answer my own question: according to the official documentation at https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html, the .box files generated for LSTM-based training have the same coordinates for every character because they use line-level boxes instead of character-level boxes.
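A quick way to confirm that a box file really is line-level is to count how many distinct coordinate sets it contains. The snippet below is only a sketch: it assumes every entry ends with `<left> <bottom> <right> <top> <page>`, and the helper name `count_distinct_boxes` and the sample file are made up for illustration.

```shell
# Count the distinct bounding boxes in a box file.
# Assumes each entry ends with: <left> <bottom> <right> <top> <page>.
# The helper name and sample file are illustrative, not part of Tesseract.
count_distinct_boxes() {
    awk 'NF >= 5 {
             # Key on the four coordinates; the last field is the page number.
             seen[$(NF-4) " " $(NF-3) " " $(NF-2) " " $(NF-1)] = 1
         }
         END {
             n = 0
             for (k in seen) n++
             print n
         }' "$1"
}

# Sample file: two characters plus the tab entry that ends the text line.
sample=$(mktemp)
printf 'A 0 0 100 30 0\nB 0 0 100 30 0\n\t 0 0 100 30 0\n' > "$sample"
count_distinct_boxes "$sample"
```

For the sample this prints 1. For a single-line image, a result of 1 means every entry shares one box, which is what line-level output looks like. A multi-line page would normally give one distinct box per text line.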
Also, I have a couple of concerns:
1) I'm working on license plate recognition and have 80K noisy car plate images. Most of the .box files generated by lstmbox are incorrect when compared with the ground-truth text. Manually editing all of these box files would be very time-consuming. Do you have any suggestions for shortening the time?
2) Do I need to manually check all 80K box files to ensure the accuracy of my training data?
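On the ground-truth point: if each image holds a single text line and its correct text is already known, one option is to write the line-level box entries directly from that text instead of correcting OCR output. The following is a rough sketch under those assumptions. `line_box_from_text` is a hypothetical helper, the width and height must be supplied by the caller, and ASCII text is assumed because awk implementations differ in how `substr` treats multi-byte characters.

```shell
# Print line-level box entries for one single-line image.
# Usage: line_box_from_text TEXT WIDTH HEIGHT
# Hypothetical helper; WIDTH and HEIGHT are the image size in pixels,
# which the caller has to look up separately.
line_box_from_text() {
    printf '%s\n' "$1" | awk -v w="$2" -v h="$3" '{
        n = length($0)
        # One entry per character, each carrying the whole-image box.
        for (i = 1; i <= n; i++)
            printf "%s 0 0 %d %d 0\n", substr($0, i, 1), w, h
        # A tab entry marks the end of the text line.
        printf "\t 0 0 %d %d 0\n", w, h
    }'
}

line_box_from_text "B AB123" 200 50
```

Redirecting the output to a file named after the image (for example `plate001.box` next to `plate001.tif`) gives one box file per plate. Whether whole-image boxes are good enough for noisy plates is something to verify on a small subset before processing all 80K images.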

Des Bw

Nov 1, 2023, 9:20:09 AM
to tesseract-ocr
 
1) Using synthetic data:
What can you do if you do not have data that is confirmed to be accurate?
The only workaround I know of is to use synthetic data. That is, you generate the images from the text using the text2image tool, and then train on those. The accuracy of the resulting model is not going to be perfect, because the actual data is messier than the synthetic data. But you can try different methods to get better accuracy:
(a) Fine-tuning from an existing network: that is, you can cut off the top layer of a working model and train from there.
(b) Configuring text2image to add noise to the synthetic data so that it resembles the actual images.
(c) Using a larger dataset.
etc.
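As a rough illustration of the synthetic-data route, a text2image call can be wrapped as below. This is a sketch only: `render_synthetic` is a hypothetical wrapper, and the file paths and font name are placeholders that have to match fonts actually installed on your system. The `--exposure` value varies how light or dark the rendering comes out, which is one way to move the output away from perfectly clean images.

```shell
# Render one training page from a text file with text2image.
# Usage: render_synthetic TEXT_FILE OUTPUT_BASE FONT_NAME FONTS_DIR [EXPOSURE]
# Hypothetical wrapper; all paths and the font name are placeholders.
render_synthetic() {
    text2image \
        --text="$1" \
        --outputbase="$2" \
        --font="$3" \
        --fonts_dir="$4" \
        --exposure="${5:-0}"
}

# Example call (requires text2image from the Tesseract training tools):
# render_synthetic plates.txt out/plates "DejaVu Sans" /usr/share/fonts
```

text2image writes the image and a matching box file next to each other under the given output base, so no separate box-generation pass is needed for the synthetic pages.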

2) The hOCR hack:
I haven't tried this method myself, but I read on GitHub that Shree has some kind of hack (a script) that uses Tesseract's hOCR output.
a) First, OCR the images with the standard model to hOCR format.
b) He then breaks the hOCR output down into box, tif, and text files.
c) He then compares the text files with the images and manually corrects the faulty ones.
This one also requires a lot of manual work, because the standard model will miss a lot of characters.

3) Alternatively, you can try other OCR engines such as EasyOCR. Some people say EasyOCR is better at reading those kinds of images, while Tesseract is better for scanned documents.

Des Bw

Nov 1, 2023, 9:21:57 AM
to tesseract-ocr
To clarify, Shree's script is useful in case your images are not single line. If they are all single line, that script won't do much for you. 

Durank

Nov 3, 2023, 1:57:40 PM
to tesseract-ocr
Could you please provide the link to download jTessBoxEditor for Windows 11?