Tesseract training ground truth: I'm confused about the box files

Mateusz Matela

Jul 10, 2024, 9:17:07 AM
to tesseract-ocr
Hi all,

Sorry if this is a double post; my previous message didn't appear, and I don't see any information about posts waiting for approval.
I was searching for this topic in this forum and it was mentioned a few times, but I couldn't find a clear and definitive explanation.

How does the information put in the .box files affect the training process? The file contains coordinates for each character in the txt file, but the documentation says that since Tesseract 4.0 the model operates on the level of whole lines. Some tools like text2image generate the .box files with accurate coordinates for each character. When the .box files are missing the tesstrain Makefile generates them using generate_line_box.py, which assigns the same full image area to each character.
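
To make the line-level variant concrete, here is a minimal Python sketch in which every character of the transcription receives the same rectangle covering the whole image. The function name `line_level_box` and the sample dimensions are hypothetical. It only illustrates the `symbol left bottom right top page` layout; it is not the code of generate_line_box.py, which may handle more cases (for example combining characters or an end-of-line entry).

```python
def line_level_box(text, width, height, page=0):
    """Return box-file lines that give every character the full image area.

    Box coordinates are `left bottom right top`, with the origin at the
    bottom-left corner of the image.
    """
    return [f"{char} 0 0 {width} {height} {page}" for char in text]


# Hypothetical 120x32 px line image whose transcription is "Hi!"
for entry in line_level_box("Hi!", width=120, height=32):
    print(entry)
```

A per-character file, as produced by text2image, has the same number of entries but a different rectangle on each line.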

I see 3 possible conclusions, which one is closest to the truth?

1. The .box files do not affect the LSTM training at all and are just a leftover from the times of Tesseract 3. In that case, ideally in the future they could be completely dropped or only required/generated when specifically working with the legacy engine.

2. There is still a chance that training will work better with exact coordinates and the generate_line_box.py is just a cheap workaround that could be improved on in the future.

3. The .box file is still important in case you prefer to define the coordinates for the text in the image instead of cropping the image. The granularity of the coordinates is not important, as Tesseract will just work on a box that encapsulates all of the character boxes. Even if confusing, this approach is still better than having different .box file formats for LSTM and the legacy engine.

I'll be grateful for any wisdom on this.

Thanks
Mateusz

Mateusz Matela

Jul 12, 2024, 8:14:50 AM
to tesseract-ocr
As an experiment, I ran the training on a small sample produced with text2image. Then I converted the .box files so that every character is assigned the common bounding rectangle of all the characters, and ran the training again. The outputs were identical in both cases. Then I removed the .box files and let the training script autogenerate them. In that case the reported error rates were crazy, like 99% instead of 0.5%.
This suggests that conclusion 3 is correct.
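
The conversion step in this experiment can be sketched roughly as follows. This is a reconstruction under assumptions, not the script actually used: it assumes one text line per box file, the `symbol left bottom right top page` layout, and the hypothetical helper names `parse_box_line` and `to_common_box`.

```python
def parse_box_line(line):
    """Split a box entry into (symbol, left, bottom, right, top, page).

    Splitting from the right keeps a symbol that is itself a space intact.
    """
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = map(int, numbers)
    return symbol, left, bottom, right, top, page


def to_common_box(lines):
    """Give every character the bounding rectangle of all characters."""
    entries = [parse_box_line(line) for line in lines if line.strip()]
    if not entries:
        return []
    # Union of all rectangles: smallest left/bottom, largest right/top.
    left = min(entry[1] for entry in entries)
    bottom = min(entry[2] for entry in entries)
    right = max(entry[3] for entry in entries)
    top = max(entry[4] for entry in entries)
    return [
        f"{entry[0]} {left} {bottom} {right} {top} {entry[5]}"
        for entry in entries
    ]


# Made-up per-character entries for the text "a b"
print("\n".join(to_common_box(["a 10 5 20 30 0", "  21 5 24 30 0", "b 25 4 35 28 0"])))
```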

Zdenko Podobny

Jul 14, 2024, 9:05:48 AM
to tesser...@googlegroups.com
Ehm:
  1. Tesseract v3 (legacy) engine training is based on characters.
  2. Tesseract LSTM engine (tesseract >=v4) training script is based on lines (group of words)
Box files reflect that. And yes - box files are important.


Zdenko


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b17225d5-2b78-41bd-994f-05305b9a443dn%40googlegroups.com.

Danny

Sep 3, 2024, 10:04:32 AM
to tesseract-ocr
@zdenop wrote:
| Tesseract LSTM engine (tesseract >=v4) training script is based on lines (group of words)
| Box files reflect that. And yes - box files are important.

Zdenko, does this mean a "box file" for LSTM training should wrap the entire text line and NOT the individual characters?
Which is correct for LSTM training:

A) Individual boxes like this: [image: sub_2.png], or
B) One box for the entire line: [image: sub_2 line.png]
Thanks.

Zdenko Podobny

Sep 5, 2024, 9:15:12 AM
to tesser...@googlegroups.com
Have a look at the provided example, ocrd-testset.zip.

Zdenko



Danny

Sep 5, 2024, 11:41:50 AM
to tesseract-ocr
Hi Zdenko,
Thanks for the response.  However, ocrd-testset.zip contains training images and ground truth text without boxes.

True, the images contain a full line of text:
alexis_ruhe01_1852_0099_012.png

But there are no box files in the training set.  

I'd like to confirm whether the LSTM training set's xxx.box file is expected to contain one box per line (wrapping the entire line) or one box per character in the line. Any insight?
Message has been deleted

Zdenko Podobny

Sep 5, 2024, 3:02:56 PM
to tesser...@googlegroups.com
What about reading tesstrain Readme and using the example data to understand the training process better?

Zdenko



Mateusz Matela

Sep 5, 2024, 3:15:05 PM
to tesseract-ocr
See my earlier reply: I ran an experiment and the training went exactly the same with both approaches (a separate box per character, or the same line box for all characters).

Mateusz

Tom Morris

Sep 6, 2024, 11:18:44 AM
to tesseract-ocr
That's weird. I posted an answer to this thread yesterday and now, in its place, Google Groups says "Message has been deleted." Let me try again...

says "lstmbox - Generated by tesseract using lstmbox config from image files - each char uses coordinates of its entire line. This format is also generated by the tesstrain makefile."

Tom
Message has been deleted

Danny

Sep 6, 2024, 9:55:15 PM
to tesseract-ocr
I think this Google Group is having technical trouble. I got an email about a new post from Menelik Berhan, but his message doesn't appear on the web. He said:


Same as what Tom said. Very helpful!

To summarize:
- Box files always contain one line per character
- There are two kinds of box files: per-character and per-line box files
- per-character box files have separate coordinates for each character
- per-line box files still have one line per character, but the coordinates are always the same and represent the bounding box of the entire text

The training code, specifically Tesseract::TrainFromBoxes(), should accept either format.
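
As a rough illustration of the summary above, the two kinds of box file can be told apart by checking whether every entry carries the same rectangle. This is a hypothetical helper (`box_file_kind` is not part of Tesseract or tesstrain), and it assumes the `symbol left bottom right top page` layout with one text line per file. Entries that carry a different rectangle for other reasons, such as a separate end-of-line marker if the file has one, would need to be filtered out first.

```python
def box_file_kind(lines):
    """Classify box entries as "per-line" or "per-character"."""
    rectangles = set()
    for line in lines:
        if not line.strip():
            continue
        # Split from the right so a symbol that is a space is preserved.
        _symbol, left, bottom, right, top, _page = line.rstrip("\n").rsplit(" ", 5)
        rectangles.add((int(left), int(bottom), int(right), int(top)))
    # A single shared rectangle means every character points at the whole line.
    return "per-line" if len(rectangles) <= 1 else "per-character"


# Made-up entries for the text "ab"
print(box_file_kind(["a 0 0 100 30 0", "b 0 0 100 30 0"]))
print(box_file_kind(["a 10 5 20 30 0", "b 25 4 35 28 0"]))
```

A file with a single character is reported as per-line, since the two formats are indistinguishable in that case.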

As mentioned in this and other posts, the box identification for Chinese seems to be quite broken. Like this:
Screenshot 2024-08-05 at 17.56.12.png

That might or might not be a training issue, but I will try retraining the model using per-line box files and see if that makes any difference.

Thanks to all.

Zdenko Podobny

Sep 7, 2024, 5:26:40 AM
to tesser...@googlegroups.com
tesstrain is a tested method to train/improve a Tesseract language model. It creates box files for you.
You can try your own ways, but your problems are your problems, and you should not expect somebody to adjust the code to your needs.
Of course, you are welcome to contribute your solution.

Zdenko

