Should box include surrounding space?

Danny Wilson

Oct 17, 2023, 10:29:13 PM
to tesser...@googlegroups.com
For purposes of training, I'm wondering if the box for a character should include the surrounding space.

In particular for the CJK "FULLWIDTH COMMA", should the box be the red or green rectangle?

PastedGraphic-2.png

Des Bw

Oct 18, 2023, 1:22:25 AM
to tesseract-ocr
If the surrounding space is included in the boxes across the board during training, the model might not recognize the comma when it appears without space around it (as in numbers: 23,334).

Danny

Oct 18, 2023, 5:35:01 AM
to tesseract-ocr
Several different "commas" are used in CJK text, which makes this complicated for me.

FULLWIDTH COMMA U+FF0C, whose glyph may sit in the center of the box or in the lower left corner, depending on the font:

Screenshot 2023-10-18 at 17.19.27.png commaFullWidth.jpg

HALFWIDTH IDEOGRAPHIC COMMA U+FF64, which (as far as I can tell) always sits in the bottom corner regardless of font. It is used to enumerate the items of a sequence.
Screenshot 2023-10-18 at 17.23.33.png

COMMA U+002C, which isn't formally part of CJK languages but in practice is used all the time:
Screenshot 2023-10-18 at 17.21.50.png

So I'd like to train the model to recognize all three types of comma, so that the OCR output matches the input images. "FULLWIDTH COMMA" is a problem because the position of the glyph within the box differs depending on the font. Hence my question: "where and how big is the box?"

Screenshot 2023-10-18 at 17.28.40.png

In the image above, lines 1, 2, and 3 are all FULLWIDTH COMMA, but line 1 uses a different font. Line 4 is COMMA (U+002C), while line 5 is HALFWIDTH IDEOGRAPHIC COMMA (U+FF64).
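For reference, the three code points can be inspected with Python's standard `unicodedata` module. This is a minimal sketch and not part of the original post; the helper name `describe_commas` is made up for illustration. It also shows that NFKC normalization folds FULLWIDTH COMMA into COMMA, so any normalization applied to ground-truth text may be worth checking.

```python
import unicodedata

# FULLWIDTH COMMA, HALFWIDTH IDEOGRAPHIC COMMA, COMMA
COMMAS = ["\uFF0C", "\uFF64", "\u002C"]


def describe_commas(chars=COMMAS):
    """Return (code point, name, East Asian width, NFKC form) per character."""
    rows = []
    for ch in chars:
        nfkc = unicodedata.normalize("NFKC", ch)
        rows.append((
            f"U+{ord(ch):04X}",
            unicodedata.name(ch),
            unicodedata.east_asian_width(ch),  # F = fullwidth, H = halfwidth, Na = narrow
            f"U+{ord(nfkc):04X}",              # code point after NFKC normalization
        ))
    return rows


if __name__ == "__main__":
    for row in describe_commas():
        print(*row, sep="\t")
```

If ground truth passes through NFKC at any point, the fullwidth and Latin commas become indistinguishable in the labels, independent of how the boxes are drawn.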

What's the best way to train given those types of input and the expected output?

Danny

Des Bw

Oct 18, 2023, 8:43:51 AM
to tesseract-ocr
You need a large amount of data. That is all.
If you can collect a lot of text lines that contain all those types of commas, and produce the training material with text2image (synthetic data) for each font, I am pretty sure Tesseract will learn all of them with no problem.
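A rough sketch of that per-font approach: the snippet below only builds one text2image command line per font and prints it, without running anything. The font names, the fonts directory, and the file names are placeholders chosen for illustration; `--text`, `--outputbase`, `--font`, and `--fonts_dir` are the text2image options assumed here.

```python
import shlex

# Hypothetical font list; substitute fonts that are actually installed.
FONTS = ["Noto Sans CJK SC", "Noto Serif CJK SC"]


def text2image_command(text_file, output_base, font, fonts_dir):
    """Build the argument list for a single text2image run."""
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]


def commands_for_fonts(text_file, fonts=FONTS, fonts_dir="/usr/share/fonts"):
    """Return one command per font, each with its own output base name."""
    commands = []
    for font in fonts:
        output_base = "commas." + font.replace(" ", "_")
        commands.append(text2image_command(text_file, output_base, font, fonts_dir))
    return commands


if __name__ == "__main__":
    # The text file is assumed to contain lines mixing all three comma types.
    for command in commands_for_fonts("commas.training_text"):
        print(shlex.join(command))
```

Each printed command could then be run from a shell or passed to `subprocess.run` once the placeholder names have been replaced.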

Des Bw

Oct 18, 2023, 8:45:20 AM
to tesseract-ocr
But if your only option is to edit the boxes manually, I have no experience with that. I have never tried that route.

Danny Wilson

Oct 18, 2023, 9:15:23 PM
to tesser...@googlegroups.com
Because of some issues with licensed fonts not working with text2image, we wrote our own image and box file generator in Swift on the Mac.

We use that to generate a data set of 100,000 text lines and feed it into the regular training on Linux.

Using a non-licensed font, I checked what box text2image generated for the FULLWIDTH COMMA (should've done that earlier!)

text2ImageOut.png

So it looks like text2image uses the top baseline for the box, which extends only as far down as the lowest extent of the glyph. Such a box would differentiate between FULLWIDTH COMMA and COMMA if the font vertically centers FULLWIDTH COMMA.

If the font renders FULLWIDTH COMMA on the text baseline, then the model would get confused between FULLWIDTH COMMA and COMMA since both are down on the baseline.

How does Tesseract handle the whitespace to the left and right of a character? Is there some kind of parameter to set, or would training with data containing both (baseline) FULLWIDTH COMMA and COMMA work?

Danny




Danny Wilson

Oct 19, 2023, 2:30:32 AM
to tesser...@googlegroups.com
Sorry, I had the coordinate system flipped on my last post.
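For anyone else writing their own box generator, the flip can be sketched as below. This assumes the usual mismatch: the rendering code reports rectangles with the origin at the top-left of the image (y growing downward), while a Tesseract box file line is `symbol left bottom right top page` with the origin at the bottom-left. The function name `to_box_line` and the sample numbers are illustrative only.

```python
def to_box_line(symbol, x, y, width, height, image_height, page=0):
    """Convert a top-left-origin rectangle into a Tesseract box file line.

    (x, y) is the top-left corner of the glyph rectangle in image
    coordinates, with y measured downward from the top of the image.
    """
    left = x
    right = x + width
    # Flip the vertical axis: box files measure y upward from the bottom edge.
    top = image_height - y
    bottom = image_height - (y + height)
    return f"{symbol} {left} {bottom} {right} {top} {page}"


if __name__ == "__main__":
    # A FULLWIDTH COMMA rectangle at (10, 30), 8 wide and 12 high, in a 48 px tall image.
    print(to_box_line("\uFF0C", 10, 30, 8, 12, image_height=48))
```

If the rendering API already uses a bottom-left origin, no flip is needed and the rectangle maps directly to left, bottom, right, top.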

Here is a correct image produced by text2image and includes both FULLWIDTH COMMA and COMMA.
testFile.png

For both types of comma, the boxes produced by text2image include only the boundaries of the glyph itself and do not take its vertical position into account.

I've trained using this type of ground truth, but when running the OCR the Latin COMMA is always output instead of the correct FULLWIDTH COMMA. That's wrong.

If the box only surrounds the glyph exactly, then I fear that no amount of training will enable the model to differentiate between the two types of comma.

Is there a way to tune the training process?

Or, alternatively, I could render the boxes for some special characters so that they extend from the text baseline. That would differentiate between the mid-line and baseline commas (but still would not help with fonts that place both the fullwidth and the normal comma on the baseline).
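The geometry of that idea can be sketched as follows. It only illustrates that two tight boxes of identical size become distinguishable once they are extended to a shared baseline; whether Tesseract's training actually makes use of the extra extent is not established in this thread. The `Box` type, the `extend_to_baseline` helper, and all coordinates are made up for illustration, using box-file conventions (origin at the bottom-left).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: int
    bottom: int
    right: int
    top: int

    @property
    def size(self):
        """Width and height of the box."""
        return (self.right - self.left, self.top - self.bottom)


def extend_to_baseline(glyph, baseline_y):
    """Return a box that covers the glyph and reaches the text baseline."""
    return Box(
        glyph.left,
        min(glyph.bottom, baseline_y),
        glyph.right,
        max(glyph.top, baseline_y),
    )


if __name__ == "__main__":
    baseline = 10
    centered_fullwidth = Box(40, 22, 48, 34)  # comma glyph floating mid-line
    baseline_comma = Box(40, 6, 48, 18)       # comma glyph straddling the baseline

    # Tight boxes have the same size, so they carry no positional information.
    print(centered_fullwidth.size, baseline_comma.size)

    # After extension, the mid-line comma gets a taller box; the other is unchanged.
    print(extend_to_baseline(centered_fullwidth, baseline).size)
    print(extend_to_baseline(baseline_comma, baseline).size)
```

As noted above, this does nothing for fonts that draw FULLWIDTH COMMA on the baseline, since both extended boxes would then look alike.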

Anyone have some experience with that?

Thanks
Danny