Should box include surrounding space?

Danny Wilson

Oct 17, 2023, 10:29:13 PM

to tesser...@googlegroups.com

For purposes of training, I'm wondering if the box for a character should include the surrounding space.

In particular for the CJK "FULLWIDTH COMMA", should the box be the red or green rectangle?

PastedGraphic-2.png

Des Bw

Oct 18, 2023, 1:22:25 AM

to tesseract-ocr

If the surrounding space is included in the boxes across the board, the model might not recognize the comma when it appears without any space around it (as in numbers: 23,334).

Danny

Oct 18, 2023, 5:35:01 AM

to tesseract-ocr

There are a few "commas" used in CJK which makes it complicated for me.

FULLWIDTH COMMA U+FF0C, which might have the glyph in the center of the box or in the lower left corner depending on the font.

HALFWIDTH IDEOGRAPHIC COMMA U+FF64, which (as far as I can tell) will always be in the bottom corner regardless of font (used to enumerate sequences).

COMMA U+002C, which isn't part of formal CJK languages but in practice is used all the time.

So I'd like to train the model to recognize the three types of commas so that the OCR output matches the input images. "FULLWIDTH COMMA" is a problem because the glyph position in the box differs depending on the font. Hence my question: "where and how big is the box?"
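As a quick way to keep the three code points apart when preparing ground truth, here is a minimal Python sketch using only the standard `unicodedata` module. The helper names `describe` and `comma_counts` are made up for illustration. It also shows that NFKC normalization folds FULLWIDTH COMMA into COMMA, so it may be worth checking that no step of a data pipeline normalizes the ground-truth text; the sketch does not claim that any particular tool does so.

```python
import unicodedata

# The three commas discussed in this thread.
COMMAS = ["\uFF0C", "\uFF64", "\u002C"]


def describe(ch):
    """Return basic Unicode properties for a single character."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "east_asian_width": unicodedata.east_asian_width(ch),
        # Code point the character maps to under NFKC normalization.
        "nfkc": f"U+{ord(unicodedata.normalize('NFKC', ch)):04X}",
    }


def comma_counts(text):
    """Count how often each of the three commas occurs in a text line."""
    return {unicodedata.name(ch): text.count(ch) for ch in COMMAS}


if __name__ == "__main__":
    for ch in COMMAS:
        print(describe(ch))
```

Running `comma_counts` over the ground-truth lines gives a rough idea of whether each comma type is represented often enough in the training text.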

In the image above, lines 1, 2, and 3 are all FULLWIDTH COMMA but line 1 is a different font. Line 4 is COMMA (U+002C) while line 5 is HALFWIDTH IDEOGRAPHIC COMMA U+FF64.

What's the best way to train given those types of input and the expected output?

Danny

Des Bw

Oct 18, 2023, 8:43:51 AM

to tesseract-ocr

You need a large amount of data. That is all.

If you can collect a lot of text lines that contain all those types of commas, and produce the training material using text2image (synthetic data) for each font, I am pretty sure Tesseract will learn all of them with no problem.
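A minimal sketch of the per-font generation step, assuming text2image is installed and the fonts are visible to it. The function only builds the command lines (it does not run them), and the file names, output directory, font name, and fonts directory are placeholders, not values from this thread.

```python
from pathlib import Path


def text2image_commands(text_file, fonts, out_dir, fonts_dir="/usr/share/fonts"):
    """Build one text2image command line per font (nothing is executed here)."""
    commands = []
    for font in fonts:
        # One output base per font, e.g. out/Noto_Sans_CJK_SC
        base = Path(out_dir) / font.replace(" ", "_")
        commands.append([
            "text2image",
            f"--text={text_file}",
            f"--outputbase={base}",
            f"--font={font}",
            f"--fonts_dir={fonts_dir}",
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical inputs; replace with your own text file and font names.
    for cmd in text2image_commands("lines.txt", ["Noto Sans CJK SC"], "out"):
        print(" ".join(cmd))
```

Each command list can be passed to `subprocess.run` once the paths have been checked.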

Des Bw

Oct 18, 2023, 8:45:20 AM

to tesseract-ocr

But if your only option is to edit the boxes manually, I really have no knowledge of that. I have never tried that route.

Danny Wilson

Oct 18, 2023, 9:15:23 PM

to tesser...@googlegroups.com

Because of some issues with licensed fonts not working with text2image, we wrote our own image and box file generator in Swift on the Mac.

We use that to generate a data set of 100,000 text lines and feed that into the regular training on Linux.

Using a non-licensed font, I checked what box text2image generated for the FULLWIDTH COMMA (should've done that earlier!)

So it looks like text2image uses the top base line for the box, which extends only as far down as the lowest extent of the glyph. Such a box would differentiate between FULLWIDTH COMMA and COMMA if the font vertically centers FULLWIDTH COMMA.

If the font renders FULLWIDTH COMMA on the text baseline, then the model would get confused between FULLWIDTH COMMA and COMMA since both are down on the baseline.

How does Tesseract handle the whitespace to the left and right of a character? Is there some kind of parameter to set, or would training with data containing both (baseline) FULLWIDTH COMMA and COMMA work?

Danny


Danny Wilson

Oct 19, 2023, 2:30:32 AM

to tesser...@googlegroups.com

Sorry, I had the coordinate system flipped on my last post.

Here is a correct image, produced by text2image, that includes both FULLWIDTH COMMA and COMMA.

For both types of comma, the boxes produced by text2image include only the boundaries of the glyph itself and do not take the vertical position within the line into account.
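Since the coordinate system is easy to get flipped, here is a small sketch for reading a character-level box line and converting it to image coordinates. It assumes the usual box file layout `symbol left bottom right top page`, with the origin at the bottom-left corner of the image. The `Box` class, the helper names, and the sample numbers are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Box:
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one 'symbol left bottom right top page' line of a box file."""
    # Split from the right so that the symbol itself may be a space.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def to_top_left(box, image_height):
    """Return (x, y, width, height) with y measured from the top of the image."""
    return (
        box.left,
        image_height - box.top,
        box.right - box.left,
        box.top - box.bottom,
    )


if __name__ == "__main__":
    box = parse_box_line("\uFF0C 120 14 132 30 0")  # made-up coordinates
    print(box, to_top_left(box, image_height=48))
```

Comparing the converted rectangle against the rendered image is a quick check that a custom generator and text2image agree on where the box sits.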

I've trained using this type of ground truth, but when running OCR the Latin COMMA is always output instead of the correct FULLWIDTH COMMA. That's wrong.
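To put a number on this kind of substitution, a rough sketch like the following can compare a ground-truth line with the OCR output and count character pairs that involve one of the three commas. It uses `difflib` from the standard library and only looks at equal-length replacements, so it is an approximation rather than a full alignment; the function name is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

# FULLWIDTH COMMA, HALFWIDTH IDEOGRAPHIC COMMA, COMMA
COMMAS = {"\uFF0C", "\uFF64", "\u002C"}


def comma_confusions(truth, ocr):
    """Count (expected, recognized) pairs where a comma was substituted."""
    confusions = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements can be paired character by character.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for expected, got in zip(truth[i1:i2], ocr[j1:j2]):
                if expected in COMMAS or got in COMMAS:
                    confusions[(expected, got)] += 1
    return confusions


if __name__ == "__main__":
    print(comma_confusions("a\uFF0Cb", "a,b"))
```

Summing the counters over an evaluation set shows whether FULLWIDTH COMMA is always replaced by COMMA or only for certain fonts.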

If the box only surrounds the glyph exactly, then I fear that no amount of training will enable the model to differentiate between the two types of comma.

Is there a way to tune the training process?

Or... I could instead render the boxes for some special characters so that they extend up from the text baseline, which would differentiate between the mid-line and baseline commas (but still not help with fonts that place both the fullwidth and the normal comma on the baseline...)
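The box-extension idea could be sketched as follows, assuming boxes are `(left, bottom, right, top)` tuples in box-file coordinates (origin at the bottom-left) and that the generator knows the vertical extent of the text line. The function name and the numbers are hypothetical, and whether the trainer makes any use of the taller box is exactly the open question here, so this is something to experiment with rather than a known fix.

```python
def extend_box_to_line(box, line_bottom, line_top):
    """Stretch a glyph box vertically so that it spans the whole text line.

    The horizontal extent is left unchanged. The box is never shrunk, so a
    glyph that dips below line_bottom keeps its own bottom edge.
    """
    left, bottom, right, top = box
    return (left, min(bottom, line_bottom), right, max(top, line_top))


if __name__ == "__main__":
    # Made-up numbers: a vertically centered comma in a line spanning y=10..40.
    print(extend_box_to_line((120, 22, 132, 30), line_bottom=10, line_top=40))
```

With line-height boxes, a centered comma and a baseline comma would share the same box height but differ in where the ink sits inside it.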

Anyone have some experience with that?

Thanks

Danny
