Because of some issues with licensed fonts not working with text2image, we wrote our own image and box file generator in Swift on the Mac.
We use that to generate a data set of 100,000 text lines and feed that into the regular training on Linux.
Using a non-licensed font, I checked what box text2image generated for the FULLWIDTH COMMA (should've done that earlier!).
So it looks like text2image uses the top of the text line as the top of the box, and the box extends only as far down as the lowest extent of the glyph. Such a box would differentiate between FULLWIDTH COMMA and COMMA if the font vertically centers FULLWIDTH COMMA.
If the font renders FULLWIDTH COMMA on the text baseline, then the model would get confused between FULLWIDTH COMMA and COMMA since both are down on the baseline.
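To make the geometry concrete, here is a minimal Python sketch of the box rule as I understand it from the observation above (top of the box at the top of the text line, bottom at the lowest ink of the glyph). The class and function names, the image height, and all pixel values are made up for illustration; they are not taken from text2image or from our generator. The only external fact it relies on is the box file line layout, `<char> <left> <bottom> <right> <top> <page>`, with the origin at the bottom-left of the image.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical glyph measurements in image coordinates (y grows downward)."""
    char: str
    left: int          # leftmost ink column
    right: int         # rightmost ink column
    glyph_bottom: int  # lowest ink row


def box_line(g: Glyph, line_top: int, image_height: int, page: int = 0) -> str:
    """Format one box file line; y is flipped because box files use a bottom-left origin."""
    bottom = image_height - g.glyph_bottom
    top = image_height - line_top
    return f"{g.char} {g.left} {bottom} {g.right} {top} {page}"


def box_height(g: Glyph, line_top: int) -> int:
    """Height of a box that runs from the line top down to the lowest ink of the glyph."""
    return g.glyph_bottom - line_top


# Assumed layout: 48 px tall image, text line starts at y=8.
IMAGE_HEIGHT = 48
LINE_TOP = 8

comma = Glyph(",", left=10, right=16, glyph_bottom=44)
fullwidth_centered = Glyph("\uFF0C", left=30, right=36, glyph_bottom=30)
fullwidth_baseline = Glyph("\uFF0C", left=30, right=36, glyph_bottom=44)

for g in (comma, fullwidth_centered, fullwidth_baseline):
    print(box_line(g, LINE_TOP, IMAGE_HEIGHT), "height:", box_height(g, LINE_TOP))
```

With these assumed numbers, the vertically centered FULLWIDTH COMMA gets a shorter box than COMMA, while the baseline-rendered FULLWIDTH COMMA gets a box of exactly the same height as COMMA. Since the box only covers the ink, the extra whitespace around the fullwidth glyph does not show up in it either.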
How does tesseract handle the whitespace to the left/right of a character? Is there some kind of parameter to set, or would training with data containing both (baseline) FULLWIDTH COMMA and COMMA work?
Danny