I think this Google group is having technical trouble. I got an email about a new post from Menelik Berhan, but his message doesn't appear on the web. He said:
Same as what Tom said. Very helpful!
To summarize:
- Box files always contain one line per character
- There are two kinds of box files: per-character and per-line box files
- Per-character box files have separate coordinates for each character
- Per-line box files still have one line per character, but the coordinates are the same on every line and represent the bounding box of the entire line of text
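To make the difference concrete, here is a rough Python sketch that parses box-file text and guesses which of the two kinds it is. It assumes the usual box-file layout of `symbol left bottom right top page` on each line. The names `BoxEntry`, `parse_box_text` and `classify_box_text` are made up for this example and are not part of Tesseract, and the sample coordinates are invented.

```python
from typing import List, NamedTuple


class BoxEntry(NamedTuple):
    """One line of a box file (hypothetical helper type, not a Tesseract API)."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_text(text: str) -> List[BoxEntry]:
    """Parse box-file text, assuming 'symbol left bottom right top page' per line."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split from the right so a symbol that is itself a space or tab survives.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            raise ValueError(f"unexpected box line: {line!r}")
        symbol = parts[0]
        left, bottom, right, top, page = (int(p) for p in parts[1:])
        entries.append(BoxEntry(symbol, left, bottom, right, top, page))
    return entries


def classify_box_text(text: str) -> str:
    """Heuristic: if several entries share one box, treat the file as per-line."""
    entries = parse_box_text(text)
    boxes = [(e.left, e.bottom, e.right, e.top, e.page) for e in entries]
    # A file with a single entry is ambiguous; it is reported as per-character.
    if len(boxes) >= 2 and len(set(boxes)) < len(boxes):
        return "per-line"
    return "per-character"


# Invented sample data: the same three characters in both styles.
PER_CHARACTER_SAMPLE = "T 10 20 30 60 0\no 32 20 50 45 0\nm 52 20 80 45 0\n"
PER_LINE_SAMPLE = "T 10 20 80 60 0\no 10 20 80 60 0\nm 10 20 80 60 0\n"

if __name__ == "__main__":
    print(classify_box_text(PER_CHARACTER_SAMPLE))
    print(classify_box_text(PER_LINE_SAMPLE))
```

Both samples have one line per character. Only the coordinates differ: in the second sample every line carries the box of the whole text line. The duplicate-box check is only a heuristic and has not been tested against real training data.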
The training code, specifically Tesseract::TrainFromBoxes(), should accept either format.
As mentioned in this and other posts, the box identification for Chinese seems to be quite broken. Like this: [image not preserved]
That might or might not be a training issue, but I will try retraining the model using per-line box files and see if that makes any difference.
Thanks to all.