I have a document with one page per 600 dpi TIFF file. The original document was created 50 years ago, using a typewriter.
I want each recognized character and its position on the page, so I use makebox:
tesseract page.tiff box/page makebox
I get from this a file, box/page.box, which I can then use to outline the location of the boxes in my original file.
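For reference, each line of the box file holds the character followed by left, bottom, right, top and a page number, with the origin at the bottom-left corner of the image. Here is a minimal sketch of how one such line can be read and flipped into top-left image coordinates. The names `Box` and `parse_box_line` are my own, not part of tesseract, and the image height has to be supplied by the caller.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One character box in image coordinates (origin at top-left)."""
    char: str
    left: int
    top: int
    right: int
    bottom: int


def parse_box_line(line: str, image_height: int) -> Box:
    """Parse one line of a .box file: 'char left bottom right top page'.

    Box files measure y from the bottom of the page, so the y values
    are flipped here to match the usual top-left image convention.
    """
    # Split from the right so the five numeric fields are always found.
    char, left, bottom, right, top, _page = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top = int(left), int(bottom), int(right), int(top)
    return Box(char, left, image_height - top, right, image_height - bottom)


if __name__ == "__main__":
    # Hypothetical line and page height, for illustration only.
    print(parse_box_line("a 10 20 30 40 0", image_height=100))
```

The flip matters when drawing: without it the red outlines land mirrored vertically relative to the glyphs.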
The boxes I get are shown below. (The original TIFF file has only two colors, black and white. In the image, I color the "recognized" bits, meaning those that fall inside a box, blue and draw a one-pixel red outline around each box defined by tesseract.)

The first line looks really good, but the second line shows "a" and "f" each in their own box, and then another box that includes both of them (but not all of the "a").
The same sort of thing happens later with the "h" and "i": each is in its own box, and then another box is thrown over the two of them. Then "nd" ends up in one box instead of two separate boxes, even though there is a clear gutter between the letters.
In another case, we get much more complex overlapping boxes:

The first "f" is not recognized at all, while the second one is split into 3 boxes -- two abut, but one is a really small box inside the top one. The "r" on the top line is two different boxes, as is the later "o", although the thickness of the interior red line for the "o" suggests it is actually 3 boxes, one being only one or two pixels wide. The "d" and "e" and "a" on the second line are each a bunch of overlapping boxes.
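To list these cases without eyeballing the image, something like the following sketch can report which boxes overlap or nest. It assumes boxes are `(left, top, right, bottom)` tuples in pixel coordinates; the function names are mine. Boxes that merely abut along an edge are deliberately not counted as overlapping.

```python
from itertools import combinations


def overlaps(a, b):
    """True if boxes a and b share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def contains(outer, inner):
    """True if inner lies entirely within outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def overlapping_pairs(boxes):
    """Return index pairs (i, j), i < j, of boxes that overlap."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(boxes), 2)
            if overlaps(a, b)]


if __name__ == "__main__":
    # Made-up boxes: two abutting, one straddling both, one tiny nested box.
    boxes = [(0, 0, 10, 10), (10, 0, 20, 10), (5, 5, 15, 15), (2, 2, 4, 4)]
    for i, j in overlapping_pairs(boxes):
        nested = contains(boxes[i], boxes[j]) or contains(boxes[j], boxes[i])
        print(i, j, "nested" if nested else "partial overlap")
```

The pairwise check is quadratic in the number of boxes, which should be fine for a single typewritten page.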
My theory of how to segment the image would be to create a set of non-overlapping boxes and make sure that each black bit (or at least each cluster of black bits) is in one and only one box. Clearly this is not what tesseract does, since some bits end up in many different boxes and other bits are in no box at all. I also have a table of contents page where everything is recognized (not necessarily correctly, but at least it identifies the bits) except the column of page numbers, which is ignored completely, but that's probably a different problem.
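To make the segmentation theory concrete, here is a rough sketch of what I have in mind: label each cluster of connected black bits and give it exactly one bounding box. This is my own illustration, not how tesseract works. It assumes the page has already been loaded as a list of rows of 0/1 values (1 meaning black), and it uses 8-connectivity. Bounding boxes of interlocking clusters can still overlap, and a letter like "i" comes out as two clusters, so this is only a starting point.

```python
from collections import deque


def component_boxes(bitmap):
    """Return one (left, top, right, bottom) box per cluster of black bits.

    bitmap is a list of rows of 0/1 values; right and bottom are exclusive.
    Every black bit belongs to exactly one cluster, hence to one box's cluster.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not bitmap[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited black bit.
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                # Visit the 8 neighbours, clipped to the image bounds.
                for ny in range(max(cy - 1, 0), min(cy + 2, height)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, width)):
                        if bitmap[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right + 1, bottom + 1))
    return boxes


if __name__ == "__main__":
    # Tiny made-up bitmap with two separate clusters.
    page = [
        [1, 1, 0, 0, 0],
        [1, 0, 0, 0, 1],
        [0, 0, 0, 1, 1],
    ]
    print(component_boxes(page))
```

On a 600 dpi page a pure-Python flood fill like this will be slow; it is meant to show the idea, not to be the final tool.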

The first problem is "How do I get non-overlapping boxes?"