Thank you for the zoom buttons, they make it a lot easier to work with! Please forgive the delay in responding (had a bit of a computer issue; I was able to save all my data, but it took a couple of days longer than expected to get everything set up again). This will be a photo-heavy post, since it's easier to show what I'm working with and the problems I'm having.
I have 3 main types of pages getting OCR'd: new chapter pages, "normal" pages with images, and pages with images that include drop caps. Some pages (other than the new chapter pages, and only because their text size is larger overall) also have lighting issues, so the thresholding isn't always able to get a good "picture" of every glyph. I am including images here showing each page type. Note that these aren't the images actually being fed to tesseract; they are jpg conversions with all processing applied except for the mixed raster splitting and the thresholding algorithm. (Especially for the new chapters, I want to maintain the exact appearance, and I'm having issues with the splitting not handling the background correctly and with tesseract missing more characters than my examples show.)
First, a new chapter page
Next, a drop cap:
And a "normal" page:
Two special cases: This one has drop caps as a label for a step
And this one (which, honestly, could be left as just an image with no text layer, but it shows most of the recognition problems I'm having):
As you can see from this page, the lighting is a little less than even, and I cannot rescan right now (the ideal solution for at least some of the problems). Running just the foreground/text layer (attached, not embedded, if you want to take a look at the results yourself) with OEM 1, the default PSM, and English (this is the only page that has any language other than English or jargon, and I'm trying to do the tesseract runs in a script), I get the following (loaded into ScribeOCR for easier comparison). I did not do this run with hocr_char_boxes set; setting it does also help *some*, at least in OEM 0.
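For reference, a scripted run with those settings could look roughly like the sketch below. The helper name `build_tesseract_cmd` and the file names are placeholders made up for this post, not my actual script. The flags themselves (`--oem`, `--psm`, `-l`, `-c hocr_char_boxes=1`, and the `hocr` config name at the end) are standard tesseract command-line options.

```python
import subprocess


def build_tesseract_cmd(image, out_base, lang="eng", oem=1, psm=None, char_boxes=False):
    """Build the argument list for one tesseract run that writes hOCR output.

    image and out_base are hypothetical paths; tesseract appends ".hocr"
    to out_base on its own.
    """
    cmd = ["tesseract", str(image), str(out_base), "--oem", str(oem), "-l", lang]
    if psm is not None:
        # Leaving psm as None keeps tesseract's default page segmentation mode.
        cmd += ["--psm", str(psm)]
    if char_boxes:
        # Ask for per-character boxes in the hOCR output.
        cmd += ["-c", "hocr_char_boxes=1"]
    # Config file names go last on the tesseract command line.
    cmd.append("hocr")
    return cmd


def run_page(image, out_base, **options):
    """Run tesseract on one page; raises CalledProcessError if it fails."""
    subprocess.run(build_tesseract_cmd(image, out_base, **options), check=True)
```

Calling `run_page("page_text_layer.tif", "page_text_layer")` would match the run described above (OEM 1, default PSM, English), and passing `psm=6` or `char_boxes=True` makes it easy to compare variations on the same page.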
Zoomed in on the chart to show the primary concerns I have -
As you can see, the fractions get misidentified or skipped altogether (the first 6 entries in the first column are 1/8, 1/4, 3/8, 1/2, 5/8, 3/4). So, I've got the best image I can get right now, I've run OCR, and I'm manually reviewing. Other than maybe trying a different PSM, I now need to fix things manually. Characters that can be typed directly are easy to either add to the hocr file directly, or to bring into Scribe (or some other review assist software) and fix with the edit tools there. Unicode/ASCII special characters are a bit harder (I've been using special character insert functions and then copying and pasting the special character, at least until I can remember how to directly type the ASCII code for them), but they do get in, at least where some character was already identified. When a "word" is completely ignored, or with the drop caps, I haven't yet figured out how to get the missing character(s) in.
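One way around the copy and paste step for the fractions might be a small lookup like the sketch below. It assumes the text layer should use the single-glyph Unicode fractions (⅛, ¼, and so on) rather than typed text like "1/8", which may not be what everyone wants. The names `VULGAR`, `to_vulgar`, and `codepoint` are made up for this example.

```python
import re

# Typed fraction -> single Unicode "vulgar fraction" character.
VULGAR = {
    "1/8": "\u215b",  # ⅛
    "1/4": "\u00bc",  # ¼
    "3/8": "\u215c",  # ⅜
    "1/2": "\u00bd",  # ½
    "5/8": "\u215d",  # ⅝
    "3/4": "\u00be",  # ¾
    "7/8": "\u215e",  # ⅞
}

# A lone digit/digit pair, not part of a longer number such as 11/8 or 1/16.
_FRACTION = re.compile(r"(?<![\d/])([1357])/([248])(?![\d/])")


def to_vulgar(text):
    """Replace typed fractions found in VULGAR with their single-glyph form."""
    return _FRACTION.sub(lambda m: VULGAR.get(m.group(0), m.group(0)), text)


def codepoint(ch):
    """Return the U+XXXX label for a character, handy for direct-typing it."""
    return f"U+{ord(ch):04X}"
```

With this, typing `1/8` in a correction and passing it through `to_vulgar` gives the glyph directly, and `codepoint` reports the code to remember for direct entry. It only helps where a word already exists to correct; it does nothing for words tesseract skipped entirely.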