Hi Alex:
Tesseract has a number of command-line switches. You can see which ones the islandora_ocr module currently uses by looking at this line:
https://github.com/Islandora/islandora_ocr/blob/7.x/includes/derivatives.inc#L26
If you run tesseract from the command line yourself you'll see the options it provides; I've posted them below for reference. The key switch you're looking for is -psm, which sets the page segmentation mode. There are a variety of values, and -psm 1 may be what you need, since it adds orientation and script detection (OSD) to the automatic page segmentation. Note that the usage text lists -psm 3 (no OSD) as the default, so getting OCR/HOCR output does not by itself mean OSD is on, unless the module passes the switch itself. I'd do some local testing with the files you've got.
It might be useful to add the -psm switch to the admin panel for the islandora_ocr module. I'm not sure, though, as it would add time to the OCR process while the software rotates the image and tests whether it can be OCRed.
Hope that helps.
Donald
Usage:
  tesseract imagename|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir /path  specify location of tessdata path
  -l lang[+lang]        specify language(s) used for OCR
  -c configvar=value    set value for control parameter.
                        Multiple -c arguments are allowed.
  -psm pagesegmode      specify page segmentation mode.
These options must occur before any configfile.

pagesegmode values are:
   0 = Orientation and script detection (OSD) only.
   1 = Automatic page segmentation with OSD.
   2 = Automatic page segmentation, but no OSD, or OCR
   3 = Fully automatic page segmentation, but no OSD. (Default)
   4 = Assume a single column of text of variable sizes.
   5 = Assume a single uniform block of vertically aligned text.
   6 = Assume a single uniform block of text.
   7 = Treat the image as a single text line.
   8 = Treat the image as a single word.
   9 = Treat the image as a single word in a circle.
  10 = Treat the image as a single character.

Single options:
  -v --version: version info
  --list-langs: list available languages for tesseract engine. Can be used with --tessdata-dir.
  --print-parameters: print tesseract parameters to the stdout.
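
For the local testing suggested above, here is a rough Python sketch that runs the same image through -psm 1 and -psm 3 and times each run, which should give a feel for how much the OSD step costs. Everything named here is my own invention for illustration (build_command, compare_modes, the "eng" language, the output directory); it is not how islandora_ocr invokes tesseract. It assumes tesseract is on your PATH and that the "hocr" config file ships with your install. The flag is spelled -psm as in the usage text above; newer Tesseract releases spell it --psm, so that is left as a parameter.

```python
import shutil
import subprocess
import time
from pathlib import Path

# Modes to compare, with descriptions copied from the usage text above.
PSM_MODES = {
    1: "Automatic page segmentation with OSD",
    3: "Fully automatic page segmentation, but no OSD (default)",
}


def build_command(image, output_base, psm, lang="eng",
                  psm_flag="-psm", configs=("hocr",)):
    """Build a tesseract argv list (hypothetical helper).

    Options come before config files, as the usage text requires.
    lang="eng" and configs=("hocr",) are assumptions, not module settings.
    """
    return ["tesseract", str(image), str(output_base),
            "-l", lang, psm_flag, str(psm), *configs]


def compare_modes(image, out_dir, modes=PSM_MODES):
    """Run tesseract once per mode and return {psm: seconds elapsed}."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    timings = {}
    for psm in modes:
        # One output base per mode so the results can be compared afterwards.
        base = out_dir / f"{Path(image).stem}_psm{psm}"
        start = time.perf_counter()
        subprocess.run(build_command(image, base, psm), check=True)
        timings[psm] = time.perf_counter() - start
    return timings
```

Calling compare_modes("page.tif", "psm_test") on one of your own scans (the file name is just a placeholder) leaves one output file per mode in psm_test/ that you can compare side by side, along with the time each run took.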