Sideways pages/OCR

Alex Kent

Aug 26, 2015, 4:02:02 PM

to islandora

We have a case where we are using the Book SP and the book objects have several pages that were originally scanned sideways. The main issue is that if we leave them sideways, rather than rotating them to the usual vertical orientation, OCR will not work on them. Has anyone dealt with sideways pages for book objects and gotten OCR working?

We are on 1.4 and Drupal 7.

Thanks in advance.

Mark Jordan

Aug 26, 2015, 4:07:38 PM

to isla...@googlegroups.com

Alex,

If the number of sideways pages is low, you could rotate copies of the TIFFs to the correct reading orientation, OCR them outside of Islandora, and then add the OCR datastream to the page objects. You wouldn't get in-image keyword highlighting, but ordinary searches should work fine.
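
As a rough sketch of the rotate-and-OCR step, assuming ImageMagick's `convert` and `tesseract` are both installed, something like the following builds the two commands for one page. The file name, the 90-degree angle, and the `rotated` working directory are made up for illustration. It only prints the commands; attaching the resulting text to the page object as the OCR datastream is left out.

```python
import shlex
from pathlib import Path


def rotate_and_ocr_commands(tiff_path, degrees=90, work_dir="rotated"):
    """Return (rotate_cmd, ocr_cmd) as argument lists for one sideways page.

    Assumptions: ImageMagick's `convert` and `tesseract` are on PATH,
    `work_dir` already exists, and `degrees` is the clockwise rotation
    needed to bring the page to reading orientation.
    """
    src = Path(tiff_path)
    rotated = Path(work_dir) / src.name      # rotated copy; the original is untouched
    output_base = Path(work_dir) / src.stem  # tesseract adds the .txt extension itself
    rotate_cmd = ["convert", str(src), "-rotate", str(degrees), str(rotated)]
    ocr_cmd = ["tesseract", str(rotated), str(output_base)]
    return rotate_cmd, ocr_cmd


# Hypothetical page file, rotated 90 degrees clockwise.
rotate_cmd, ocr_cmd = rotate_and_ocr_commands("page_0012.tif", degrees=90)
print(shlex.join(rotate_cmd))
print(shlex.join(ocr_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the commands look right for your files.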

Mark

--
For more information about using this group, please read our Listserv Guidelines: http://islandora.ca/content/welcome-islandora-listserv
---
You received this message because you are subscribed to the Google Groups "islandora" group.
To unsubscribe from this group and stop receiving emails from it, send an email to islandora+...@googlegroups.com.
Visit this group at http://groups.google.com/group/islandora.
To view this discussion on the web visit https://groups.google.com/d/msgid/islandora/ed52be19-6e4b-433a-af43-f11414810393%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Alex Kent

Aug 26, 2015, 4:09:42 PM

to islandora

I think the number of sideways pages would be higher. Would there be any way to get the OCR working for the sideways pages, without doing any rotation of them?

Thanks Mark.

Mark Jordan

Aug 26, 2015, 4:23:25 PM

to isla...@googlegroups.com

Hi Alex,

I'm sorry, I know almost nothing about tesseract, but there is a JIRA issue that says tesseract 3.01 has a switch to autodetect page orientation:

https://jira.duraspace.org/browse/ISLANDORA-440

It looks like the issue was closed with a documentation fix, but maybe it could be reopened as an improvement ticket against Islandora OCR.

Mark

Alex Kent

Aug 27, 2015, 2:49:36 PM

to islandora

Thanks Mark. As it turns out, HOCR takes care of the problem: it was able to "capture" the text from the sideways pages. So I think we are good.

Donald Moses

Aug 28, 2015, 8:22:40 AM

to islandora

Hi Alex:

tesseract has a number of command line switches. You can see what the islandora_ocr module is currently using by having a look at this line:
https://github.com/Islandora/islandora_ocr/blob/7.x/includes/derivatives.inc#L26
If you run tesseract from the command line yourself, you'll see the options it provides. I've posted them below for reference. The key switch you're looking for is -psm. It takes a variety of values, and -psm 1 may be what you need (or maybe it is already in effect if you are getting OCR/HOCR). I'd do some local testing with the files you've got. Would it be useful to add the -psm switch to the admin panel for the islandora_ocr module? I'm not sure, as it would add time to the OCR process while the software rotates each page and tests whether or not it is OCRable.

Hope that helps.

Donald

Usage:
tesseract imagename|stdin outputbase|stdout [options...] [configfile...]

OCR options:
--tessdata-dir /path    specify location of tessdata path
-l lang[+lang]    specify language(s) used for OCR
-c configvar=value    set value for control parameter.
            Multiple -c arguments are allowed.
-psm pagesegmode    specify page segmentation mode.
These options must occur before any configfile.

pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.

Single options:
-v --version: version info
--list-langs: list available languages for tesseract engine. Can be used with --tessdata-dir.
--print-parameters: print tesseract parameters to the stdout.
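
For the local testing, a small sketch like the one below builds comparable commands for the default mode and for -psm 1, keeping the options ahead of any config file as the usage text requires. The helper name `tesseract_command`, the file names, and the use of the `hocr` config file are assumptions for illustration, not what islandora_ocr runs. As far as I know, newer tesseract releases (4.x and up) spell the switch `--psm`, hence the `psm_flag` parameter, and -psm 1 needs the OSD traineddata installed.

```python
import shlex


def tesseract_command(image, output_base, psm=None, configfiles=(), psm_flag="-psm"):
    """Build a tesseract argument list (hypothetical helper, not part of islandora_ocr).

    Options are placed before any config file, as the usage text requires.
    `psm_flag` is "-psm" as in the usage shown in this thread; pass "--psm"
    if your tesseract version expects that spelling.
    """
    cmd = ["tesseract", image, output_base]
    if psm is not None:
        if not 0 <= psm <= 10:
            raise ValueError("pagesegmode must be between 0 and 10")
        cmd += [psm_flag, str(psm)]
    cmd += list(configfiles)  # e.g. "hocr" to request hOCR output
    return cmd


# Same hypothetical sideways page, once with defaults and once with OSD enabled.
default_cmd = tesseract_command("sideways_page.tif", "out_default")
osd_cmd = tesseract_command("sideways_page.tif", "out_psm1", psm=1, configfiles=["hocr"])
print(shlex.join(default_cmd))
print(shlex.join(osd_cmd))
```

Running both with `subprocess.run(cmd, check=True)` and comparing the two outputs should show whether -psm 1 changes anything for your sideways pages.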
