Correcting OCR?


Alex Kent

Dec 8, 2014, 12:04:50 PM
to isla...@googlegroups.com
Is there a way to manually correct the OCR datastream?  For example, if you want to add a name to the OCR text, or change a name.   Thanks very much! 

Diego Pino

Dec 8, 2014, 2:07:20 PM
to isla...@googlegroups.com
Hi Alex,
I haven't tested it, but there is a permission (in Drupal, provided via Islandora's Islandora Paged Content module) for that specific task, so it should be possible. Enable this permission for your role and try editing the OCR stream.
It's in the permissions, under:
Islandora Paged Content
Edit OCR stream

Diego

Alex Kent

Dec 9, 2014, 10:39:35 AM
to isla...@googlegroups.com
Thanks! It looks like this is working. I've created documentation on this; it can be found at:

Diego Pino

Dec 9, 2014, 2:52:28 PM
to isla...@googlegroups.com
Nice!

Kara Reuter

Dec 10, 2014, 12:19:12 PM
to isla...@googlegroups.com
Thanks for this, both of you!  This leads me to another, related question -- is there any way to correct the HOCR datastream?  Thanks!

Donald Moses

Dec 10, 2014, 3:34:51 PM
to isla...@googlegroups.com
Hi Kara:

I had a similar question as well.  Editing the HOCR makes more sense to me as it could be transformed to OCR by stripping the tags. The streams would be synced with the corrections. 
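The "stripping the tags" idea can be sketched roughly as follows. This is only an illustration using Python's standard library, not code from Islandora or any existing module: the names `HocrToText`, `hocr_to_text` and `SAMPLE_HOCR` are made up for this example, and it assumes the HOCR marks lines and paragraphs with the usual `ocr_line` / `ocr_par` classes.

```python
from html.parser import HTMLParser

# Hypothetical sample; real HOCR from an OCR engine will be much longer.
SAMPLE_HOCR = """<html><head><title>page</title></head><body>
<div class='ocr_page'><p class='ocr_par'>
<span class='ocr_line' title='bbox 10 10 200 30'><span class='ocrx_word' title='bbox 10 10 60 30'>Hello</span> <span class='ocrx_word' title='bbox 70 10 200 30'>world</span></span>
<span class='ocr_line' title='bbox 10 40 200 60'><span class='ocrx_word' title='bbox 10 40 90 60'>Second</span> <span class='ocrx_word' title='bbox 100 40 200 60'>line</span></span>
</p></div></body></html>"""


class HocrToText(HTMLParser):
    """Drop all markup, keeping one line of text per ocr_line element."""

    def __init__(self):
        super().__init__()
        self.lines = []        # finished lines of plain text
        self._current = []     # text chunks of the line being built
        self._in_head = False  # skip <title> and other <head> content

    def _flush(self):
        # Join the chunks and collapse runs of whitespace.
        text = " ".join(" ".join(self._current).split())
        if text:
            self.lines.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self._in_head = True
        classes = (dict(attrs).get("class") or "").split()
        # A new line or paragraph ends whatever line came before it.
        if "ocr_line" in classes or "ocr_par" in classes:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False

    def handle_data(self, data):
        if not self._in_head:
            self._current.append(data)

    def close(self):
        super().close()
        self._flush()  # emit the last line


def hocr_to_text(hocr: str) -> str:
    """Return the plain text of an HOCR document, one OCR line per text line."""
    parser = HocrToText()
    parser.feed(hocr)
    parser.close()
    return "\n".join(parser.lines)


if __name__ == "__main__":
    print(hocr_to_text(SAMPLE_HOCR))
```

With something like this, a corrected HOCR stream could be turned back into an OCR stream after each edit, which is what would keep the two in sync.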

For newspaper page correction especially ... displaying a paragraph of HOCR for correction would seem to be a good approach.

There's an interesting Firefox plugin [1] for editing HOCR ... would have to be wired in.

Donald

[1] https://addons.mozilla.org/en-US/firefox/addon/hocr-editor/

Diego Pino

Dec 11, 2014, 8:18:10 AM
to isla...@googlegroups.com

Hi, I also found these nice Python scripts: https://github.com/tmbdev/hocr-tools

Reading the source code, I find most of the logic simple to implement in PHP + JS.

This could make a nice add-on module that implements a new editing form for this DS. If someone really (really) needs this, I could try to implement some basic functionality and share it through git so others can refine it. Let's make a poll and ask around!

Have a nice day

Alex Kent

Dec 11, 2014, 8:28:21 AM
to isla...@googlegroups.com
This sounds like a really useful module idea.  +1 from me to build it.  There is strong interest in editing OCR text from one of our sites, and I know the others are interested/would be interested as well. 

For now, I'm going to recommend that they go ahead and edit the OCR datastream to make changes if they need to.   Or would you recommend not doing that?  

Thanks!

Mark Jordan

Dec 11, 2014, 10:01:27 AM
to isla...@googlegroups.com
Diego, SFU votes +1 for editing the HOCR and deriving the OCR ds from it (as Don described).

Mark



Diego Pino

Dec 11, 2014, 2:13:05 PM
to isla...@googlegroups.com
Hi, great! +2 is enough to get me motivated =).
Just give me a few days (5?) to have a working prototype, and I'll keep the conversation going in this same post.

See you around.

Donald Moses

Dec 11, 2014, 2:40:35 PM
to isla...@googlegroups.com
Thanks Diego!
If you need a use case fleshed out or a tester, let me know. I'd be happy to help.
Donald

Peter MacDonald

Sep 21, 2017, 2:29:56 PM
to islandora
Is there any module that offers a permission option such as "Edit TRANSCRIPT stream" similar to the "Edit OCR stream" permission currently available in the "Islandora Paged Content"?

I vaguely recall that there was a standalone module (or just some code) for this being worked on for Islandora, but I haven't been able to locate it anywhere.

I can always just download the datastream, edit it, and re-upload it, but I'd like to grant that permission on a granular basis.

Peter MacDonald

Alex Kent

May 3, 2018, 9:55:44 AM
to islandora
Reviving this to ask, is there a way to edit the HOCR datastream the same way as the OCR datastream? We have seen that the HOCR does not get updated after the OCR datastream is edited.  Should it? Is there a step or a configuration setting we are missing to have it get updated with the edits? 

Thank you in advance!

Alex 

dp...@metro.org

May 3, 2018, 11:12:44 AM
to islandora
Hi Alex
OCR and HOCR are not mutually dependent in the derivative workflow, so a change to the OCR will never trigger a new HOCR, especially because there is no way to generate HOCR from OCR. HOCR keeps track of the position of each letter/word inside the source image, so it is fully dependent on the source image, whereas OCR is just extracted text and stays usable even if the source image is scaled, rotated, etc. You could, however, generate OCR from HOCR, so if that is needed, it is a piece of code that someone could write and contribute.
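To make the difference concrete, here is a rough sketch (Python standard library only, not Islandora code) that reads the word positions out of an HOCR fragment. It assumes words are marked with the `ocrx_word` class and carry a `title` attribute of the form `bbox x0 y0 x1 y1`; the names `WordBoxParser` and `hocr_word_boxes` and the sample fragment are invented for this example.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an HOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")

# Hypothetical fragment; the second word contains nested markup.
SAMPLE_LINE = (
    "<span class='ocr_line' title='bbox 10 10 200 30'>"
    "<span class='ocrx_word' title='bbox 10 10 60 30; x_wconf 95'>Hello</span> "
    "<span class='ocrx_word' title='bbox 70 10 200 30; x_wconf 91'>"
    "<strong>world</strong></span></span>"
)


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bounding box of the word being read, if any
        self._text = []     # text chunks of that word
        self._depth = 0     # nesting depth of tags inside the word

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            self._depth += 1  # e.g. <strong> inside a word
            return
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            match = BBOX_RE.search(attr.get("title") or "")
            if match:
                self._bbox = tuple(int(n) for n in match.groups())
                self._text = []
                self._depth = 0

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        # Closing tag of the word itself: store it with its position.
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def hocr_word_boxes(hocr: str):
    """Return a list of (word, bounding box) pairs found in the HOCR."""
    parser = WordBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word, box in hocr_word_boxes(SAMPLE_LINE):
        print(word, box)
```

The plain OCR stream contains only the words, with none of these coordinates, which is why it cannot be turned back into HOCR without running OCR on the image again.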

Best

Diego