OCR management on 7.x


wgma...@gmail.com

Apr 15, 2021, 1:43:22 PM
to islandora
Good day everyone! I was wondering if anyone had any insight as to how to deal with a seemingly tricky problem.

I started off ingesting newspapers with nothing but TIFFs and MODS. The toolchain did all the work of making the derivatives: e.g. OCR and HOCR (and PDFs... usually). Things were consistent. All was well.

But it turned out we had paid our scanning provider to create OCR, so I was asked to include it at ingest time. The provider's OCR *was* marginally cleaner, and may even have been column-aware in some cases. But this introduced two subtle problems:

- Since the provider's OCR was plain text, with no word coordinates, the OCR work basically had to be done all over again just to produce HOCR.
- And now the OCR and HOCR text would differ, so occasionally people would get a word in search results that they couldn't find in the newspaper issue, and vice versa.

I was originally willing to live with the inconsistency and put a note about it in the user's guide, but my colleagues expressed concern, so I decided to try to fix the problem.

I thought I could fix this by having search use HOCR instead, but HOCR embeds a lot of markup to record where the words actually are, and that markup causes problems, like XML attribute-name warnings being displayed on top of search results. I think it would be best to go back to using OCR. But then I'm left with the inconsistency.

What I want to do is somehow get all of the issues I uploaded OCR with to forget that I supplied OCR, and either redo the OCR (good grief, that will take ages) or, preferably, make the text that was extracted as part of HOCR the official OCR, without all the attendant markup. Is this a thing that's possible?
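(For what it's worth, hOCR is just HTML, so the plain text can be pulled back out of an HOCR datastream without re-running Tesseract. A minimal sketch using Python's standard-library HTMLParser — the class names follow the hOCR conventions Tesseract emits, and emitting a newline per `ocr_line` keeps roughly the same line structure as plain OCR; the function and file handling here are hypothetical, not part of any Islandora module:)

```python
from html.parser import HTMLParser

class HOCRTextExtractor(HTMLParser):
    """Pull plain text out of hOCR markup, one output line per ocr_line span."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._stack = []  # True for container elements carrying the ocr_line class

    def handle_starttag(self, tag, attrs):
        if tag in ("span", "div", "p"):  # the container tags hOCR uses
            self._stack.append("ocr_line" in dict(attrs).get("class", ""))

    def handle_endtag(self, tag):
        if tag in ("span", "div", "p") and self._stack:
            if self._stack.pop():
                self.parts.append("\n")  # end of an ocr_line: start a new text line

    def handle_data(self, data):
        self.parts.append(data)

def hocr_to_text(hocr: str) -> str:
    parser = HOCRTextExtractor()
    parser.feed(hocr)
    # Collapse the whitespace runs the stripped markup leaves behind
    lines = "".join(parser.parts).splitlines()
    return "\n".join(" ".join(l.split()) for l in lines if l.strip())
```

Something like this could, in principle, regenerate an OCR datastream from each page's existing HOCR rather than re-OCRing the TIFFs.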

(Maybe this is best dealt with as part of a migration to 8+ - just do a chunk of issues at a time, redoing all the derivatives and only taking the TIFFs and MODS and PIDs along for the ride. Would also take ages but it would make everything consistent and take advantage of the latest OCR and JP2 optimizations, assuming that's a thing that has happened over the years.)

Ideas appreciated! I also have a virtual machine instance I can do destructive things to without worrying about production - might try things there first!

Cheers,
William Matheson
Library Assistant - Technical
Prince Rupert Library

Rodney Bruce

Apr 15, 2021, 2:35:20 PM
to islandora
William,
I recently had a problem where I had to regenerate HOCR for several thousand newspaper pages. I used the Islandora Datastream CRUD module to accomplish this with Drush.

Here are the steps I followed:

1. Generate a list of PIDs for all of the issues.
(Note: this is on Ubuntu 18.04 using Bash.)
cd /var/www/drupal
drush -y --user=1 --uri=islandora.foobar.edu islandora_datastream_crud_fetch_pids --namespace=FOOrepository --is_member_of=<newspaperPID> > /home/rod/tmp/foobar.txt

2. Generate a list of PIDs for all the pages of the issues listed in step 1.

for pname in `cat /home/rod/tmp/foobar.txt`; do
  drush -y --user=1 --uri=islandora.foobar.edu islandora_datastream_crud_fetch_pids --namespace=FOOrepository --is_member_of=$pname > /home/rod/tmp/newspaper/${pname}.txt
done

After this step you should have a directory full of files, one file per issue, each containing the PIDs for that issue's pages.

3. Delete the OCR and HOCR from each page, then regenerate both.
I did not want to do them all at once, so I would move 50 files from /home/rod/tmp/newspaper to another directory (e.g. /home/rod/tmp/work), process those files, delete them, and move 50 more. I added the sleep at the end of the loop in case other indexing needed to sneak in between batches.
pdir=/home/rod/tmp/work
for fname in `ls -1 ${pdir}`; do
  drush islandora_datastream_crud_delete_datastreams -y --user=1 --uri=islandora.foobar.edu --dsid=OCR --pid_file=${pdir}/${fname}
  drush islandora_datastream_crud_delete_datastreams -y --user=1 --uri=islandora.foobar.edu --dsid=HOCR --pid_file=${pdir}/${fname}
  drush islandora_datastream_crud_generate_derivatives -y --user=1 --uri=islandora.foobar.edu --source_dsid=OBJ --dest_dsids=OCR,HOCR --pid_file=${pdir}/${fname}
  sleep 3
done
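The manual move-50-files-and-process step could also be scripted. A minimal sketch of the batching logic in Python, where `process` stands in for the three drush commands above (e.g. wrapped with `subprocess.run`); the directory names and function names here are hypothetical:

```python
import time
from pathlib import Path

def batches(items, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def process_in_batches(src_dir, work_dir, process, batch_size=50, pause=0):
    """Move batch_size PID files at a time into work_dir, process, then delete them."""
    pid_files = sorted(Path(src_dir).glob("*.txt"))
    for batch in batches(pid_files, batch_size):
        for f in batch:
            target = Path(work_dir) / f.name
            f.rename(target)          # move the PID file into the work directory
            process(target)           # run the drush delete/regenerate commands here
            target.unlink()           # delete the processed file, as in the workflow above
        time.sleep(pause)             # let other indexing sneak in between batches
```

This mirrors the shell loop but removes the manual shuffling between directories.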

I hope this helps.

wgma...@gmail.com

Apr 19, 2021, 4:10:04 PM
to islandora
Hi Rod,
Thanks very much for the suggestion! I do intend to try it out sometime, although I also need to check out the state of things with Islandora 8+, which might obviate some of these issues (and perhaps introduce new ones!).

One thing I'm wondering: does Tesseract, as used in Islandora 7.x for batch newspaper ingests, employ "orientation and script detection" (OSD)? Apparently this helps it recognize multi-column documents, but it's not the default setting: https://stackoverflow.com/a/52102891/1736461

Thanks again, everyone
Will