Good day everyone! I was wondering if anyone had any insight as to how to deal with a seemingly tricky problem.
I started off ingesting newspapers with nothing but TIFFs and MODS. The toolchain did all the work of making the derivatives: e.g. OCR and HOCR (and PDFs... usually). Things were consistent. All was well.
But it turned out we had paid our scanning provider to create OCR, so I was asked to include the provider's OCR at ingest time. The provider's OCR *was* marginally cleaner, and might even have been column-aware in some cases. But this introduced two subtle problems:
- Since the provider's OCR was only text, with no word coordinates, OCR basically had to be done all over again just to produce the HOCR.
- And now the OCR and HOCR text would be different, so occasionally people would get a word in search results that they couldn't find in the newspaper issue, and vice versa.
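To gauge how far the two texts actually diverge for a given page, a rough comparison like the one below might help. It is only a sketch: the function names are made up for illustration, the tokenizer is a naive `\w+` split, and it assumes both the provider's OCR and the text recovered from the HOCR are already available as plain strings.

```python
import re


def tokens(text):
    """Return the set of lowercased word tokens in a block of text."""
    return set(re.findall(r"\w+", text.lower()))


def compare_ocr(provider_text, hocr_text):
    """Report words that appear in only one of the two OCR texts.

    provider_text: plain-text OCR supplied at ingest (hypothetical input)
    hocr_text:     plain text recovered from the HOCR (hypothetical input)
    """
    provider_words = tokens(provider_text)
    hocr_words = tokens(hocr_text)
    return {
        "only_in_provider": sorted(provider_words - hocr_words),
        "only_in_hocr": sorted(hocr_words - provider_words),
        "shared": len(provider_words & hocr_words),
    }


if __name__ == "__main__":
    report = compare_ocr("The harbour was busy", "The harbonr was busy")
    print(report)
```

Words listed under `only_in_provider` are the ones a search could match but a reader would not see highlighted in the viewer; words under `only_in_hocr` are the reverse case.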
I was originally willing to just live with the inconsistency and put a note about it in the user's guide, but my colleagues expressed concern, so I decided to try to fix this problem.
I thought I could fix it by having search use the HOCR instead, but the HOCR has a lot of embedded markup recording where the words actually are, and that markup is causing problems such as XML attribute name warnings being displayed on top of search results. I think it would be best to go back to using OCR. But then I'm left with the problem of the inconsistency.
What I want to do is somehow get all of those issues that I uploaded OCR with to forget that I supplied OCR and either redo the OCR (good grief, that will take ages) or preferably make the OCR that was done as part of HOCR the official OCR, without all the HOCR-attendant markup. Is this a thing that's possible?
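For the second option, one possible approach is to derive the plain text directly from each page's HOCR, so that the two can never disagree. The sketch below uses only Python's standard library and assumes Tesseract-style HOCR, where each word is a `span` with class `ocrx_word` nested inside elements whose classes start with `ocr_` (such as `ocr_line`). The class and function names are hypothetical, and how the result would be written back to each object as its OCR is not covered here.

```python
from html.parser import HTMLParser


class HocrTextExtractor(HTMLParser):
    """Collect the words of an HOCR document, one output line per HOCR line."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []       # finished lines of text
        self._words = []      # words of the line currently being read
        self._in_word = False
        self._buf = []        # text fragments of the current word

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if any(c.startswith("ocr_") for c in classes):
            # A new page, area, paragraph, or line ends the previous line.
            self._flush_line()
        elif "ocrx_word" in classes:
            self._in_word = True
            self._buf = []

    def handle_endtag(self, tag):
        # Assumes words are <span> elements with no nested <span> inside.
        if self._in_word and tag == "span":
            word = "".join(self._buf).strip()
            if word:
                self._words.append(word)
            self._in_word = False

    def handle_data(self, data):
        if self._in_word:
            self._buf.append(data)

    def _flush_line(self):
        if self._words:
            self.lines.append(" ".join(self._words))
            self._words = []

    def close(self):
        super().close()
        self._flush_line()


def hocr_to_text(hocr):
    """Return the plain text of an HOCR string, without any markup."""
    parser = HocrTextExtractor()
    parser.feed(hocr)
    parser.close()
    return "\n".join(parser.lines)
```

Running `hocr_to_text` over a page's HOCR on a test VM and comparing the result with the current OCR would be a low-risk way to check whether the output is good enough before touching anything in production.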
(Maybe this is best dealt with as part of a migration to 8+ - just do a chunk of issues at a time, redoing all the derivatives and only taking the TIFFs and MODS and PIDs along for the ride. Would also take ages but it would make everything consistent and take advantage of the latest OCR and JP2 optimizations, assuming that's a thing that has happened over the years.)
Ideas appreciated! I also have a virtual machine instance I can do destructive things to without worrying about production - might try things there first!
Cheers,
William Matheson
Library Assistant - Technical
Prince Rupert Library