Book ingest PDF with embedded text layer; batch ingest PDF to book


patrick....@commonmediainc.com

Nov 9, 2016, 12:39:46 PM
to islandora
I have PDF files which consist of scanned book pages together with (what is described as) an invisible OCR'd text layer. (Here's something I found explaining that: http://www.searchable-pdf.com/content.php?lang=en&c=61). 

We will be ingesting a few thousand of these files into islandora books. My client prefers to use the embedded OCR text, rather than have Islandora/tesseract generate the OCR, if that is possible. But we need to support the IA Book Reader's text search feature. The "extract text from PDF" option for book ingest does not work with a scanned PDF, even if it has a text layer. 

Has anyone encountered this problem and found a solution?

Second question: Is there an easy way to ingest books from PDFs as a batch process?

Peter MacDonald

Nov 9, 2016, 12:48:43 PM
to isla...@googlegroups.com
Patrick:

I am currently facing the same situation: I have custom PDF files with embedded renderable text that I want to ingest in a book ingest package of TIFF files.

We need a way to extract the PDF's embedded text into an external txt file, say FULL_TEXT.txt, that we add to the ingest package with the MODS and TIFF files. Even if we can accomplish this part, I'm not sure the book content model allows the ingest of the FULL_TEXT datastream. Only testing will tell.

Peter MacDonald


--
For more information about using this group, please read our Listserv Guidelines: http://islandora.ca/content/welcome-islandora-listserv
---
You received this message because you are subscribed to the Google Groups "islandora" group.
To unsubscribe from this group and stop receiving emails from it, send an email to islandora+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/islandora.
To view this discussion on the web visit https://groups.google.com/d/msgid/islandora/39f2fe12-a6e6-490e-98bb-f04088485841%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Peter MacDonald,
Library Information Systems Specialist
Hamilton College Library
Clinton, New York
315 859-4493
pmacdona-hamilton (Skype)

patrick....@commonmediainc.com

Nov 9, 2016, 12:57:47 PM
to islandora
I'm not sure what you mean by "embedded renderable text", but it sounds different from the invisible text layer in my files, which is not 'renderable' as I understand the term. Do you mean that your PDFs do not consist of scanned pages?

In any case, extracting that text from the PDF and ingesting it as a separate datastream does not seem to provide any help for the problem of having this text be searchable in IA Book Reader.

Peter MacDonald

Nov 9, 2016, 1:03:21 PM
to isla...@googlegroups.com
Patrick:

Oh, I see that your use case is quite different from ours.

Although we ingest the TIFFs using the book CM, we do not use IA as the viewer; we use pdf.js. Users thus see only the PDF, and user search goes against the Solr index of the FULL_TEXT datastream created from the PDF.

This is, as you pointed out, different from making the PDF's text available for IA searching.

Peter


Giancarlo Birello

Nov 9, 2016, 1:37:33 PM
to isla...@googlegroups.com

Hi,

I manage PDF/A files (scanned PDF + OCR text layer) as books, and I use these steps:

- pdftk + ImageMagick to generate TIFFs, one file per page

- the docsplit utility to extract text from the PDF pages, one file per page

- prepare the directory structure needed by book batch ingesting (one directory per page, with OBJ.tif, OCR.txt, DC.xml, ...)

- batch ingest (see the Islandora Book Batch module)
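A minimal sketch of the directory-preparation step, assuming the one-directory-per-page layout described above (OBJ.tif plus OCR.txt in each numbered page folder). The function name and arguments here are hypothetical, invented for illustration; only the layout comes from the thread:

```python
import shutil
from pathlib import Path

def build_book_package(book_dir, tiffs, texts):
    """Arrange per-page TIFF and text files into the book-batch layout:
    one numbered subdirectory per page containing OBJ.tif and OCR.txt."""
    book_dir = Path(book_dir)
    for n, (tif, txt) in enumerate(zip(tiffs, texts), start=1):
        page_dir = book_dir / str(n)
        page_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(tif, page_dir / "OBJ.tif")  # the page image
        shutil.copy(txt, page_dir / "OCR.txt")  # the pre-extracted text
    return book_dir
```

Additional per-page files (DC.xml, HOCR.html, ...) could be dropped into the same page directories before running the batch ingest.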

But there is a catch: while OCR.txt is indexed by Solr and used by the simple and advanced search blocks, IA uses the HOCR datastream, which at the moment is generated by Tesseract during derivative generation at ingest. I searched, but I didn't find any way to generate HOCR from PDF/A directly.

So I have a full-text search based on the OCR datastream, while the IA search is based on the HOCR datastream; at the moment this is OK for me.

Sorry for my confusing explanation ...

cheers

Giancarlo


patrick....@commonmediainc.com

Nov 10, 2016, 12:25:00 PM
to islandora
Thank you so much Giancarlo! 

Unless anyone has other ideas, I'll take this as the accepted answer.

Jared Whiklo

Nov 10, 2016, 1:45:08 PM
to isla...@googlegroups.com
I agree with Giancarlo.

If you generate your derivatives prior to ingest, it makes ingest much faster as a side benefit, whether they are extracted PDF OCR or just regular thumbnails.

cheers,
jared


--
Jared Whiklo
jwh...@gmail.com
--------------------------------------------------
I intend to live forever--so far so good.


Danielle Robichaud

May 9, 2017, 5:47:12 PM
to islandora
I'm currently working through a similar issue and wonder if I missed any developments on this front.

We have a series of digitized books/diaries/newsletters, many of which have corresponding text files that were generated via Adobe text recognition or transcription. The items are predominantly archival so the presence of typewritten or handwritten text means that Tesseract is generating - to be generous - less than ideal text files. (I've attached examples of a text file we generated and a text file Tesseract generated upon ingestion.)

Our ideal use case is having these items viewed by way of the IA Book viewer, but the inability to bypass or replace the generated Tesseract files, so that the corresponding text files become the basis for the OCR and HOCR datastreams, is making it difficult to ensure the files are accessible or reliably searchable. Giancarlo's approach of ingesting OCR text files for each page so that the simple and advanced search returns a hit is a good one. For us, though, it's next to impossible to find where the term appears in the book once you click into the IA viewer, because the Tesseract files used for the IA search are so poor. It's not the end of the world for short files, but some of the books are 100+ pages long.

I have considered abandoning the IA Book viewer for the PDF content model, but there are a number of items we'd like to make viewable online but not downloadable, so a content model switch only solves part of the problem.

I've had a look through the current tickets and don't believe this is being worked on, so I acknowledge that I'm asking about developments as an act of wishful thinking...!
UWGeneratedOCR.JPG
TesseractGeneratedOCR.JPG

Giancarlo Birello

May 10, 2017, 2:25:29 AM
to isla...@googlegroups.com

Hi Danielle,

I'm not sure I understood: if the problem is replacing the current OCR datastream with an externally generated one, you can achieve that with the powerful CRUD module (thanks, thanks, Mark): https://github.com/mjordan/islandora_datastream_crud.

We work with externally generated OCR and Tesseract HOCR. I know there can be differences between the OCR and HOCR, but take into account that HOCR is used only for the IAB internal search, while the Solr full-text search relies on the OCR datastream, so the result can be a good full-text search with no IAB text highlighting.
For us (and for the users) it is more important to use IAB than to avoid these little differences.

Have a good day,

Giancarlo


Danielle Robichaud

May 10, 2017, 9:18:03 AM
to islandora
Hi Giancarlo,

Thanks for pointing me to CRUD - this is the first time it's crossed my radar.

Boiling it right down: Not being able to replace the HOCR is the issue I'm trying to think through.

If, for example, we suppress the book pages from the search results (https://jira.duraspace.org/browse/ISLANDORA-1533), which we were planning on doing, people won't be able to find the page where their search term hit once they click into the IA Book view because the HOCR datastream is so poor. I've attached more screenshots to illustrate what I'm getting at.

Based on what I can tell there are three options:

1) Switch to the PDF model
2) Suppress the pages and live with people having to look for the term themselves.
3) Leave the pages in the search results so people are, at least, directed to the page where the term hit, even if they have to look for it themselves once they get there.

My biggest barrier is not having enough technical skill to attempt to fix these things myself, while understanding more than most admins, so I'm perpetually pretty sure something is possible but never completely sure whether I'm aiming too high.

Am I completely off the mark in thinking that it's possible to develop an option, at ingest, to pick either the embedded text file(s) or the Tesseract output as the basis for the HOCR datastream(s)? Because if I'm not, maybe this is something UW Library can take a stab at and push back out to the community.

tl;dr Sometimes I want things to happen that aren't possible, but I'm stubborn. And possibly naive.

Thanks for the input!



post-ingestion view.JPG
ocrupdated.JPG
nodice.jpg

patrick....@commonmediainc.com

May 10, 2017, 9:48:01 AM
to islandora
I did some searching for tools that will extract the text layer from PDF/A into HOCR, and so far have not found any. 

I suppose that the rationale is that you should use whatever system was used to originally generate the PDF/A files to export HOCR. That seems like a reasonable suggestion: can you use the system that generated the PDF/A files to generate image files and HOCR?

If that's not possible, then finding a workflow that extracts HOCR from PDF/A is still the only option (assuming Tesseract's HOCR is not usable, as this discussion does). I did a little googling and found a number of tools that will extract the text layer from PDF/A, with coordinates: http://stackoverflow.com/questions/6187250/pdf-text-extraction. There may be other tools that will generate HOCR from such output, or perhaps one of these tools will generate HOCR directly.
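For what it's worth, once one of those tools has produced words with coordinates, wrapping them in minimal hOCR is mostly templating. A hedged sketch follows: the function name and the (text, bbox) input format are invented for illustration; only the ocr_page/ocrx_word classes and the bbox title property come from the hOCR format itself:

```python
from html import escape

def words_to_hocr(words, page_w, page_h):
    """Render (text, (x0, y0, x1, y1)) tuples as a minimal hOCR page.
    Coordinates are pixels with the origin at the top-left corner."""
    spans = "\n".join(
        '    <span class="ocrx_word" title="bbox %d %d %d %d">%s</span>'
        % (x0, y0, x1, y1, escape(text))
        for text, (x0, y0, x1, y1) in words
    )
    return (
        '<div class="ocr_page" title="bbox 0 0 %d %d">\n%s\n</div>'
        % (page_w, page_h, spans)
    )
```

Whether output this spare satisfies the IA Book Reader's search callback would need testing; it carries only the word boxes, none of the line/paragraph structure Tesseract emits.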

Giancarlo Birello

May 10, 2017, 9:50:54 AM
to isla...@googlegroups.com

On 10/05/2017 15:18, Danielle Robichaud wrote:
Hi Giancarlo,

Thanks for pointing me to CRUD - this is the my first time it's crossing my radar.

Boiling it right down: Not being able to replace the HOCR is the issue I'm trying to think through.
You can achieve this with CRUD.


If, for example, we suppress the book pages from the search results (https://jira.duraspace.org/browse/ISLANDORA-1533), which we were planning on doing, people won't be able to find the page where their search term hit once they click into the IA Book view because the HOCR datastream is so poor. I've attached more screenshots to illustrate what I'm getting at.

Based on what I can tell there are three options:

1) Switch to the PDF model
2) Suppress the pages and live with people having to look for the term themselves.
3) Leave the pages in the search results so people are, at least, directed to the page where the term hit, even if they have to look for it themselves once they get there.
We chose to leave pages in the search results; at least the user is then directed to the page in IAB where the term can be found.
We limited the pages' DC to only title (page number) and date, so the search results are more consistent.
In addition, we give the DC elements a higher weight than OCR in the "Query fields" at admin/islandora/search/islandora_solr/settings (i.e. dc.title^8 dc.subject^5 dc.description^5 dc.creator^5 OCR_t^1).
We would also like to modify the search results (as in our old Islandora 6 site) to show the book thumbnail next to the page when full text is retrieved (TODO).



My biggest barrier is not having enough tech to attempt to fix these things, but understanding more than most admins so that I'm perpetually pretty sure something is possible, but I'm never completely sure if I'm aiming too high.

Am I completely off the mark in thinking that it's possible to develop the option, at ingest, to pick either the embedded text file(s) OR the Tesseract output as the basis for the HOCR datastream(s)? Because if I'm not, maybe this is something UW Library can take a stab at making it happen and push back out to the community.
Book batch ingesting (https://wiki.duraspace.org/display/ISLANDORA/Islandora+Book+Batch) makes it possible to provide your own pre-generated HOCR.
I prepare (I = a bash script) one folder per book, within it one folder per page, and within each page folder any files for the page datastreams I want to ingest directly (i.e. DC.xml, HOCR.html, OBJ.tif, OCR.txt, PDF.pdf): at ingest time, datastreams already present in the folder are ingested directly, without regeneration.

If your question is "can I make HOCR starting from a good OCR?": I googled this but didn't find any solution, I think due to the intrinsic differences between plain OCR and coordinate-based HOCR.

Giancarlo


dp...@metro.org

May 10, 2017, 1:02:31 PM
to islandora
Hi Folks, you always have such interesting conversations!

I found myself a few months ago (and am still processing) with a similar problem: 22,000 pages of PDFs with embedded/overlay text layers (hidden and/or rendered) generated by ABBYY, which had already been manually corrected, because automatic OCR was failing badly on the handwriting (annotations) over old smeary typewriter ink on old, dusty, grainy paper: a nightmare. We don't have access to the enterprise server version of ABBYY (which provides ABBYY XML export capabilities via API!), and we did not do the scanning, OCR, and correcting ourselves, so I had to do some manual tweaking and bear a lot of suffering.

My workflow:

1. http://www.unixuser.org/~euske/python/pdfminer/ - specifically pdf2txt.py. It takes a PDF and extracts a kind of proprietary XML and/or HTML format with boundaries. It is hard to control, since you have to manually tweak the letter and word separation to get correct grouping, but you get something that makes sense (boundaries, letters, and processable tags) from a nasty PDF. Nice!
2. A custom PHP file that parses that output (I like Python, but I love PHP) and checks whether pdf2txt.py did a good job, grouping and regrouping, fixing single letters, etc. Someday I will share it; for now it is too beta and I'm still running a batch of PDFs. But anyone can write that. It took me only a few hours.
3. Alternatively (skip 2 if you trust the pdf2txt.py output), an XSLT that takes the pdf2txt proprietary format and dumps HOCR. Example of ABBYY XML to HOCR: https://gist.github.com/tfmorris/5977784 (WHY, WHY!! I wish I had access to ABBYY XML instead of all that extra work)...

From there, Giancarlo's workflow is the one you should follow. On our side we use our multi importer module, but it's the same logic: don't reprocess the OCR, just massage the format and ingest the individual preprocessed datastreams.

Lastly: has anyone tried Apache Tika lately?

Final comments:

The way we are dealing with HOCR in Islandora is kind of complicated and not very optimal. E.g., on the extraction/searching side: we make a query to Solr to get the word vector and highlight data (no bounding boxes there), then we load the HOCR datastream directly from the objects and iterate over it, extracting the bounding box information for the matching terms, then we transform that info into something the IA Book Reader can read. Three modules are involved in that: OCR, Paged Content, and IA Book Reader. What we really need at the IA reader level could be satisfied by any bounding-box word structure (we are tying it to HOCR, but it could be anything else). I don't have time to devote much extra time to the IA Book Reader, but if we could provide an alternative callback function or URL for searches (hook, admin-provided, etc.) we could use any other XML/HTML/image-bound word vector structure. HOCR is a cool standard, but too few applications use it to make it a "widely used standard". Tesseract is great (really), but if you need to correct the OCR and test those corrections, ABBYY ends up winning just because of its UI tools and workflows, so the chances are high that other, non-HOCR word vector standards will be needed.
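The "load the HOCR and extract bounding boxes for the matching terms" step described above could look roughly like this. This is a simplified illustration, not the actual module code: it assumes Tesseract-style ocrx_word spans with the class attribute before the title attribute, and uses a regex rather than a real HTML parser:

```python
import re

# Matches Tesseract-style hOCR word spans, e.g.
# <span class='ocrx_word' title='bbox 100 200 160 220; x_wconf 95'>word</span>
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"][^'\"]*bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>"
    r"(.*?)</span>",
    re.DOTALL,
)

def find_term_boxes(hocr, term):
    """Return bounding boxes (x0, y0, x1, y1) of words matching term,
    ignoring case, nested markup, and trailing punctuation."""
    term = term.lower()
    boxes = []
    for m in WORD_RE.finditer(hocr):
        text = re.sub(r"<[^>]+>", "", m.group(5)).strip()
        if text.lower().strip(".,;:!?\"'") == term:
            boxes.append(tuple(int(m.group(i)) for i in range(1, 5)))
    return boxes
```

A callback taking any word-plus-box structure (as suggested above) would only need to return data in this (term, boxes) shape for the viewer to highlight it.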

Diego Pino N
Metro.org

Danielle Robichaud

May 12, 2017, 9:08:37 AM
to islandora
Hi everyone,

Thank you very much for all of the input - it's been incredibly helpful (even if I haven't understood everything)!

As a final comment, I want to flag for anyone reading this but not responding that the Islandora Transcript work Nick Ruest has done is another workaround for not getting exactly what you want but getting what you need - in this case, some kind of Ctrl+F keyword searching within the full transcript/text file: https://github.com/yorkulibraries/islandora_transcript

There's an example of it in use here: https://digital.library.yorku.ca/yul-307926/letter-mrs-stepler-gordon-stepler-august-23-1916/transcript

Finally, an Islandora Show and Tell blog post from 2015 (https://islandora.ca/content/islandora-show-and-tell-marsden-online-archive) highlights another interesting approach to surfacing full-text transcriptions, though it looks like it may have been abandoned, based on the current setup of the example used (http://marsdenarchive.otago.ac.nz/MS_0054_043#page/1/mode/1up)?