--
For more information about using this group, please read our Listserv Guidelines: http://islandora.ca/content/welcome-islandora-listserv
---
You received this message because you are subscribed to the Google Groups "islandora" group.
To unsubscribe from this group and stop receiving emails from it, send an email to islandora+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/islandora.
To view this discussion on the web visit https://groups.google.com/d/msgid/islandora/39f2fe12-a6e6-490e-98bb-f04088485841%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
I'm not sure what you mean by "embedded renderable text", but it sounds different from the invisible text layer in my files, which is not 'renderable' as I understand the term. Do you mean that your PDFs do not consist of scanned pages? In any case, extracting that text from the PDF and ingesting it as a separate datastream does not seem to help with making this text searchable in IA Book Reader.
Hi,
I manage PDF/A files (scanned PDF + OCR text) as books, and I use these steps:
- pdftk + imagemagick to generate TIFFs, one file per page
- docsplit utility to extract text from the PDF pages, one file per page
- prepare the directory structure needed by book batch ingest (one directory per page with OBJ.tif, OCR.txt, DC.xml, ...)
- batch ingest (see islandora book ingest module)
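
The directory-preparation step above can be sketched roughly as follows. This is only an illustration, not the script I actually use: the function name, the `page-0001.tif` / `page-0001.txt` naming, and the choice of numbering page directories by sequence are all assumptions, so check them against what your batch module expects.

```python
import shutil
from pathlib import Path


def build_book_batch_dirs(tiff_dir, text_dir, out_dir):
    """Copy per-page TIFF and text files into one directory per page.

    Assumptions (hypothetical, adjust to your own files):
    - sorting the TIFF file names gives the reading order;
    - a page's text file shares the TIFF's stem (page-0001.tif / page-0001.txt);
    - page directories are named 1, 2, 3, ... by sequence.
    """
    tiff_dir, text_dir, out_dir = Path(tiff_dir), Path(text_dir), Path(out_dir)
    created = []
    for seq, tiff in enumerate(sorted(tiff_dir.glob("*.tif")), start=1):
        page_dir = out_dir / str(seq)
        page_dir.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(tiff, page_dir / "OBJ.tif")
        text = text_dir / (tiff.stem + ".txt")
        if text.exists():  # pages without extracted text get no OCR.txt
            shutil.copyfile(text, page_dir / "OCR.txt")
        created.append(page_dir)
    return created
```

Per-page DC.xml or other metadata files would be copied in the same way.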
but ...
While OCR.txt is indexed by Solr and used by the simple or advanced
search block, IA uses the HOCR datastream, which at the moment is
generated by Tesseract during derivative generation at ingest. I
searched but didn't find any way to generate HOCR from PDF/A
directly,
so I have a full-text search based on the OCR datastream while IA search is based on the HOCR datastream. At the moment this is OK for me.
Sorry for my confusing explanation ...
cheers
Giancarlo
Hi Danielle,
I'm not sure I understood. If the problem is replacing the current OCR datastream with an externally generated one, you can achieve that with the powerful CRUD module (thanks, Mark): https://github.com/mjordan/islandora_datastream_crud.
We work with externally generated OCR and Tesseract HOCR. I know
there could be differences between OCR and HOCR, but take into account
that HOCR is used only for IAB internal search while Solr full-text
search relies on the OCR datastream, so the result could be a good
full-text search with no IAB text highlighting.
For us (and for our users) it is more important to use IAB than to
avoid these small differences.
Have a good day,
Giancarlo
Hi Giancarlo,
Thanks for pointing me to CRUD - this is the first time it has crossed my radar.
Boiling it right down: Not being able to replace the HOCR is the issue I'm trying to think through.
If, for example, we suppress the book pages from the search results (https://jira.duraspace.org/browse/ISLANDORA-1533), which we were planning on doing, people won't be able to find the page where their search term hit once they click into the IA Book view because the HOCR datastream is so poor. I've attached more screenshots to illustrate what I'm getting at.
Based on what I can tell there are three options:
1) Switch to the PDF model
2) Suppress the pages and live with people having to look for the term themselves.
3) Leave the pages in the search results so people are, at least, directed to the page where the term hit, even if they have to look for it themselves once they get there.
My biggest barrier is not having enough tech to attempt to fix these things, but understanding more than most admins so that I'm perpetually pretty sure something is possible, but I'm never completely sure if I'm aiming too high.
Am I completely off the mark in thinking that it's possible to develop the option, at ingest, to pick either the embedded text file(s) OR the Tesseract output as the basis for the HOCR datastream(s)? Because if I'm not, maybe this is something UW Library can take a stab at building and push back out to the community.
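
To make the idea concrete, here is a very rough sketch of what building HOCR from embedded text might involve. It assumes the words and their bounding boxes have already been pulled out of the PDF and converted to pixel coordinates on the page image (that extraction step is not shown, and the function name and input format are made up for illustration). It emits only page and word elements, not the area, paragraph, and line elements Tesseract produces, and whether IA Book Reader would accept output this minimal is untested.

```python
from html import escape


def words_to_hocr(words, page_width, page_height):
    """Build a minimal hOCR page from (text, x0, y0, x1, y1) tuples.

    Coordinates are assumed to be pixels with the origin at the top left
    of the page image, as in hOCR 'bbox' properties.
    """
    spans = []
    for i, (text, x0, y0, x1, y1) in enumerate(words, start=1):
        spans.append(
            f"<span class='ocrx_word' id='word_1_{i}' "
            f"title='bbox {x0} {y0} {x1} {y1}'>{escape(text)}</span>"
        )
    body = " ".join(spans)
    return (
        "<html><body>"
        f"<div class='ocr_page' id='page_1' "
        f"title='bbox 0 0 {page_width} {page_height}'>{body}</div>"
        "</body></html>"
    )
```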
tl;dr Sometimes I want things to happen that aren't possible, but I'm stubborn. And possibly naive.
Thanks for the input!