Hyrax full-text indexing - can't get it working, what am I missing?

Kerchner, Daniel

Sep 19, 2017, 8:38:25 PM

to samver...@googlegroups.com

I'm having some trouble getting full-text indexing to work, and I would welcome any thoughts as to what I might have missed. It is my understanding that this should work in Hyrax 1.0.4 without any additional configuration (correct me if that is not the case).

To isolate the issue, I created a brand new Hyrax 1.0.4 app, following https://github.com/samvera/hyrax/tree/v1.0.4 . To keep it simple, I created a "vanilla" work type — quite literally: rails generate hyrax:work VanillaWork

The only other change I made was to set config.active_job.queue_adapter = :inline in config/application.rb

I'm running in dev mode using the out-of-the-box solr_wrapper/fcrepo_wrapper, i.e. running rake hydra:server , and creating new works (and attaching files) through the UI.

Symptoms:

- Search *does* match words in works' metadata successfully
- Search *does* match filenames of works' files successfully
- Search DOES NOT match words in the text of PDF files
- Search DOES NOT match words in the text of DOCX files

- I notice that rake solr:reindex results in this error: "NameError: uninitialized constant ResolrizeJob"

I've also noted that:

- Thumbnail derivatives are being created successfully (and I also see the thumbnail jpegs under tmp/derivatives)
- soffice is installed and its path is config'ed in hyrax.rb
- Solr and Fedora look okay as far as I can tell.

I don't have the Solr/Tika skills to know where in Solr I would expect to see the indexed tokens resulting from full-text indexing of the PDF/Word docs, so I can't really verify that one way or the other.

I did read through https://groups.google.com/forum/#!topic/samvera-tech/vmguu7rKkjo but I didn't see anything that resolves the issue.

What else should I check? What might I be missing?

Thanks in advance!!

- Dan

Dan Kerchner
Senior Software Developer, Scholarly Technology Group
The George Washington University Libraries
Gelman Library
2130 H Street, NW
Washington, DC 20052
kerc...@gwu.edu

J Kim

Sep 20, 2017, 10:53:21 AM

to samvera-tech

Hi Dan,

I was having the same issue as you, but the thread you referred to did help me. Did you try Jose's modification?

https://github.com/mlibrary/umrdr/pull/700/files

If you already tried it, my apologies! I just wanted to make sure you didn't overlook it. To be fair, I've not tried docx, but PDFs get indexed for me with that mod.

Janice

Kerchner, Daniel

Sep 24, 2017, 1:58:51 PM

to samver...@googlegroups.com

Thank you Janice. I tried Jose's modification again just for good measure, but unfortunately I'm still not getting search results matching on terms appearing in the text of the PDFs.

Other ideas?

--
You received this message because you are subscribed to the Google Groups "samvera-tech" group.
To unsubscribe from this group and stop receiving emails from it, send an email to samvera-tech+unsubscribe@googlegroups.com.
To post to this group, send email to samver...@googlegroups.com.
Visit this group at https://groups.google.com/group/samvera-tech.
To view this discussion on the web visit https://groups.google.com/d/msgid/samvera-tech/b0a44c17-7ce6-447b-ada6-b0d93c1de8c9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jose Blanco

Sep 25, 2017, 10:13:02 AM

to samver...@googlegroups.com

Did you
1. make the code change
2. load the work and file
3. do the search.

The change alone will not work with existing works.

-Jose

Kerchner, Daniel

Sep 25, 2017, 12:18:51 PM

to samver...@googlegroups.com

Thanks Jose. I did indeed reload the work and file after making the change and restarting.

I don't know how to verify that the file is getting indexed properly, but from what I can tell, the search being sent to solr does include the all_text_timv parameter. Here's what I see in the solr log when I execute the search (my search term is "cutePDF"):

2017-09-25 16:09:44.141 INFO (qtp834133664-20) [ x:hydra-development] o.a.s.c.S.Request [hydra-development] webapp=/solr path=/select params={facet.field=human_readable_type_sim&facet.field=resource_type_sim&facet.field=creator_sim&facet.field=contributor_sim&facet.field=keyword_sim&facet.field=subject_sim&facet.field=language_sim&facet.field=based_near_sim&facet.field=publisher_sim&facet.field=file_format_sim&facet.field=member_of_collections_ssim&facet.field=generic_type_sim&qt=search&user_query=cutePDF&f.language_sim.facet.limit=6&f.file_format_sim.facet.limit=6&f.based_near_sim.facet.limit=6&f.publisher_sim.facet.limit=6&f.resource_type_sim.facet.limit=6&fq=&fq={!terms+f%3Dhas_model_ssim}VanillaWork,Collection&fq=-suppressed_bsi:true&fq=&fq=-suppressed_bsi:true&sort=score+desc,+system_create_dtsi+desc&rows=10&f.creator_sim.facet.limit=6&f.human_readable_type_sim.facet.limit=6&f.member_of_collections_ssim.facet.limit=6&q={!lucene}_query_:"{!dismax+v%3D$user_query}"+_query_:"{!join+from%3Did+to%3Dfile_set_ids_ssim}{!dismax+v%3D$user_query}"&f.contributor_sim.facet.limit=6&qf=title_tesim+description_tesim+keyword_tesim+subject_tesim+creator_tesim+contributor_tesim+publisher_tesim+based_near_tesim+language_tesim+date_uploaded_tesim+date_modified_tesim+date_created_tesim+rights_tesim+resource_type_tesim+format_tesim+identifier_tesim+file_format_tesim+all_text_timv&pf=title_tesim&facet=true&wt=json&f.keyword_sim.facet.limit=6&f.subject_sim.facet.limit=6} hits=0 status=0 QTime=12
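One way to check whether any full-text tokens reached Solr is to ask Solr directly for FileSet documents that have something in the full-text field. The sketch below only builds such a query URL. The base URL and core name are assumptions taken from the dev defaults that appear in this thread, and the `[* TO *]` existence filter assumes the field is indexed (as far as I know it is indexed but not stored in the default schema, so it would not show up in the returned documents themselves).

```ruby
require "uri"

# Build a Solr /select URL listing FileSet documents that have at least one
# indexed value in the full-text field.
# ASSUMPTIONS: the base URL and core name are the development defaults seen in
# this thread; adjust both for your own setup.
def full_text_check_url(base: "http://localhost:8983/solr/hydra-development",
                        field: "all_text_timv")
  params = {
    "q"    => "*:*",
    # Two filter queries: only FileSets, and only documents where the field exists.
    "fq"   => ["has_model_ssim:\"FileSet\"", "#{field}:[* TO *]"],
    "fl"   => "id,has_model_ssim",
    "rows" => "10",
    "wt"   => "json"
  }
  # encode_www_form repeats the key for array values (fq=...&fq=...).
  "#{base}/select?#{URI.encode_www_form(params)}"
end

puts full_text_check_url
```

If the response reports numFound as 0 even though files with extractable text have been uploaded, the text most likely never made it into the index.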

CAROLYN ANN COLE

Sep 25, 2017, 12:50:26 PM

to samvera-tech

You can look at the model to see if the text is being extracted. The code below assumes a single file in the work. When I run the code below for a PDF in my Sufia 7.2 code I see the text from the pdf.

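vw = VanillaWork.find('<id>')  # replace <id> with the id of your work
fs = vw.file_sets.first        # assumes the work has a single file
fs.extracted_text.content      # the text extracted from the file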

This will tell you whether the issue is with the search or with the extraction.
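The same check can be wrapped in a small helper that also copes with a file set that has no extracted text at all. This is only a sketch: `FakeFile` and `FakeFileSet` are hypothetical stand-ins so it runs outside a Rails console, and `extraction_status` is a made-up name, not part of Hyrax or Sufia.

```ruby
# Hypothetical stand-ins for the real objects, so the sketch runs without Rails.
FakeFile    = Struct.new(:content)
FakeFileSet = Struct.new(:extracted_text)

# Returns :missing when no extracted text is attached (or it is blank),
# :found when the term occurs in it, and :not_found otherwise.
def extraction_status(file_set, term)
  text = file_set.extracted_text&.content
  return :missing if text.nil? || text.strip.empty?

  text.downcase.include?(term.downcase) ? :found : :not_found
end

with_text = FakeFileSet.new(FakeFile.new("\n\nThis PDF file was created using CutePDF.\n"))
no_text   = FakeFileSet.new(nil)

p extraction_status(with_text, "cutepdf")  # => :found
p extraction_status(no_text, "cutepdf")    # => :missing
```

In a console, the real file set from the snippet above could be passed in place of the stand-ins, since the helper only calls extracted_text and content.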

-- Carolyn

Kerchner, Daniel

Sep 25, 2017, 2:45:26 PM

to samver...@googlegroups.com

Thanks Carolyn, it's very helpful to now know where to see the extracted text in the model!

I think I hit on the issue as having to do with Solr reindexing.

First, I verified (by inspecting the model object) that the extracted text is in fact present in the model:

2.3.3 :014 > fs.extracted_text.content
=> "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMicrosoft Word - sample.pdf.docx\n\n\n \n\n \n\n \n\n \n\n \n\nThis PDF file was created using CutePDF. \n\nwww.cutepdf.com"

In reading a thread from a few months back (Janice, you'll remember this one)

https://groups.google.com/forum/#!searchin/samvera-tech/Reindex%7Csort:relevance/samvera-tech/mLTahubhGUg/0W4ySyq6EAAJ

I figured I'd try ActiveFedora::Base.reindex_everything , and that seemed to do the trick!

(Incidentally, I'm also wondering why I get an error when I try to use the solr:reindex rake task:

$ rake solr:reindex
rake aborted!
NameError: uninitialized constant ResolrizeJob
/home/ubuntu/.rvm/gems/ruby-2.3.3/gems/hyrax-1.0.4/lib/tasks/reindex.rake:4:in `block (2 levels) in <top (required)>'
/home/ubuntu/.rvm/gems/ruby-2.3.3/gems/rake-12.1.0/exe/rake:27:in `<top (required)>'
/home/ubuntu/.rvm/gems/ruby-2.3.3/bin/ruby_executable_hooks:15:in `eval'
/home/ubuntu/.rvm/gems/ruby-2.3.3/bin/ruby_executable_hooks:15:in `<main>'
Tasks: TOP => solr:reindex

)

Just for good measure, I verified that my .solr_wrapper file contains
collection:
  persist: true

so that wouldn't appear to be causing any issues.

Main question: Is there something that I need to be doing or setting for "vanilla" Hyrax to be invoking solr to reindex itself, or at least to index newly created documents (particularly the extracted text)? I thought that the create_derivatives job should already trigger a reindex if needed?

Thanks again!!

- Dan

Joseph Atzberger

Sep 25, 2017, 5:10:11 PM

to samver...@googlegroups.com

That doesn't look verified to me, but the opposite. That is the extracted text I would expect from an empty document: just the filename and the CutePDF tag.

--Joe

Kerchner, Daniel

Sep 25, 2017, 5:36:47 PM

to samver...@googlegroups.com

Just to be sure, I created another document with more recognizable text, and I think the text extraction looks okay. I also see the object in the Fedora repo for the extracted text (that has, among other type tags, http://pcdm.org/use#ExtractedText ).

In this new test case, I get:

2.3.3 :005 > fs.extracted_text
=> #<Hydra::PCDM::File uri="http://127.0.0.1:8984/rest/dev/gq/67/jr/16/gq67jr16q/files/4cde4b63-6d0a-4bd6-bbe5-ee09c941a230" >

2.3.3 :006 > fs.extracted_text.content
=> "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMicrosoft Word - Document1\n\n\nThis\tis\tthe\tbody\ttext\tof\ta\tnew\tMicrosoft\tWord\tdocument,\twhich\tI’m\tgoing\tto\tsave\tas\tPDF\tfor\t\ntesting\tpurposes.\t\tIt\tcontains\teasily\tidentifiable\tsearch\tterms\tsuch\tas\telephant,\tTesla,\tand\t\npersnickety."

I also verified that Jose's modification to catalog_controller.rb is in fact needed for the search to work.

Hopefully this example is a bit easier for verifying that the extraction is working. So, the remaining question I have is around why solr reindexing isn't happening automatically.

I do wonder whether my using
config.active_job.queue_adapter = :inline has something to do with it - perhaps a reindexing job is being queued but not executed. I'll try setting up Sidekiq and see if that resolves it.

- Dan

Scott Kushner

Sep 26, 2017, 4:31:08 PM

to samvera-tech

Is your pdf an image or ocr readable?

Kerchner, Daniel

Sep 28, 2017, 12:17:04 PM

to samver...@googlegroups.com

Hi Scott,

My PDF contains extractable text (i.e. it is not an image), and the extraction seems to have succeeded. The issue seems to be that solr reindexing is not occurring when a new document is uploaded, so I'm having to explicitly invoke solr reindexing using ActiveFedora::Base.reindex_everything.

- Dan

Scott Kushner

Sep 28, 2017, 12:46:16 PM

to samver...@googlegroups.com

That's strange.

The only thing I can think of: are fits.sh and redis-server both configured correctly?

--

Scott Kushner

Systems and Emerging Technologies Librarian

Saint Peter's University
The Jesuit University of New Jersey

Theresa & Edward O'Toole Library
99 Glenwood Ave.
Jersey City, New Jersey 07306

p: (201) 761-6456
f: (201) 761-6451

www.saintpeters.edu/library

janice.j...@gmail.com

Oct 25, 2017, 9:37:58 PM

to samvera-tech

Hi Dan,

I'm seeing this behavior now in my application (hyrax 1.0.4), though I'm fairly certain it was working for me before without having to reindex. I was wondering if you found out what was causing your issue. I'm hoping it'll help me out as well.

Janice

J Kim

Oct 30, 2017, 2:32:18 PM

to samvera-tech

Following up a little. I finally started using the Solr Admin UI to understand more about what I'm actually getting indexed.

There seems to be a difference in Solr when I compare a newly created FileSet (from a newly created work without running ActiveFedora::Base.reindex_everything) and an existing FileSet after a call to ActiveFedora::Base.reindex_everything.

Here's how I'm viewing all of my file sets:

http://localhost:8983/solr/hydra-development/select?fq=has_model_ssim:%22FileSet%22&indent=on&q=*:*&wt=json

The newly created file set is missing the following fields:

        "file_format_tesim":["pdf (Portable Document Format)"],
        "file_size_lts":1373814,
        "mime_type_ssi":"application/pdf",
        "digest_ssim":["urn:sha1:c68fc341e4cabead8e523cf8340c250d376cbb73"],
        "page_count_tesim":["23"],
        "file_title_tesim":["Thank your for completing the Innovation Survey.",
          "Local Disk"],
        "original_checksum_tesim":["8b5c0cc9db236cc61b9891ea59f8cd34"],

And, I'm wondering if maybe some processing is being skipped when I create my work. I'll keep posting as I find out more. (There is also the possibility that this is a self-inflicted wound, caused by my own changes.)
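A quick way to make this kind of comparison repeatable is to diff the field names of two Solr documents. The sketch below works on plain hashes, such as the entries under response.docs after parsing the JSON from the query above. The ids and the trimmed-down field lists are made up for illustration, and `missing_fields` is a hypothetical helper name.

```ruby
# Return the field names present in the reference document but absent from
# the candidate document.
def missing_fields(reference_doc, candidate_doc)
  reference_doc.keys - candidate_doc.keys
end

# Illustrative data only: a file set after reindex_everything versus a
# freshly created one. The ids are hypothetical.
reindexed = {
  "id"               => "abc123",
  "has_model_ssim"   => ["FileSet"],
  "mime_type_ssi"    => "application/pdf",
  "page_count_tesim" => ["23"],
  "file_size_lts"    => 1373814
}
fresh = {
  "id"             => "def456",
  "has_model_ssim" => ["FileSet"]
}

puts missing_fields(reindexed, fresh)
```

Running the comparison right after upload and again after a reindex shows which fields the normal ingest path is failing to write.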

Janice

J Kim

Nov 2, 2017, 3:03:42 PM

to samvera-tech

I feel like I'm close but not quite there. When I try to save a new work with an uploaded PDF file, I hit the FileSetIndexer's generate_solr_document function a bunch of times. On the second-to-last call, I can see my FileSet indexed properly via the Solr Admin UI, including the values that were previously missing. However, during the final call to generate_solr_document, object.extracted_text is nil, and the fields that should be stored in the solr document get wiped out. I need to find out what's making that final call to the FileSetIndexer, but my debugging skills are a bit lacking. If anyone has tips on how I can narrow down what's calling the FileSetIndexer, I'd appreciate it.

Janice

J Kim

Nov 3, 2017, 10:11:51 AM

to samvera-tech

I've found out where my FileSet solr index is getting messed up.

https://github.com/samvera/hyrax/blob/4afb5524a6c259d1312f88c157431e77350fcb5f/app/actors/hyrax/actors/file_set_actor.rb#L49

The AttachFilesToWorkJob calls FileSetActor's attach_file_to_work method, and the following line causes the solr index to get updated (which I don't understand fully... but anyway):

work.ordered_members<<file_set

In my case, the file_set.to_solr doesn't have all of the fields set anymore, so the index gets updated with an incomplete solr document for that FileSet, which explains why the pdf that I'm testing with doesn't get indexed and doesn't appear in the search.
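The effect is easier to see with a toy model. In Solr, adding a document whose id already exists replaces the whole document rather than merging fields (atomic updates aside), so the last writer wins. The sketch below imitates only that behaviour. `ToyIndex` is a hypothetical in-memory stand-in, not Solr or ActiveFedora code, and the documents are made-up examples.

```ruby
# A toy index keyed by id. As with a plain Solr add, writing a document with
# an existing id replaces the previous document entirely.
class ToyIndex
  def initialize
    @docs = {}
  end

  def add(doc)
    @docs[doc.fetch("id")] = doc
  end

  def find(id)
    @docs[id]
  end
end

index = ToyIndex.new

# Hypothetical documents for the same file set: one built when the extracted
# text was loaded, one built from a stale object where it was nil.
complete = { "id" => "fs1", "has_model_ssim" => ["FileSet"],
             "all_text_timv" => "elephant Tesla persnickety" }
stale    = { "id" => "fs1", "has_model_ssim" => ["FileSet"] }

index.add(complete)
index.add(stale)      # the later, incomplete write replaces the full document
p index.find("fs1").key?("all_text_timv")  # => false

index.add(complete)   # reindexing from a fully loaded object restores the field
p index.find("fs1").key?("all_text_timv")  # => true
```

This is why an explicit reindex afterwards repairs the search: it simply becomes the last write.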

Has anyone seen anything like this before? I haven't touched these jobs, so it's hard for me to think where I could have messed up the code to do this. However, it seems strange that more people aren't seeing this.

Also, I don't understand why the update to the solr index for the FileSet is happening right there anyway. I would expect that at that point, you already have the FileSet indexed in Solr, and that you only want the work to get updated with a pointer to the FileSet.

Janice

J Kim

Nov 3, 2017, 2:15:37 PM

to samvera-tech

I talked with LaRita in the dev slack channel, and she helped me understand things a little better. I have at least something that works now, but I can't help feeling like there's a better solution out there somewhere.

I ended up going into my generated works controller and adding the following two lines to both my create and update methods:

curation_concern.reload #Get extracted_text loaded properly for each of the curation_concern.members
curation_concern.members.each{|member| member.update_index} # Reindex Solr

What I was finding was that, in FileSetActor's attach_file_to_work, the call to work.reload (link below) wasn't loading the extracted_text. So, when the solr index got updated a few lines later, the FileSet wasn't indexed properly.

https://github.com/samvera/hyrax/blob/4afb5524a6c259d1312f88c157431e77350fcb5f/app/actors/hyrax/actors/file_set_actor.rb#L47

I feel like the solution is that the extracted_text needs to be saved somewhere before we get to the point above, but I don't know where to do that yet. For now, I'm going with my workaround until I have more time to look further. I still don't know if I did something to introduce this bug, or if it existed beforehand. Anyway, enough rambling.

lrob...@nd.edu

Nov 8, 2017, 11:26:39 AM

to samvera-tech

I think that this is just a symptom of how indexing works in general. I'm planning to add documentation on indexing to our github.io site when I get time, and I'm hoping to come up with a better solution as I work on documenting what I know and digging into a few pieces that I don't know as part of that documentation.

CAROLYN ANN COLE

Nov 8, 2017, 11:51:56 AM

to samver...@googlegroups.com

Hi!

If you are in byebug, the ‘where’ command shows the backtrace so you can see who is calling the method. You can also use the ‘up’ command to go up the stack.
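Outside a debugger, Ruby's built-in Kernel#caller gives similar information and can be logged temporarily from inside the method in question. The sketch below is self-contained: `DemoIndexer` and `attach_file_to_work_demo` are hypothetical stand-ins, and only the use of `caller` is the point.

```ruby
# Hypothetical stand-in for an indexer; only the `caller` usage matters here.
class DemoIndexer
  attr_reader :last_callers

  def generate_solr_document
    # caller(1, 5) returns up to five stack frames, starting with the frame
    # that called this method.
    @last_callers = caller(1, 5)
    warn "generate_solr_document called from:\n  #{@last_callers.join("\n  ")}"
    {}
  end
end

# Hypothetical calling method, so there is a named frame to find in the output.
def attach_file_to_work_demo(indexer)
  indexer.generate_solr_document
end

indexer = DemoIndexer.new
attach_file_to_work_demo(indexer)
```

Each logged line names the file, line number, and method of one frame, so repeated calls to the same method can be told apart by who made them.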

-- Carolyn

J Kim

Nov 8, 2017, 12:00:27 PM

to samvera-tech

That would be awesome. Thanks!

J Kim

Nov 8, 2017, 12:01:45 PM

to samvera-tech

Geez. Somehow, I missed that from the documentation. I was stepping and continuing like an idiot trying to find out where I was. Thanks! It'll be helpful going forward.
