Kia ora,
I’ve been doing some data tidying in DSpace 5.8 (xmlui) in preparation for an upcoming migration to 7.4 – mostly directly in the database. A few days later I was alerted to a record https://researcharchive.lincoln.ac.nz/handle/10182/16202 which isn’t showing up either by searching on the title, or in the title/author/keyword browse indexes. The item has Anonymous READ permissions (and anyway the search/browse still doesn’t work when I’m logged in as an Administrator) so I assumed this was because I’d been lazy and neglected to run a re-index.
So overnight we ran a job [dspace]/bin/dspace index-discovery -b
expecting this would resolve the issue. But we’re still seeing the same problem.
Is there anything else that could be blocking it from being indexed?
Any other jobs we should run?
If I throw my hands up in despair and just go ahead with the migration, will that magically fix it? (This is not actually my preference for various reasons, but some days a little magic would be nice!)
Deborah
––––––––––––––––––––––––––––––––––
Deborah Fitchett (she/her) MLIS, RLIANZA
Associate University Librarian, Digital Scholarship
––––––––––––––––––––––––––––––––––
Learning, Teaching and Library – Te Whare Pūrākau
PO Box 85064, Lincoln University
Lincoln 7647, Christchurch, New Zealand
––––––––––––––––––––––––––––––––––
Lincoln University
Te Whare Wānaka o Aoraki
––––––––––––––––––––––––––––––––––
Thanks very much, Tim!
I’ve checked permissions for item/bundles/bitstreams are all Anon READ. The metadata looks normal too including the Really Important fields like dc.type.
When I try index-discovery -i [itemid] I get "Unrecognized option: -i"
But the DSpace log from when we ran index-discovery -b shows:
2023-08-03 02:29:40,261 ERROR org.dspace.discovery.SolrServiceImpl @ Error while writing item to discovery index: 10182/16202 message:org.apache.tika.exception.TikaException: Failed to parse an email message
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: org.apache.tika.exception.TikaException: Failed to parse an email message
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.dspace.discovery.SolrServiceImpl.writeDocument(SolrServiceImpl.java:748)
at org.dspace.discovery.SolrServiceImpl.buildDocument(SolrServiceImpl.java:1429)
at org.dspace.discovery.SolrServiceImpl.indexContent(SolrServiceImpl.java:230)
at org.dspace.discovery.SolrServiceImpl.updateIndex(SolrServiceImpl.java:410)
at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
at org.dspace.discovery.IndexClient.main(IndexClient.java:117)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
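In case it helps anyone else checking whether other items failed the same way, here is a rough Python sketch that pulls the affected handles out of the log. It assumes the ERROR line format shown in the excerpt above (other DSpace versions may word it differently), and the function name find_failed_items is just something made up for illustration.

```python
import re
from collections import defaultdict

# Pattern based on the single ERROR line quoted above. This is an assumption:
# the wording may differ in other DSpace versions or logging configurations.
FAILED_ITEM = re.compile(
    r"ERROR\s+org\.dspace\.discovery\.SolrServiceImpl\s+@\s+"
    r"Error while writing item to discovery index:\s+(?P<handle>\S+)\s+"
    r"message:(?P<message>.*)"
)


def find_failed_items(lines):
    """Return {handle: [error messages]} for items that failed to index.

    `lines` is any iterable of log lines, e.g. an open file object.
    (Hypothetical helper, not part of DSpace.)
    """
    failures = defaultdict(list)
    for line in lines:
        match = FAILED_ITEM.search(line)
        if match:
            failures[match.group("handle")].append(match.group("message").strip())
    return dict(failures)
```

Passing an open log file to find_failed_items (for example the dspace.log for the night the re-index ran) should give one entry per handle that Solr refused, along with the message for each failure.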
We found a Tika troubleshooting page ("Troubleshooting Tika", on the Apache Software Foundation site), and it looks like for some reason Tika thinks it’s supposed to be parsing an email message. This was utterly bewildering because the bitstream files are just regular PDFs: they have PDF file extensions, the format is marked as Adobe PDF in DSpace, and they open successfully as PDFs in the browser/Adobe Reader…
but then I looked at the text that had been extracted for the search index and found that in each of the problem cases it begins with something like:
Received: 22 June 2022 | Revised: 16 April 2023 | Accepted: 26 April 2023
This refers to when the journal first received the submitted article, but I guess Tika is interpreting the “Received:” as the start of an email header!
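To find other items at risk, something like the Python sketch below could flag extracted text whose first line starts like an email header. To be clear, this is only a rough approximation of the idea and not Tika's actual detection logic; the list of header names and the function name looks_like_email_header are my own assumptions for illustration.

```python
import re

# Header names picked for illustration only. This is NOT Tika's real
# detection rule, just a guess at the kind of prefix that could trigger it.
EMAIL_HEADER_START = re.compile(
    r"^(Received|Return-Path|Message-ID|Delivered-To):", re.IGNORECASE
)


def looks_like_email_header(extracted_text):
    """Return True if the first non-blank line starts like an email header field."""
    for line in extracted_text.splitlines():
        if line.strip():
            # Only the first non-blank line is checked, since the problem
            # texts all had "Received:" right at the start.
            return bool(EMAIL_HEADER_START.match(line.lstrip()))
    return False
```

Run over the extracted-text bitstreams, this would flag the example line above, but not an article where "Received:" only appears further down the first page.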
Fortunately we can see in our DSpace 7 dev environment this issue isn’t arising, so we’ll just ignore the issue until we can complete our upgrade.
Deborah
From: DSpace Technical Support <dspac...@googlegroups.com>
Sent: Saturday, August 5, 2023 5:08 AM
To: DSpace Technical Support <dspac...@googlegroups.com>
Subject: [dspace-tech] Re: Item not showing in search/browse