GSearch not indexing certain documents


biraj gauli

Aug 21, 2011, 4:59:27 PM
to islandora
I am having problems indexing with GSearch. I have two simple content
models representing collections and items: agecon:collectionCModel
and agecon:itemCModel. Since we had to migrate a lot of old
collections and items, we transformed the old data to collection
FOXML and item FOXML and ingested it with the Fedora ingest utility.
New ingests, however, are handled through Islandora. Everything seems
to be working well, except that GSearch does not index the records of type
agecon:itemCModel that were ingested directly through the utility. The
collections that were ingested through the utility are indexed.
Both collections and items that are ingested through Islandora are
indexed. When I try updateIndex fromPid in the GSearch web client
on one of the unindexed items, I get this error:

Connection error (is Solr running at http://localhost:8080/solr/update
?): java.io.IOException: Server returned HTTP response code: 400 for
URL: http://localhost:8080/solr/update

Is anybody familiar with this issue? I don't know why I am getting an
IOException when I can find that object through the Fedora web
interface.

Paul Pound

Aug 22, 2011, 8:16:14 AM
to isla...@googlegroups.com
It seems the XML being sent to Solr is not valid; an HTTP 400 from the update URL means Solr rejected the request, not that it is down. This can happen if you try to index a PDF that has certain whitespace or control characters embedded in its text (a few characters are valid UTF-8 but not valid XML). There are workarounds for this. It could also be that the XSLT is generating invalid XML.
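To illustrate the "valid UTF-8 but not valid XML" point, here is a minimal Python sketch of one possible workaround. It is not code from GSearch or Islandora; the function names `find_invalid_xml_chars` and `strip_invalid_xml_chars` are made up for this example. It assumes you already have the extracted PDF text as a string and want to see, or remove, the characters that the XML 1.0 specification does not allow.

```python
import re

# Characters permitted by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Anything else (for example form feed U+000C or NUL U+0000) makes a
# document not well-formed, even though it encodes fine as UTF-8.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, code point) pairs for characters XML 1.0 forbids."""
    return [
        (match.start(), "U+%04X" % ord(match.group()))
        for match in _INVALID_XML_CHARS.finditer(text)
    ]


def strip_invalid_xml_chars(text, replacement=" "):
    """Replace every XML-invalid character with `replacement`."""
    return _INVALID_XML_CHARS.sub(replacement, text)


if __name__ == "__main__":
    # Hypothetical extracted text: a form feed between pages and a stray NUL.
    sample = "page one\x0cpage two\x00"
    print(find_invalid_xml_chars(sample))
    print(repr(strip_invalid_xml_chars(sample)))
```

Running the finder on the text extracted from a failing object's PDF would show whether such characters are present before you decide to filter them or change the XSLT.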

If you have an item that fails indexing and has a PDF, I would try removing the PDF stream and then reindexing to see if the metadata gets indexed (maybe make a test object with a PDF from an object that currently does not get indexed). If it does, then we can work on getting the PDFs indexed.

Thanks,
Paul


Serhiy Polyakov

Aug 22, 2011, 5:35:29 PM
to isla...@googlegroups.com
I had the same indexing problem with PDFs containing "unusual"
characters. Installing the latest GSearch solved it; the latest
GSearch includes a newer PDFBox.

Serhiy
