Re: [islandora-dev] text extraction causing illegal characters in FULL_TEXT datastream

Peter Murray

Apr 10, 2013, 3:10:58 PM
to island...@googlegroups.com
Ah, good find Gail! I went down this path last month and solved it by making a change to GSearch to exclude characters that XML considered invalid. The thread with Gert is here:

http://thread.gmane.org/gmane.comp.cms.fedora-commons.user/8433


Peter

On Apr 10, 2013, at 2:43 PM, Gail Lewis <glew...@gmail.com> wrote:
>
> We encountered a problem with FULL_TEXT datastreams that contain text extracted from a PDF.
>
> Some of our PDF objects have been generating this error when we attempt to re-index:
> SEVERE: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0xc) not a valid XML character
>
> That's a form feed character. It's one of a number of characters illegal in XML that will break the indexing process.
>
> If I remove the part of the gsearch configuration that indexes FULL_TEXT datastreams and let the rest of the fields through, the object indexes fine. So it appears that there are illegal characters in the FULL_TEXT datastream.
>
> I took a look at the PDF module and it is using /usr/bin/pdftotext to extract text from the PDF. I don't see any processing of the text for illegal characters. It goes straight into the FULL_TEXT datastream.
>
> There's a parameter for pdftotext: -nopgbrk Don't insert page breaks (form feed characters) between pages.
>
> By default, pdftotext inserts form feed characters.
>
> So I modified a line to add the -nopgbrk parameter in the PDF module, file islandora_solution_pack_pdf/includes/derivatives.inc
>
> $executable = variable_get('islandora_pdf_path_to_pdftotext', '/usr/bin/pdftotext');
> $temp = drupal_tempnam("temporary://", "fulltext") . '.txt';
> $derivative_file_uri = drupal_realpath($temp);
> $command = "$executable -nopgbrk $source $derivative_file_uri";
> exec($command, $execout, $returncode);
>
> Since then, newly created PDF objects are indexing correctly.
>
> The code should probably be modified to exclude more possible illegal characters. For gsearch/solr to be happy, that's control characters 0-31, except 9, 10, and 13 (in decimal).
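
A minimal sketch of that broader filtering, assuming the cleanup happens in PHP after pdftotext has written its output. The function name strip_xml_illegal_chars() is hypothetical and not part of the Islandora PDF module. It replaces each forbidden control character with a space so word boundaries survive.

```php
<?php
/**
 * Replace control characters that XML 1.0 forbids with a space.
 * Hypothetical helper for illustration; not from the Islandora PDF module.
 */
function strip_xml_illegal_chars($text)
{
    // Matches bytes 0-31 except tab (9), line feed (10) and carriage return (13).
    // These byte values never occur inside a multi-byte UTF-8 sequence,
    // so a byte-wise match is safe for UTF-8 text.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $text);
}

// Assumed placement: right after the exec() call shown above, e.g.
// $text = file_get_contents($derivative_file_uri);
// file_put_contents($derivative_file_uri, strip_xml_illegal_chars($text));
```

XML 1.0 also excludes U+FFFE, U+FFFF and unpaired surrogates. This sketch only covers the control-character range mentioned above.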


--
Peter Murray
Assistant Director, Technology Services Development
LYRASIS
Peter....@lyrasis.org
+1 678-235-2955
800.999.8558 x2955


Gail Lewis

Apr 10, 2013, 4:44:48 PM
to island...@googlegroups.com
I didn't think to get into the Gsearch code. In your thread, I see Gsearch is spacing out the illegal characters at the end of getTextFromPDF(). But the FULL_TEXT datastream is going through getTextFromText(), which isn't doing that. I wondered why the illegal characters were getting through getDatastreamText.

Peter Murray

Apr 10, 2013, 5:45:19 PM
to island...@googlegroups.com
On Apr 10, 2013, at 4:44 PM, Gail Lewis <glew...@gmail.com> wrote:
>
> I didn't think to get into the Gsearch code. In your thread, I see Gsearch is spacing out the illegal characters at the end of getTextFromPDF(). But the FULL_TEXT datastream is going through getTextFromText(), which isn't doing that. I wondered why the illegal characters were getting through getDatastreamText.

The illegal characters are stored in the FULL_TEXT datastream, which is just the raw, non-XML output of pdftotext. So they are already in a datastream, and the problem occurs when that datastream is, in effect, copied into the Solr document. The default foxmlToSolr.xslt uses Tika to pull content from arbitrary datastreams into the Solr document:

https://github.com/lyrasis/gsearch/blob/lyr-master/FgsConfig/FgsConfigIndexTemplate/Solr/foxmlToSolr.xslt#L84-114

At least that is how I understand the flow…
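
To see why a single form feed is enough to break that copy step, here is a small sketch that wraps text in an XML element and asks a parser whether the result is well-formed. It uses PHP's DOMDocument (libxml2) rather than the Woodstox parser from the error message, and parses_as_xml_text() is a made-up name for illustration.

```php
<?php
/**
 * Report whether $text can be embedded as element content in an XML 1.0 document.
 * Hypothetical helper for illustration only.
 */
function parses_as_xml_text($text)
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    // Escape &, < and > so that only illegal characters can cause a failure.
    $escaped = htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8');

    $doc = new DOMDocument();
    $ok = $doc->loadXML('<field>' . $escaped . '</field>');

    // Restore the previous error-handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok;
}

var_dump(parses_as_xml_text("page one\npage two"));    // line feed is legal
var_dump(parses_as_xml_text("page one\x0Cpage two"));  // form feed is not
```

Running something like this over the FULL_TEXT content before re-indexing should flag the same objects that fail in GSearch, assuming control characters are the only problem.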


Peter