Ah, good find, Gail! I went down this path last month and solved it by changing GSearch to exclude characters that XML considers invalid. The thread with Gert is here:
http://thread.gmane.org/gmane.comp.cms.fedora-commons.user/8433
Peter
On Apr 10, 2013, at 2:43 PM, Gail Lewis <glew...@gmail.com> wrote:
>
> We encountered a problem with FULL_TEXT datastreams that contain text extracted from a PDF.
>
> Some of our PDF objects have been generating this error when we attempt to re-index:
> SEVERE: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0xc) not a valid XML character
>
> That's a form feed character. It's one of a number of characters illegal in XML that will break the indexing process.
>
> If I remove the part of the gsearch configuration that indexes FULL_TEXT datastreams and let the rest of the fields through, the object indexes fine. So it appears that there are illegal characters in the FULL_TEXT datastream.
>
> I took a look at the PDF module and it is using /usr/bin/pdftotext to extract text from the PDF. I don't see any processing of the text for illegal characters. It goes straight into the FULL_TEXT datastream.
>
> There's a parameter for pdftotext:
>
>   -nopgbrk   Don't insert page breaks (form feed characters) between pages.
>
> By default, pdftotext inserts form feed characters.
>
> So I modified a line in the PDF module to add the -nopgbrk parameter, in the file islandora_solution_pack_pdf/includes/derivatives.inc:
>
> $executable = variable_get('islandora_pdf_path_to_pdftotext', '/usr/bin/pdftotext');
> $temp = drupal_tempnam("temporary://", "fulltext") . '.txt';
> $derivative_file_uri = drupal_realpath($temp);
> $command = "$executable -nopgbrk $source $derivative_file_uri";
> exec($command, $execout, $returncode);
>
> Since then, newly created PDF objects are indexing correctly.
>
> The code should probably be modified to exclude the other possible illegal characters as well. For gsearch/solr to be happy, that means stripping control characters 0-31, except 9 (tab), 10 (line feed), and 13 (carriage return), all in decimal.
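
A minimal sketch of that broader filter, in PHP to match the module code quoted above. It is not the module's actual code and not the GSearch change from the linked thread. The function name strip_xml_invalid_control_chars() is hypothetical, and the sketch assumes the extracted text is UTF-8 (or another ASCII-compatible encoding), so that bytes below 0x20 only ever occur as real control characters.

```php
<?php
/**
 * Hypothetical helper, not part of the Islandora PDF module.
 *
 * Removes control characters 0-31 (decimal) except 9 (tab),
 * 10 (line feed) and 13 (carriage return).
 */
function strip_xml_invalid_control_chars($text) {
  // Byte ranges removed: 0x00-0x08, 0x0B, 0x0C (form feed), 0x0E-0x1F.
  // No /u modifier is needed: UTF-8 multibyte sequences never contain
  // bytes below 0x80, so only genuine control characters can match.
  return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $text);
}

// Form feed (0x0C) and NUL (0x00) are dropped; tab and newline survive.
$clean = strip_xml_invalid_control_chars("page 1\x0Cpage\t2\x00\n");
```

One place this could be applied is right after exec() returns in the quoted snippet: read $derivative_file_uri with file_get_contents(), filter the text, and write it back with file_put_contents() before it is stored as the FULL_TEXT datastream. That placement is an assumption and has not been checked against the rest of derivatives.inc. Unlike -nopgbrk, which only suppresses the form feeds pdftotext adds itself, this would also catch control characters that come from the PDF's own text.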
--
Peter Murray
Assistant Director, Technology Services Development
LYRASIS
Peter....@lyrasis.org
+1 678-235-2955
800.999.8558 x2955