Hello,

I keep getting PDF parsing exceptions when processing a Nutch crawlDB with Tika:

2013-02-04 21:09:08,293 WARN org.apache.pdfbox.pdfparser.BaseParser: Invalid dictionary, found: '�' but expected: '/'
2013-02-04 21:09:08,293 WARN org.apache.pdfbox.pdfparser.XrefTrailerResolver: Did not found XRef object at specified startxref position 0
2013-02-04 21:09:24,428 WARN org.apache.pdfbox.pdfparser.PDFParser: Parsing Error, Skipping Object
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@134683c0
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:597)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)

I use the following command to run the Tika processing:

/home/nutch/hadoop/bin/hadoop jar /home/nutch/hadoop/behemoth-tika-*-job.jar com.digitalpebble.behemoth.tika.TikaDriver -D tika.convert.markup=false -i /dmozSportDepth2_behemothCorpus -o /dmozSportDepth2_tika -m text/html

I thought that setting the --mimeType parameter to "text/html" would force parsing of HTML documents only, but PDFs are still parsed. I suppose some of them are corrupted, which is why I get the exceptions and the job eventually fails.

Is there any way to skip PDF documents (even though they exist in the Nutch crawlDB) when processing with Tika? Or is there a way to disable only PDF parsing in the Tika plugin configuration?

Thanks a lot
Some modules, such as the Nutch one in IO or the CommonCrawl one, generate BehemothDocuments with a mimetype, which is the one set by the web server. This value can be overridden by the one Tika guesses (though not by default); otherwise Tika uses it as a clue to determine which parser to use. So yes, you can filter on the mimetype before calling the Tika module.
Julien
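
For illustration only, here is a minimal sketch of the kind of mimetype check such a pre-filtering step could apply before documents reach the parser. It does not use the Behemoth or Tika APIs: the Doc class, the matches and keepOnly helpers, and the example URLs are all hypothetical stand-ins, and the real filtering would be done with whatever mechanism Behemoth provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class MimeTypeFilter {

    /** Hypothetical stand-in for a crawled document; NOT the BehemothDocument API. */
    public static final class Doc {
        final String url;
        final String mimeType; // as declared by the web server, may be null

        public Doc(String url, String mimeType) {
            this.url = url;
            this.mimeType = mimeType;
        }
    }

    /**
     * Returns true if the declared type equals the wanted type, ignoring
     * case and any parameters such as "; charset=UTF-8".
     */
    public static boolean matches(String declared, String wanted) {
        if (declared == null) {
            return false; // no declared type: treat as "do not keep"
        }
        String base = declared.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return base.equals(wanted.toLowerCase(Locale.ROOT));
    }

    /** Keeps only the documents whose declared type matches the wanted type. */
    public static List<Doc> keepOnly(List<Doc> docs, String wanted) {
        List<Doc> kept = new ArrayList<>();
        for (Doc d : docs) {
            if (matches(d.mimeType, wanted)) {
                kept.add(d);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical corpus: one HTML page, one PDF, one untyped document.
        List<Doc> corpus = Arrays.asList(
                new Doc("http://example.com/index.html", "text/html; charset=UTF-8"),
                new Doc("http://example.com/report.pdf", "application/pdf"),
                new Doc("http://example.com/unknown", null));

        for (Doc d : keepOnly(corpus, "text/html")) {
            System.out.println(d.url);
        }
    }
}
```

Note that a check like this only looks at the type the web server declared, so a corrupted PDF served with a text/html header would still get through.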