Re: Skip Tika pdf parsing


DigitalPebble

Feb 4, 2013, 3:37:20 PM2/4/13
to digita...@googlegroups.com
Hi,

The -m parameter in the Tika module is used to force the mime-type, which is given as a clue to Tika. It won't prevent documents from being parsed.
The easiest way to do what you are after is to use the CorpusFilter command (see https://github.com/DigitalPebble/behemoth/wiki/Core-module) with a positive filter like -D document.filter.mimetype.keep=text/html

There is no negative filter for the mimetypes in https://github.com/DigitalPebble/behemoth/blob/master/core/src/main/java/com/digitalpebble/behemoth/DocumentFilter.java but we could add one. You can then call the tika command on the filtered corpus.
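A minimal sketch of that two-step sequence, filtering first and then running Tika on the filtered corpus. The jar names, HDFS paths, and the CorpusFilter class name are assumptions mirroring the naming used elsewhere in this thread and the Core-module wiki, not verified against a particular Behemoth release:

```shell
# Hypothetical sketch: keep only documents whose mimetype is text/html,
# then run Tika on the filtered corpus. Jar names and paths are examples.
hadoop jar behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusFilter \
  -D document.filter.mimetype.keep=text/html \
  -i /corpus -o /corpus_html

hadoop jar behemoth-tika-*-job.jar com.digitalpebble.behemoth.tika.TikaDriver \
  -D tika.convert.markup=false \
  -i /corpus_html -o /corpus_tika
```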

Note that some modules do filter their outputs based on the filtering parameters but this is not the case for the Nutch importer yet.

Another option is to change the parameter in Hadoop so that the MapReduce jobs skip failed entries; I can't remember what it is called, but it should be easy to find out.
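For the record, the skip-failed-entries feature referred to here is Hadoop's skip-bad-records mode, controlled by the mapred.skip.* properties in Hadoop 1.x. A sketch of passing them to the Tika job (jar name, paths, and values are illustrative):

```shell
# Hypothetical sketch: enable Hadoop's skip-bad-records mode so the job
# skips records that repeatedly crash the mapper instead of failing.
# mapred.skip.attempts.to.start.skipping: failed attempts before skipping starts
# mapred.skip.map.max.skip.records: max records to skip around a bad one
hadoop jar behemoth-tika-*-job.jar com.digitalpebble.behemoth.tika.TikaDriver \
  -D mapred.skip.attempts.to.start.skipping=2 \
  -D mapred.skip.map.max.skip.records=1 \
  -i /corpus -o /corpus_tika
```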

Julien



On 4 February 2013 18:41, skillptor <skil...@gmail.com> wrote:
Hello,
I am constantly getting PDF parsing exceptions when processing the Nutch crawlDB with Tika:

2013-02-04 21:09:08,293 WARN org.apache.pdfbox.pdfparser.BaseParser: Invalid dictionary, found: '�' but expected: '/'
2013-02-04 21:09:08,293 WARN org.apache.pdfbox.pdfparser.XrefTrailerResolver: Did not found XRef object at specified startxref position 0
2013-02-04 21:09:24,428 WARN org.apache.pdfbox.pdfparser.PDFParser: Parsing Error, Skipping Object
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@134683c0
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:597)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)

I use the following command to run Tika processing:

/home/nutch/hadoop/bin/hadoop jar /home/nutch/hadoop/behemoth-tika-*-job.jar com.digitalpebble.behemoth.tika.TikaDriver -D tika.convert.markup=false -i /dmozSportDepth2_behemothCorpus -o /dmozSportDepth2_tika -m text/html

I thought that setting the --mimeType parameter to "text/html" would force parsing of only HTML documents, but PDFs are still parsed. I suppose some of them are corrupted, which is why I get an exception and the job finally fails.

Can you please tell me if there is any way to skip PDF document processing with Tika (even though the PDFs exist in the Nutch crawlDB)? Or maybe there is a way to disable only PDF parsing in the Tika plugin configuration?

Thanks a lot


--
 
Open Source Solutions for Text Engineering
 
http://digitalpebble.blogspot.com
http://www.digitalpebble.com

glmVSV

Feb 4, 2013, 3:44:46 PM2/4/13
to digita...@googlegroups.com, jul...@digitalpebble.com
Is it possible to use this filter before the Tika step?
I thought that the Tika module determines the mimetype of the document, so the CorpusFilter command would have to be applied after the Tika step...

skillptor

Feb 5, 2013, 1:47:21 PM2/5/13
to digita...@googlegroups.com, jul...@digitalpebble.com
Hi Julien.
Thanks a lot for your great advice. I ran the Tika processing with the following command (skipping failed entries):

/home/nutch/hadoop/bin/hadoop jar /home/nutch/hadoop/behemoth-tika-*-job.jar com.digitalpebble.behemoth.tika.TikaDriver -D tika.convert.markup=false -D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1 -i /dmozSportDepth2_behemothCorpus -o /dmozSportDepth2_tika -m text/html

And my Tika processing job completed successfully. Thanks again.

On Monday, 4 February 2013, 23:37:20 UTC+3, DigitalPebble wrote:

DigitalPebble

Feb 5, 2013, 3:30:37 PM2/5/13
to digita...@googlegroups.com

Some modules, like the Nutch one in IO or the CommonCrawl one, generate BehemothDocuments with a mimetype, which is the one set by the webservers. This can be overridden by the value guessed by Tika (not by default), or is otherwise used by Tika as a clue to determine which parser to use. So yes, you can filter on the mimetype prior to calling the Tika module.

Julien

DigitalPebble

Feb 5, 2013, 3:40:32 PM2/5/13
to digita...@googlegroups.com
Hi, 

Glad you managed to get it to work. I have added a new issue https://github.com/DigitalPebble/behemoth/issues/42 as a reminder that I need to add a negative filter for the mimetypes.

BTW, I see you are doing Nutch + Tika on Behemoth. Are you using Tika only to retrieve the text from the binary content? If so, you could modify the code in the NutchSegmentConverterJob to get the text from Nutch as well, so that if the segment's been parsed (e.g. with Tika) you wouldn't have to reparse it in Behemoth. Unless, of course, you want to store the markup as annotations for further processing.

J.