Crawler issue


Bilford

Aug 17, 2009, 2:55:40 PM
to hounder
I have an issue with the crawler that is probably an easy fix, but I'm
not sure where to look. I keep getting:

java.lang.IllegalArgumentException: Adding text to an XML document must not be null


in $BASEDIR/crawler/log/hounder.log over and over and over. It seems
to be only for PDF files. Is there something I have to do to turn on
PDF indexing?

Thanks in advance,


Billford

Jorge Handl

Aug 18, 2009, 12:50:58 PM
to hou...@googlegroups.com
Bill, you can either filter out PDF files through the regex-urlfilter.txt file or add the PDF parser to the plugin.includes property in the nutch-site.xml file.
- Jorge
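
For anyone landing on this thread later, here is a rough sketch of the second option. It assumes the Nutch version bundled with Hounder ships a parse-pdf plugin, and the plugin list shown is only an illustrative example, not Hounder's actual default: keep whatever value you already have and just add "pdf" to the parse-(...) group.

```xml
<configuration>
  <!-- nutch-site.xml sketch. The value below is an assumed example list;
       the only relevant change is "pdf" inside the parse-(...) group. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  </property>
</configuration>
```

For the first option, an exclusion rule along the lines of -\.(pdf|PDF)$ in regex-urlfilter.txt should keep the crawler from fetching PDF URLs at all, provided it appears before any catch-all "+" rule, since the first matching pattern decides.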

Bilford

Aug 18, 2009, 3:41:19 PM
to hounder
Thanks Jorge

On Aug 18, 12:50 pm, Jorge Handl <jha...@gmail.com> wrote:
> Bill, you can either filter PDF file through the regex-urlfilter.txt file or
> add the parser in the plugin.includes property in the nutch-site.xml file.
> - Jorge
>