I tried setting the fetcher.parse property in conf/nutch-site.xml to
"true". That makes Hounder download all content types, including
images and mp3 files, but web pages are no longer fetched.
I understand that Hounder uses the Nutch fetcher to fetch the data
into a segment directory. After the data is fetched, Hounder processes
the content: it parses the HTML, runs the modules, adds the
out-links to the next PageDB, etc. I would like to modify the out-links
before they are added to the PageDB.
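
To make the goal concrete, the kind of out-link rewriting I have in mind
looks roughly like the sketch below. It is plain Java and is not tied to
the Hounder module interface or the Nutch plugin API; the class and
method names (OutlinkRewriter, rewrite) are made up for illustration, and
the rule itself (keep only http/https links and strip the #fragment) is
just an example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hounder or Nutch.
public class OutlinkRewriter {

    // Returns a cleaned copy of the out-links: non-http(s) links and
    // unparseable strings are dropped, and any #fragment is removed.
    static List<String> rewrite(List<String> outlinks) {
        List<String> result = new ArrayList<>();
        for (String link : outlinks) {
            try {
                URI uri = new URI(link);
                String scheme = uri.getScheme();
                boolean isWeb = scheme != null
                        && (scheme.equalsIgnoreCase("http")
                            || scheme.equalsIgnoreCase("https"));
                if (!isWeb) {
                    continue; // e.g. mailto: or relative links
                }
                // Rebuild the URI without its fragment component.
                URI cleaned = new URI(scheme, uri.getAuthority(),
                        uri.getPath(), uri.getQuery(), null);
                result.add(cleaned.toString());
            } catch (URISyntaxException e) {
                // Skip links that are not valid URIs.
            }
        }
        return result;
    }
}
```

Whichever of the two approaches is workable, the rewritten list is what
I would want to end up in the next PageDB instead of the original
out-links.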
Would it be possible to do this by either:
1) Modifying the fetched HTML content using a Nutch plugin before it
is written into the segment directory
OR
2) Modifying the out-links using a module configured in
crawler.properties?
Thanks.
On Aug 31, 10:13 pm, Jorge Handl <jha...@gmail.com> wrote:
> Try setting the fetcher.parse property in conf/nutch-site.xml to "true".
> - Jorge