Customized HTML Parsing


B R

Aug 31, 2009, 9:59:24 AM
to hounder
Hi Jorge,

I need to customize the parsing of HTML wherein I modify some elements
in the HTML DOM, as soon as the page is fetched, before they can be
processed further. I tried adding the customized parse-html plugin to
the nutch-site.xml as

<property>
<name>plugin.includes</name>
<value>protocol-http|scoring-opic|parse-html</value>
</property>

However, it does not seem to work. Could you guide me on how to
configure it?

Thanks.

Jorge Handl

Aug 31, 2009, 1:13:04 PM
to hou...@googlegroups.com
Try setting the fetcher.parse property in conf/nutch-site.xml to "true".
- Jorge

B R

Sep 2, 2009, 6:44:24 AM
to hounder
I tried setting the fetcher.parse property in conf/nutch-site.xml to
"true". That caused Hounder to download all content types, including
images and mp3 files, but web pages stopped being fetched.

I understand that Hounder uses the Nutch fetcher to fetch the data
into a segment directory. After the data is fetched, Hounder processes
the content - parses the HTML content, runs the modules, adds the
outlinks to the next PageDB, etc. I wish to modify the out-links
before they are added to the PageDB.

Would it be possible to do this by either:
1) Modifying the fetched HTML content using a Nutch plugin before it
is written into the segment directory
OR
2) Modifying the out-links using a module configured in
crawler.properties?

Thanks.

On Aug 31, 10:13 pm, Jorge Handl <jha...@gmail.com> wrote:
> Try setting the fetcher.parse property in conf/nutch-site.xml to "true".
> - Jorge
>

Jorge Handl

Sep 2, 2009, 4:16:47 PM
to hou...@googlegroups.com
Using the Nutch HTML parser seems to mess with the outlinks; in my tests I don't get any outlinks at all.
So I'd add the url editing code in the com.flaptor.util.parser.HtmlParser class.

- Jorge

Jorge Handl

Sep 3, 2009, 3:13:09 PM
to hounder
After thinking a bit more about this, I realized you can write a
Crawler module that changes the links before they are added to the new
pagedb. I haven't tried it, but something like this should work:

for (Link link : fetchdocument.getLinks()) {
    String url = link.getUrl();
    // ... modify the url in some way
    link.setUrl(url);
}
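
As a slightly fuller sketch of the same loop: the block below is self-contained and does not use the real Hounder classes. The Link class is only a stand-in exposing the getUrl/setUrl calls shown above, and the rewrite rule (dropping a "sessionid" query parameter) as well as the names OutlinkRewriter, stripSessionId and rewriteAll are assumptions made for illustration. How such a loop gets wired into an actual Crawler module is not covered in this thread.

```java
import java.util.ArrayList;
import java.util.List;

class OutlinkRewriter {

    /** Stand-in for the crawler's link type (assumption: only getUrl/setUrl are known). */
    static class Link {
        private String url;
        Link(String url) { this.url = url; }
        String getUrl() { return url; }
        void setUrl(String url) { this.url = url; }
    }

    /** Hypothetical rewrite rule: remove a "sessionid" query parameter, keep everything else. */
    static String stripSessionId(String url) {
        // Split off the fragment first so a '?' inside it is not mistaken for a query.
        int hash = url.indexOf('#');
        String fragment = hash < 0 ? "" : url.substring(hash);
        String head = hash < 0 ? url : url.substring(0, hash);

        int q = head.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to rewrite
        }
        String base = head.substring(0, q);
        String query = head.substring(q + 1);

        // Keep every parameter except sessionid.
        StringBuilder kept = new StringBuilder();
        for (String pair : query.split("&")) {
            if (pair.isEmpty() || pair.equals("sessionid") || pair.startsWith("sessionid=")) {
                continue;
            }
            if (kept.length() > 0) {
                kept.append('&');
            }
            kept.append(pair);
        }
        return (kept.length() == 0 ? base : base + "?" + kept) + fragment;
    }

    /** Applies the rewrite rule to every link in place, mirroring the loop above. */
    static void rewriteAll(List<Link> links) {
        for (Link link : links) {
            link.setUrl(stripSessionId(link.getUrl()));
        }
    }

    public static void main(String[] args) {
        List<Link> links = new ArrayList<>();
        links.add(new Link("http://example.com/a?sessionid=123&x=1"));
        links.add(new Link("http://example.com/b"));
        rewriteAll(links);
        for (Link link : links) {
            System.out.println(link.getUrl());
        }
    }
}
```

Keeping the rule in a plain static method means it can be tested on its own before being plugged into the crawler.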

- Jorge


On Sep 2, 5:16 pm, Jorge Handl <jha...@gmail.com> wrote:
> Using the nutch html parser seems to mess with the outlinks, in my tests I
> don't get any outlink at all.
> So I'd add the url editing code in the
> com.flaptor.util.parser.HtmlParser class
> <http://code.google.com/p/flaptor-util/source/checkout>.
>
> - Jorge

B R

Sep 4, 2009, 11:14:49 AM
to hounder
I modified the com.flaptor.util.parser.HtmlParser class and it worked
well. I'll try out your suggestion of implementing it as a Crawler
module.

However, I did end up having to extract some specific HTML tags while
parsing the content. Let me try moving this functionality also into a
crawler module, along with the modification of the outlinks.

Thanks.

B R

Sep 5, 2009, 7:31:58 AM
to hounder
Hi Jorge,

Using a Crawler module for modifying outlinks as well as for
extracting HTML content worked very well.

Thanks.

Jorge Handl

Sep 5, 2009, 7:35:07 AM
to hou...@googlegroups.com
I'm glad it worked. Is your module generic enough to contribute to the project?
- Jorge