Sitemap functionalities

Julien Nioche

Dec 3, 2009, 7:06:15 AM

to crawler-commons

Hi guys,

I suppose we'd probably like to have resources to process sitemaps.
Frank McCown mentioned http://sitemap-parser.sourceforge.net/ on the
Nutch list earlier this year. Did anyone manage to have a look at the
code? Is anything worth reusing, or shall we start from scratch?

Julien

kkrugler

Dec 3, 2009, 11:14:57 PM

to crawler-commons

Hi Julien,

Haven't looked at it, sorry.

Given how simple I'm assuming the code would be, I'd say if
sitemap-parser isn't a slam dunk then we should just write our own.

On Dec 3, 4:06 am, Julien Nioche <digitalpeb...@googlemail.com> wrote:
> Hi guys,
>
> I suppose we'd probably like to have resources to process sitemaps.
> Frank McCown mentioned http://sitemap-parser.sourceforge.net/ on the

Ken Krugler

Dec 4, 2009, 12:42:59 PM

to crawler...@googlegroups.com

Hi Julien,

I took a quick look at it. Here's info from the README:

Java Sitemap Parser
http://sitemap-parser.sourceforge.net/

Contact: Frank McCown (fmc...@harding.edu)
http://www.harding.edu/fmccown/

Licensed under the Apache License, Version 2.0.

This project was created by Dr. Frank McCown's Search Engine Development students
in the Spring 2009 semester (Harding University, Searcy, AR, USA). We hope this
Sitemap Parser will enable other open source web crawlers to enhance their
crawling abilities.

The code has been tested on a number of websites that use Sitemaps, including
amazon.com. The HTTP code was borrowed from Nutch (http://lucene.apache.org/nutch/)
and can be easily substituted for other HTTP libraries.

For more information about the Sitemap Protocol, visit http://www.sitemaps.org/

The actual sitemap processing is more complex than I was expecting. I didn't know about all of the different options (plain, XML, Atom, RSS, index).
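
To make the list of variants concrete, here is a rough sketch of how a parser might tell them apart before dispatching to format-specific code. This is not taken from the sitemap-parser project; the class name SitemapFormatSketch, the Format enum, and the detect method are all hypothetical, and the only facts relied on are the usual root element names (urlset and sitemapindex from the Sitemap Protocol, feed for Atom, rss for RSS 2.0). RSS 1.0 (rdf:RDF root), gzip-compressed sitemaps, and byte-order marks are not handled.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, for illustration only: classifies sitemap content by format.
class SitemapFormatSketch {

    enum Format { PLAIN_TEXT, XML_URLSET, SITEMAP_INDEX, ATOM, RSS, UNKNOWN }

    static Format detect(String content) {
        String trimmed = content.trim();
        if (trimmed.isEmpty()) {
            return Format.UNKNOWN;
        }
        // Assumption: anything that does not look like markup is a plain-text
        // sitemap (one URL per line).
        if (!trimmed.startsWith("<")) {
            return Format.PLAIN_TEXT;
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations so untrusted input cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(trimmed.getBytes(StandardCharsets.UTF_8)));

            Element rootElement = doc.getDocumentElement();
            String root = rootElement.getLocalName();
            if (root == null) {
                root = rootElement.getNodeName();
            }
            // Dispatch on the root element name only; namespaces are not checked here.
            switch (root) {
                case "urlset":       return Format.XML_URLSET;
                case "sitemapindex": return Format.SITEMAP_INDEX;
                case "feed":         return Format.ATOM;
                case "rss":          return Format.RSS;
                default:             return Format.UNKNOWN;
            }
        } catch (Exception e) {
            // Malformed XML or an unsupported parser configuration.
            return Format.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        String xml = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>http://example.com/</loc></url></urlset>";
        System.out.println(detect(xml));
        System.out.println(detect("http://example.com/a\nhttp://example.com/b"));
    }
}
```

A real parser would also have to follow index entries to the sitemaps they list, which is where most of the extra complexity comes from.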

The code is reasonable, though it could use some cleanup.

Since it's under the Apache license, seems like we could pull it in as a starting point. Where "we" means somebody like you, hopefully :)

-- Ken

--------------------------------------------

Ken Krugler

+1 530-210-6378

http://bixolabs.com

e l a s t i c w e b m i n i n g
