Sitemap functionalities


Julien Nioche

Dec 3, 2009, 7:06:15 AM
to crawler-commons
Hi guys,

I suppose we'd probably like to have resources to process sitemaps.
Frank McCown mentioned http://sitemap-parser.sourceforge.net/ on the
Nutch list earlier this year. Did anyone manage to have a look at the
code? Is anything worth reusing, or shall we start from scratch?

Julien

kkrugler

Dec 3, 2009, 11:14:57 PM
to crawler-commons
Hi Julien,

Haven't looked at it, sorry.

Given how simple I'm assuming the code would be, I'd say if sitemap-parser
isn't a slam dunk then we should just write our own.


Ken Krugler

Dec 4, 2009, 12:42:59 PM
to crawler...@googlegroups.com
Hi Julien,

I took a quick look at it. Here's info from the README:

Java Sitemap Parser

Contact: Frank McCown (fmc...@harding.edu)

Licensed under the Apache License, Version 2.0.

This project was created by Dr. Frank McCown's Search Engine Development students
in the Spring 2009 semester (Harding University, Searcy, AR, USA).  We hope this 
Sitemap Parser will enable other open source web crawlers to enhance their 
crawling abilities.  

The code has been tested on a number of websites that use Sitemaps, including 
amazon.com.  The HTTP code was borrowed from Nutch (http://lucene.apache.org/nutch/)
and can be easily substituted for other HTTP libraries.

For more information about the Sitemap Protocol, visit http://www.sitemaps.org/

The actual sitemap processing is more complex than I was expecting. I didn't know about all of the different formats (plain text, XML, Atom, RSS, and sitemap index files).
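
To give a rough idea of what handling those formats involves, here is a minimal sketch of the first step, which is working out which kind of sitemap a fetched document is. It is not taken from sitemap-parser: the class name SitemapKindSketch, the Kind enum and the detect method are all made up for illustration. It assumes the content is already in memory as a String, decides XML flavours by the root element name (sitemapindex, urlset, feed, rss), and treats anything else as a plain-text sitemap only if every non-empty line looks like an http(s) URL.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of sitemap-parser or crawler-commons.
final class SitemapKindSketch {

    enum Kind { INDEX, XML, ATOM, RSS, TEXT, UNKNOWN }

    // Classify already-fetched sitemap content by its apparent format.
    static Kind detect(String content) {
        String trimmed = content.trim();
        if (trimmed.startsWith("<")) {
            return detectXml(trimmed);
        }
        return looksLikeUrlList(trimmed) ? Kind.TEXT : Kind.UNKNOWN;
    }

    // Parse the document and map the root element name to a format.
    // Note: the parser is not hardened against hostile input here.
    private static Kind detectXml(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            String name = root.getLocalName() != null
                    ? root.getLocalName()
                    : root.getNodeName();
            switch (name) {
                case "sitemapindex": return Kind.INDEX; // index of other sitemaps
                case "urlset":       return Kind.XML;   // regular XML sitemap
                case "feed":         return Kind.ATOM;  // Atom feed
                case "rss":          return Kind.RSS;   // RSS feed
                default:             return Kind.UNKNOWN;
            }
        } catch (Exception e) {
            // Malformed XML or parser setup failure: give up rather than guess.
            return Kind.UNKNOWN;
        }
    }

    // A text sitemap is one URL per line; require at least one URL.
    private static boolean looksLikeUrlList(String text) {
        boolean sawUrl = false;
        for (String line : text.split("\\R")) {
            String candidate = line.trim();
            if (candidate.isEmpty()) {
                continue;
            }
            if (!candidate.startsWith("http://") && !candidate.startsWith("https://")) {
                return false;
            }
            sawUrl = true;
        }
        return sawUrl;
    }

    public static void main(String[] args) {
        String sample =
            "<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\"/>";
        System.out.println(detect(sample));
    }
}
```

Each kind would then need its own extraction logic (loc elements for urlset and sitemapindex, link elements for the feeds, one URL per line for text), which is presumably where most of the complexity comes from.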

The code is reasonable, though it could use some cleanup.

Since it's under the Apache license, it seems like we could pull it in as a starting point. Where "we" means somebody like you, hopefully :)

-- Ken


--------------------------------------------
Ken Krugler
e l a s t i c   w e b   m i n i n g



