sitemap-parser in crawler-commons


DigitalPebble

Apr 22, 2010, 10:59:03 AM
to crawler...@googlegroups.com, fmc...@harding.edu
Guys,

I've made an initial port of Frank McCown's Sitemap Parser (see the discussion at http://groups.google.com/group/crawler-commons/browse_thread/thread/72dae4fd084eb6d8?pli=1) into crawler-commons. I had an email exchange with Frank recently and he kindly agreed that we could use his code in CC.

It is a simplified version of Frank's initial code, and I've done quite a bit of refactoring besides adding the license headers, replacing calls to System with logging, etc. This code only does the parsing, i.e. fetching the content is currently left to the client, but we could of course link that to a Protocol implementation when we have one.

Feel free to comment. If no one objects I will commit this into SVN at some point next week. 

Thanks

Julien

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com

DigitalPebble

Apr 22, 2010, 11:02:42 AM
to crawler...@googlegroups.com, fmc...@harding.edu
Always better with the attachment
sitemaps.patch

Andrzej Bialecki

Apr 22, 2010, 11:29:02 AM
to crawler...@googlegroups.com
On 2010-04-22 17:02, DigitalPebble wrote:
> Always better with the attachment
>
>
Patch looks good! Two minor issues:

* SimpleDateFormat is not thread-safe, yet the SiteMap uses a static
instance. I think it should use ThreadLocal-s instead.

* SiteMapURL.java and UnknownFormatException.java use \r\n as EOL-s,
unlike the rest of the patch.
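
For illustration, the ThreadLocal approach could look roughly like the sketch below. The class name SiteMapDates, the method names, and the "yyyy-MM-dd" pattern are assumptions made for the example, not taken from the patch.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper; names and date pattern are assumptions, not from the patch.
public class SiteMapDates {

    // SimpleDateFormat keeps mutable state while parsing and formatting, so a
    // single static instance shared between threads can produce corrupt results.
    // ThreadLocal hands each thread its own lazily created instance.
    private static final ThreadLocal<SimpleDateFormat> DAY_FORMAT =
            new ThreadLocal<SimpleDateFormat>() {
                @Override
                protected SimpleDateFormat initialValue() {
                    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
                    format.setTimeZone(TimeZone.getTimeZone("UTC"));
                    format.setLenient(false);
                    return format;
                }
            };

    // Parses a day-precision lastmod value using the calling thread's formatter.
    public static Date parseDay(String value) throws ParseException {
        return DAY_FORMAT.get().parse(value);
    }

    // Formats a date back to day precision using the calling thread's formatter.
    public static String formatDay(Date date) {
        return DAY_FORMAT.get().format(date);
    }
}
```

The static field stays, so callers do not change; only the formatter behind it becomes per-thread.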

Best regards,
Andrzej



Julien Nioche

Apr 22, 2010, 5:02:04 PM
to crawler-commons
Hi Andrzej,

Well spotted! I have amended the patch accordingly and will commit
within the next couple of days (unless someone objects).

Thanks

Julien

Ken Krugler

Nov 8, 2010, 12:37:40 PM
to crawler...@googlegroups.com
Hi Julien,

Random thought - would this be better implemented as a Tika parser?

For XML-based formats, it would be easy enough to detect that it's a
sitemap - and that appears to be the dominant use case, for index and
regular sitemap files.
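
As a rough sketch of that detection step, one could look at the root element only. This does not use the Tika API; it relies on the JDK's StAX reader, and the class name SitemapSniffer and the Kind enum are made up for the example. It checks the local name only and ignores the namespace, which a real detector would probably also want to verify.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical detector; not part of Tika or crawler-commons.
public class SitemapSniffer {

    public enum Kind { URLSET, INDEX, NOT_SITEMAP }

    // Reads only as far as the first start element and classifies the
    // document by that element's local name.
    public static Kind sniff(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Avoid fetching DTDs or external entities from untrusted content.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String name = reader.getLocalName();
                        if ("urlset".equals(name)) {
                            return Kind.URLSET;      // regular sitemap
                        }
                        if ("sitemapindex".equals(name)) {
                            return Kind.INDEX;       // sitemap index file
                        }
                        return Kind.NOT_SITEMAP;     // some other XML document
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Not well-formed XML, so not an XML sitemap.
            return Kind.NOT_SITEMAP;
        }
        return Kind.NOT_SITEMAP;
    }
}
```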

With this approach, there's the issue of how to communicate per-URL
meta-data such as lastmod, changefreq and priority in an
XHTML 1.0-compatible format. xmlns?

Side note - for the plain text version, I'm thinking it would be a
useful extension to modify the TXTParser to auto-detect and extract
URLs...then that would work fine.
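
A minimal sketch of that extraction, assuming the plain text form lists one URL per line. The class name TextUrlExtractor is invented for the example; this is not the TXTParser itself, and it only keeps absolute http and https URLs.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Tika's TXTParser.
public class TextUrlExtractor {

    // Returns every line that parses as an absolute http(s) URI with a host,
    // in document order. Blank lines and anything else are skipped.
    public static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<String>();
        for (String line : text.split("\\r?\\n")) {
            String candidate = line.trim();
            if (candidate.isEmpty()) {
                continue;
            }
            try {
                URI uri = new URI(candidate);
                String scheme = uri.getScheme();
                boolean web = "http".equalsIgnoreCase(scheme)
                        || "https".equalsIgnoreCase(scheme);
                if (web && uri.getHost() != null) {
                    urls.add(candidate);
                }
            } catch (URISyntaxException e) {
                // Not a URL (for example free text with spaces); ignore the line.
            }
        }
        return urls;
    }
}
```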

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
