How to crawl an entire website

Altruist

Jan 8, 2013, 12:13:28 AM

to ldsp...@googlegroups.com

Hi All,

I was wondering if LDSpider could be used to crawl an entire website (i.e., all the links), such as bbc.co.uk, using the sitemap referenced in robots.txt, and gather the triples from that crawl. If I give the seed URI as bbc.co.uk, I do not get any triples back. LDSpider does not seem to read robots.txt and follow it to sitemap.xml to crawl all the links in the sitemap.

Can this be accomplished using LDSpider?

I am using a breadth-first crawling strategy. I understand that the kind of crawling I am trying to do is not necessarily LOD in nature, but I want to crawl multiple sites and intend to use LDSpider because it can be hooked up with Any23Handler.

Thank You

Andreas Harth

Jan 8, 2013, 7:33:53 AM

to ldsp...@googlegroups.com

Hi,

On 08/01/13 06:13, Altruist wrote:
> I was wondering if LDSpider could be used to crawl an entire website
> (i.e., all the links) like bbc.co.uk using the sitemap provided in
> robots.txt and gather the triples from that crawl. If I give the seed
> URI as bbc.co.uk I do not get any triples back. LDSpider does not seem
> to follow the robots.txt and go to sitemap.xml to crawl all the links in
> the sitemap.
>
> Can this be accomplished using LDSpider?

You'd need to create a suitable seeds file and use the -y option (stay
on hostnames of seed URIs) or the equivalent using the API.

You should also set the breadth-first crawling depth to a large value.
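
As a rough illustration of what those two settings mean together, here is a conceptual sketch in Python. It is not LDSpider's code or API: `crawl_bfs` and the `get_links` callback are hypothetical names, and the link graph in the usage note is made up. It shows a breadth-first traversal that stops expanding at a given depth and ignores links whose hostname differs from the seeds' hostnames.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_bfs(seeds, get_links, max_depth):
    """Visit URLs breadth-first, staying on the hostnames of the seeds.

    get_links is a hypothetical callback mapping a URL to the URLs it links to.
    Returns the URLs in the order they were visited.
    """
    allowed_hosts = {urlparse(u).hostname for u in seeds}
    seen = set(seeds)
    queue = deque((u, 0) for u in seeds)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # depth limit reached: visit the page but do not expand it
        for link in get_links(url):
            if urlparse(link).hostname not in allowed_hosts:
                continue  # off-site link, skipped by the stay-on-host rule
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

With a small depth the traversal stops a few hops from the seed, which is why a large depth is needed to reach every page of a site by following links alone. For example, with a made-up graph:

```python
graph = {
    "http://www.bbc.co.uk/": ["http://www.bbc.co.uk/news", "http://example.org/x"],
    "http://www.bbc.co.uk/news": ["http://www.bbc.co.uk/news/uk"],
    "http://www.bbc.co.uk/news/uk": ["http://www.bbc.co.uk/deep"],
}
visited = crawl_bfs(["http://www.bbc.co.uk/"], lambda u: graph.get(u, []), max_depth=2)
```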

FWIW, the BBC Linked Data interface has a long-standing issue with
content negotiation [1].

Best regards,
Andreas.

[1] https://groups.google.com/forum/?fromgroups=#!topic/pedantic-web/uRV4LQpccbE

Altruist

Jan 8, 2013, 1:14:08 PM

to ldsp...@googlegroups.com

Thank you Andreas,

Assuming that I only need to crawl bbc.co.uk, my seeds file would contain a single URL, www.bbc.co.uk. However, this site may have a sitemap listing many URLs, and it appears to me that LDSpider does not look for URLs in sitemap.xml. In short, I need to crawl the set of URLs in sitemap.xml using LDSpider. How do I make LDSpider aware of sitemap.xml?

Thank You
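
One possible workaround, given the "suitable seeds file" advice earlier in the thread, is to expand the sitemap into a seeds file outside of LDSpider. The Python sketch below is only that, a sketch under assumptions: the function names are hypothetical, the seeds file is assumed to hold one URL per line, sitemaps are assumed to use the standard sitemaps.org namespace, and gzip-compressed sitemaps are not handled. It reads the `Sitemap:` lines from a robots.txt body, follows sitemap index files, and collects the page URLs.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt):
    """Return the URLs given in 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def locs_from_sitemap(xml_text):
    """Return (root element name, list of <loc> values) for one sitemap document."""
    root = ET.fromstring(xml_text)
    kind = root.tag.replace(SITEMAP_NS, "")  # 'urlset' or 'sitemapindex'
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
    return kind, locs


def collect_urls(robots_txt, fetch):
    """Collect page URLs from every sitemap reachable from robots.txt.

    fetch is a callback mapping a URL to the document text at that URL.
    """
    pending = sitemaps_from_robots(robots_txt)
    seen, pages = set(), []
    while pending:
        sitemap_url = pending.pop()
        if sitemap_url in seen:
            continue
        seen.add(sitemap_url)
        kind, locs = locs_from_sitemap(fetch(sitemap_url))
        if kind == "sitemapindex":
            pending.extend(locs)  # an index lists further sitemaps
        else:
            pages.extend(locs)  # a urlset lists the pages themselves
    return pages


def fetch_text(url):
    """Download a URL and decode it as UTF-8 (no gzip handling)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def write_seeds(urls, path):
    """Write one URL per line (assumed seeds file layout)."""
    with open(path, "w", encoding="utf-8") as out:
        for url in urls:
            out.write(url + "\n")
```

Usage would look roughly like `write_seeds(collect_urls(fetch_text("http://www.bbc.co.uk/robots.txt"), fetch_text), "seeds.txt")`, after which the resulting file is passed to the crawler as its seed list. Whether the pages listed in the sitemap actually yield triples still depends on the handler and on what the site serves.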
