LDSpider and robots.txt


Altruist

Jan 10, 2013, 3:02:50 PM
to ldsp...@googlegroups.com

Hi All,

I have noticed that when LDSpider is given a URL to follow, it apparently reads the site's robots.txt. I say this because when I give it a specific page on a website to follow, I see the following message in the console:

1357845366 280 127.0.0.1 TCP_MISS/200 2154 GET http://www.guardian.co.uk/robots.txt - NONE/- text/plain
Jan 10, 2013 2:16:06 PM com.ontologycentral.ldspider.http.internal.ResponseGzipUncompress process
INFO: gzip compression

Does this mean that the Disallow directives in robots.txt are respected by LDSpider for all URLs passed into the com.ontologycentral.ldspider.frontier.Frontier class?

I am passing a set of URLs to the Frontier class, and I need to make sure that LDSpider ignores any URL that matches a Disallow directive in the corresponding site's robots.txt file.
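
In case it helps to explain what I am after: one safeguard I am considering is pre-filtering the seed list myself before it ever reaches the Frontier, independently of whatever LDSpider does internally. Below is a minimal sketch of that idea. The class and method names (RobotsFilter, disallowedPrefixes, allowedOnly) are my own, not part of LDSpider, and the parser is deliberately simplified: it only honours "User-agent: *" groups and plain Disallow path prefixes, and it ignores Allow rules, wildcards, crawl-delay, and per-host caching.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RobotsFilter {

    // Fetches <scheme>://<host>/robots.txt and collects the Disallow
    // path prefixes from groups that apply to all agents ("User-agent: *").
    static List<String> disallowedPrefixes(URI uri) throws Exception {
        URL robots = new URL(uri.getScheme(), uri.getHost(), "/robots.txt");
        List<String> prefixes = new ArrayList<String>();
        boolean applies = false;
        BufferedReader in = new BufferedReader(
                new InputStreamReader(robots.openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.toLowerCase().startsWith("user-agent:")) {
                // Only track the wildcard group in this simplified sketch.
                applies = line.substring(11).trim().equals("*");
            } else if (applies && line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring(9).trim();
                // An empty Disallow value means "allow everything", so skip it.
                if (!path.isEmpty()) {
                    prefixes.add(path);
                }
            }
        }
        in.close();
        return prefixes;
    }

    // Returns only the seed URIs whose path is not covered by a Disallow rule.
    // Note: this re-fetches robots.txt per URI; a real version would cache per host.
    static List<URI> allowedOnly(List<URI> seeds) throws Exception {
        List<URI> allowed = new ArrayList<URI>();
        for (URI u : seeds) {
            boolean blocked = false;
            for (String prefix : disallowedPrefixes(u)) {
                if (u.getPath().startsWith(prefix)) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) {
                allowed.add(u);
            }
        }
        return allowed;
    }
}

With something like this, only the URIs returned by allowedOnly would be added to the Frontier, so even if LDSpider's internal handling differs from what I expect, no disallowed URL would enter the crawl in the first place.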

Can anyone please confirm LDSpider's behavior?

Thank you.
