Question about the nature of url.txt and the "network" we're considering


saqaw

Dec 2, 2012, 2:47:20 PM
to csc-32...@googlegroups.com
Hi,

I had a question about the nature of the list of URLs we're testing with and the network they create. Suppose I'm given a list of 10 URLs, and each of these URLs contains some number of links. Can I assume that all the links inside any of these URLs are limited to the URL list (meaning they form a closed-off network of websites)?
If I cannot assume that, then when I search for a keyword, do I return only URLs from the initial list (the original 10), or also any subsequent URLs I have crawled through?

Wesley May

Dec 2, 2012, 2:59:25 PM
to csc-32...@googlegroups.com
The URL list may have URLs that link to other sites not in the list. However, you only have to crawl to a depth of 1. So you should be able to return results for any site that is reachable in 1 step from a URL in the list.
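A depth-1 crawl like this can be sketched in a few lines. This is a minimal illustration, not the assignment's required structure: `fetch` is a hypothetical callable that maps a URL to its HTML (e.g. a wrapper around `urllib.request.urlopen`), injected here so the example runs offline, and the `href` regex is a stand-in for whatever link extraction the assignment expects.

```python
import re

def crawl_depth_one(url_list, fetch):
    """Collect {url: html} for each URL in url_list plus every page
    reachable in exactly one link-hop from those pages."""
    pages = {}
    for url in url_list:
        html = fetch(url)
        pages[url] = html
        # Depth 1: follow links found on the listed pages, but do not
        # recurse into the linked pages' own links.
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in pages:
                pages[link] = fetch(link)
    return pages

# Offline demo with a fake "web" (hypothetical page contents):
web = {
    "http://a.com": '<a href="http://b.com">link</a>',
    "http://b.com": "no outgoing links here",
}
pages = crawl_depth_one(["http://a.com"], web.__getitem__)
print(sorted(pages))  # both the listed URL and the depth-1 URL are crawled
```

Because only the URLs in `url_list` are scanned for links, anything a depth-1 page links to is never fetched, which matches the "depth of 1" rule above.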

Sneha

Dec 7, 2012, 12:46:38 PM
to csc-32...@googlegroups.com
Hi,

Just to clarify the above question: when a keyword exists in a URL reached at depth 1 but not in the base URL given in url_list, do we return the depth-1 URL in the result or the base URL?

Thanks!
-Sneha

Wesley May

Dec 7, 2012, 1:41:49 PM
to csc-32...@googlegroups.com
Your results should return the site that actually *has* the keyword, not just a site that *links* to that site.

So if your list only has abc.com, and it links to an xyz.com page with "cheese" on it, then searching for "cheese" should give back xyz.com.
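That rule can be pinned down in a toy sketch (hypothetical page contents, not the marking code):

```python
# Hypothetical crawled pages: abc.com is the listed URL and links to
# xyz.com, whose own text contains "cheese".
pages = {
    "http://abc.com": '<a href="http://xyz.com">a link</a>',
    "http://xyz.com": "a page about cheese",
}

def search(pages, keyword):
    # Return the URLs whose own content contains the keyword,
    # not URLs that merely link to a matching page.
    return [url for url, text in pages.items() if keyword in text]

print(search(pages, "cheese"))  # ['http://xyz.com']
```

Even though abc.com is the only URL in the list, the result is xyz.com, because that is the page that actually contains the keyword.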

Sneha

Dec 7, 2012, 3:12:53 PM
to csc-32...@googlegroups.com
Thanks a lot!

-Sneha