url crawled but links to same domain not crawled, possibly just pagination or other limits I am not aware of?


David Cottrell

Sep 11, 2017, 4:08:23 PM
to Common Crawl
The following URL is basically a list of links on the site. Some of the linked pages appear not to have been crawled, or maybe they are just not indexed?

Does anyone know if this is expected? Happy to investigate a bit further if not.

$ ./cdx-index-client.py -c CC-MAIN-2017-34 www.bis.org/list/wpapers/index.htm
2017-09-11 21:00:16,449: [INFO]: Getting Index From http://index.commoncrawl.org/CC-MAIN-2017-34-index
2017-09-11 21:00:16,779: [INFO]: Fetching 1 pages of www.bis.org/list/wpapers/index.htm

$ ./cdx-index-client.py -c CC-MAIN-2017-34 www.bis.org/publ/work622.htm
2017-09-11 21:00:51,783: [INFO]: Getting Index From http://index.commoncrawl.org/CC-MAIN-2017-34-index
No results found for: www.bis.org/publ/work622.htm
2017-09-11 21:00:54,120: [INFO]: Fetching 0 pages of www.bis.org/publ/work622.htm

$ ./cdx-index-client.py -c CC-MAIN-2017-34 www.bis.org/list/wpapers/page_3.htm
2017-09-11 21:01:12,438: [INFO]: Getting Index From http://index.commoncrawl.org/CC-MAIN-2017-34-index
2017-09-11 21:01:12,821: [INFO]: Fetching 0 pages of www.bis.org/list/wpapers/page_3.htm

Sebastian Nagel

Sep 11, 2017, 4:28:15 PM
to common...@googlegroups.com
Hi David,

there are many reasons why a page wasn't crawled this month:
- the URL is still unknown or was not included in the sample (we have to sample)
- it was excluded by robots.txt or the fetch failed otherwise
- it was crawled in a previous month and is not yet scheduled for refetch

The last one is the reason for www.bis.org/publ/work622.htm: it's included in the July 2017 crawl:
http://test-index.commoncrawl.org/CC-MAIN-2017-30-index?url=www.bis.org/publ/work622.htm

For www.bis.org/list/wpapers/page_3.htm: there are only 3 items on this page as of today.
It could be that early in August, when the August crawl was launched, there wasn't yet a page 3
listing the working papers of bis.org. :)

In any case, I recommend looking at 2-3 consecutive monthly crawls to get a mostly complete list.
Of course, there is no guarantee, because we can only crawl a sample of the web.

Best,
Sebastian
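
The advice to check several consecutive monthly crawls can be sketched in Python roughly as follows. This is a minimal sketch, not the cdx-index-client.py implementation: it assumes the index server accepts `url` and `output=json` query parameters and answers with one JSON object per line, and that an HTTP 404 means "no captures". The helper names (`index_query_url`, `parse_cdx_lines`, `find_captures`) are made up for illustration; the crawl IDs are the two mentioned in this thread.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

# Index host as printed in the client log above.
INDEX_HOST = "http://index.commoncrawl.org"


def index_query_url(crawl_id, page_url):
    """Build the index query URL for one crawl (hypothetical helper)."""
    query = urlencode({"url": page_url, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{query}"


def parse_cdx_lines(body):
    """Parse a response body holding one JSON record per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def find_captures(page_url, crawl_ids):
    """Return {crawl_id: [records]} for each crawl; needs network access."""
    found = {}
    for crawl_id in crawl_ids:
        try:
            with urlopen(index_query_url(crawl_id, page_url)) as resp:
                records = parse_cdx_lines(resp.read().decode("utf-8"))
        except HTTPError as err:
            # Assumption: the server answers 404 when a URL has no captures.
            if err.code != 404:
                raise
            records = []
        found[crawl_id] = records
    return found


# Example call (performs HTTP requests, so it is left commented out):
# find_captures("www.bis.org/publ/work622.htm",
#               ["CC-MAIN-2017-30", "CC-MAIN-2017-34"])
```

A URL that comes back empty for one crawl but has records in a neighbouring one was most likely just not scheduled for refetch that month.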
