Vertices in webgraph vs wet conversion files


Em Dil

May 3, 2024, 10:12:49 AM
to Common Crawl
Good afternoon,

I was wondering whether every vertex in the domain-level webgraph has at least one record of type 'conversion' in the .wet files.
Can vertices appear in the webgraph because hyperlinks to them were found, but pages corresponding to a vertex are not necessarily crawled? Or, if they are all crawled, will some have a .warc record where they may have blocked the crawler, so there is no .wet conversion record?

Thanks for your help.

Sebastian Nagel

May 4, 2024, 4:26:12 AM
to common...@googlegroups.com
Hi Em,


> Can vertices appear in the webgraph because hyperlinks to them were
> found, but pages corresponding to a vertex are not necessarily
> crawled,

Yes. There are multiple reasons why a domain name in the webgraph was not visited by the crawler: not reachable, disallowed by robots.txt,
not sampled in one of the crawls (it still may get sampled in the future), etc. See also:
  https://groups.google.com/g/common-crawl/c/LQcxqKF5QkQ/m/--jnmUyoBAAJ
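
To see which vertices have no crawled page, one option is a plain set difference between the webgraph vertices and the domains that actually appear in the crawl data. The following is only a minimal sketch: it assumes each vertex line is tab-separated with the reversed domain name (e.g. `com.example`) in the second column, and the helper names and sample data are made up for illustration.

```python
def reverse_domain(name: str) -> str:
    """Convert 'example.com' to the reversed form 'com.example'."""
    return ".".join(reversed(name.lower().split(".")))


def parse_vertices(lines):
    """Yield the reversed domain name from tab-separated vertex lines.

    Assumed layout per line: <id> TAB <reversed domain> [TAB <other fields>].
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield fields[1]


def uncrawled_vertices(vertex_lines, crawled_domains):
    """Return the vertices that do not appear in the set of crawled domains."""
    crawled = {reverse_domain(d) for d in crawled_domains}
    return sorted(set(parse_vertices(vertex_lines)) - crawled)


# Made-up sample data, for illustration only.
sample_vertices = ["0\tcom.example\t1\n", "1\torg.example\t2\n", "2\tnet.example\t1\n"]
sample_crawled = ["example.com", "Example.org"]

missing = uncrawled_vertices(sample_vertices, sample_crawled)
print(missing)
```

In practice the crawled domains would have to be collected from the crawl's own index or record headers, which this sketch leaves out.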



> some will have a .warc record where they may have
> blocked the crawler, so there is no .wet conversion record?

In some cases, WARC records reflect why a site wasn't visited. There are WARC records for robots.txt responses, 404s, redirects and other "unsuccessful" fetches.

Only successfully fetched HTML pages (HTTP status code 200) are contained in WET files.
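
As a rough illustration of that rule, a capture can be checked for whether a WET conversion record is expected. This is a sketch under stated assumptions, not the actual Common Crawl pipeline: the function name, the set of MIME types treated as HTML, and the sample captures are all hypothetical.

```python
# Assumption: these MIME types count as "HTML pages" for this sketch.
HTML_MIME_TYPES = {"text/html", "application/xhtml+xml"}


def expect_wet_record(status: int, mime: str) -> bool:
    """Return True if a capture is a successfully fetched HTML page (HTTP 200)."""
    # Drop parameters such as "; charset=UTF-8" before comparing.
    base = mime.split(";", 1)[0].strip().lower()
    return status == 200 and base in HTML_MIME_TYPES


# Hypothetical captures: (URL, HTTP status, Content-Type).
captures = [
    ("https://example.com/", 200, "text/html; charset=UTF-8"),
    ("https://example.com/old", 301, "text/html"),
    ("https://example.com/missing", 404, "text/html"),
    ("https://example.com/robots.txt", 200, "text/plain"),
]

with_wet = [url for url, status, mime in captures if expect_wet_record(status, mime)]
print(with_wet)
```

Redirects, 404s and robots.txt responses fall through the check, matching the "unsuccessful" fetches that have WARC records but no WET counterpart.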

Best,
Sebastian