Host and Domain Level Webgraph Creation Frequency

Matthew Wilson

Sep 21, 2023, 1:49:01 PM
to Common Crawl
Hello,

I've noticed that the frequency of the webgraph data you all generate has decreased over the years. Looking at "s3://commoncrawl/projects/hyperlinkgraph/", I see 4 data dumps with data from 2017, 6 from 2018, 5 from 2019, 4 from 2020, 4 from 2021, 3 from 2022, and so far only 1 from 2023.
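
In case it's useful, here's roughly how I counted: a quick boto3 sketch. It assumes anonymous access to the public bucket works and that each release directory name embeds a four-digit year (e.g. cc-main-2023-...); if the naming differs, the regex would need adjusting.

    import re
    from collections import Counter

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Anonymous (unsigned) client for the public commoncrawl bucket.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    # List the top-level "directories" under the hyperlinkgraph project.
    paginator = s3.get_paginator("list_objects_v2")
    releases = []
    for page in paginator.paginate(Bucket="commoncrawl",
                                   Prefix="projects/hyperlinkgraph/",
                                   Delimiter="/"):
        releases += [p["Prefix"] for p in page.get("CommonPrefixes", [])]

    # Count releases by the first four-digit year in the name; a release
    # spanning two years is counted under the earlier one.
    years = Counter()
    for name in releases:
        m = re.search(r"20\d{2}", name)
        if m:
            years[m.group(0)] += 1

    for year, n in sorted(years.items()):
        print(year, n)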

I'm wondering what frequency of webgraph generation I can expect going forward because I find this data very useful.

Thanks a lot,
Matt

PS. I am a big fan of the work you all do.

Rich Skrenta

Sep 21, 2023, 5:48:00 PM
to common...@googlegroups.com
Thanks for your note, Matthew. And we appreciate the kind words!

We have a goal to increase our crawl cadence in the near future.

Please let us know if we can be helpful.

Best,
Rich


Matthew Wilson

Sep 22, 2023, 10:58:21 AM
to Common Crawl
The "Crawl Archives" are on a separate release cadence from the "Web Graphs", right? I see at least 3 "Crawl Archive" releases this year (January/February, March/April, May/June), whereas the most recent "Web Graph" I see was data from late '22 into January of '23. Am I interpreting correctly that the goal is to increase the cadence of both the "Crawl Archive" and the "Web Graph" releases?

Thanks,
Matt

Greg Lindahl

Sep 23, 2023, 12:30:24 PM
to 'Matthew Wilson' via Common Crawl
Matt,

Traditionally the webgraph computation is done over 3 crawls, and yes, it will become more frequent as crawls become more frequent.

I wasn't around when Sebastian first set this up, but my suspicion is that the graphs are not stable if computed on a single crawl. We use the webgraph output ourselves as a quality signal to improve our choices of which pages to crawl, and that feedback loop has to be stable.

If you have an opinion about these graphs, I'd love to hear it! We could switch to a running "previous 3 crawls" strategy instead of the current "every 3rd crawl" approach if that would be valuable to people -- 3 times the cost to compute, still stable, but fresher than our current strategy.
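
To make the difference concrete, here's a toy sketch of the two windowing schemes (the crawl IDs are just illustrative, not an exact release list):

    # Two possible webgraph cadences over a list of crawl IDs.
    crawls = ["2023-06", "2023-14", "2023-23", "2023-40", "2023-50"]

    # Current strategy: one graph per block of 3 crawls ("every 3rd crawl").
    every_third = [tuple(crawls[i:i + 3]) for i in range(0, len(crawls) - 2, 3)]

    # Alternative: a sliding window over the previous 3 crawls -- one graph
    # per crawl once 3 exist, hence roughly 3x the compute.
    sliding = [tuple(crawls[i:i + 3]) for i in range(len(crawls) - 2)]

    print("every 3rd crawl:  ", every_third)
    print("previous 3 crawls:", sliding)

Either way each graph still aggregates 3 crawls, so stability shouldn't suffer; the sliding window just refreshes more often.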

-- greg
