Suggestion to build backlinks along with columnar index

Bpm Tips

Jun 15, 2022, 2:15:05 PM
to Common Crawl
A very good use case for SEO is backlinks. I believe the crawl data would be very useful if a backlink index could be preprocessed and made available with each crawl.

Sebastian Nagel

Jun 20, 2022, 2:04:18 PM
to common...@googlegroups.com
Hi,

what about the host/domain-level webgraphs?

https://commoncrawl.org/2022/03/host-and-domain-level-web-graphs-oct-nov-jan-2021-2022/

Best,
Sebastian

Bpm Tips

Aug 10, 2022, 9:43:39 AM
to Common Crawl
I may be wrong, but I think the graph contains only relationships, not the actual backlinks. Please advise if there is a way to get the backlinks from the webgraph.

Sebastian Nagel

Aug 10, 2022, 11:33:41 AM
to common...@googlegroups.com
Hi,

the host-level graph is built from links pointing from the host of one
web page to the host of another web page. Same for registered domains
and the domain-level graph.

The transpose of the graph contains the backlinks from host to host,
or from domain to domain, respectively. Some questions are easy to
answer, e.g. which hosts/domains link to a host/domain of interest.
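
As a rough illustration of that lookup, here is a minimal sketch. It assumes the graph comes as two tab-separated text files, a vertices file (`id<TAB>host`, with host names in reversed domain notation) and an edges file (`from_id<TAB>to_id`). The function name and the toy data are made up for the example, and a real graph is large enough that you would likely want something other than an in-memory dict.

```python
def backlinking_hosts(vertices, edges, target):
    """Return the hosts that link to `target`.

    vertices: iterable of "id<TAB>host" lines (assumed format)
    edges:    iterable of "from_id<TAB>to_id" lines (assumed format)
    """
    id_to_host = dict(line.rstrip("\n").split("\t", 1) for line in vertices)
    host_to_id = {host: node_id for node_id, host in id_to_host.items()}
    target_id = host_to_id[target]

    # Reading the edge list "backwards" is the transpose:
    # keep every source whose destination is the target.
    sources = set()
    for line in edges:
        src, dst = line.rstrip("\n").split("\t")
        if dst == target_id:
            sources.add(src)
    return sorted(id_to_host[s] for s in sources)


# Toy data, not taken from a real graph release.
toy_vertices = ["0\tcom.example", "1\torg.example.www", "2\tnet.example"]
toy_edges = ["0\t1", "2\t1", "1\t0"]
print(backlinking_hosts(toy_vertices, toy_edges, "org.example.www"))
```

The same filter can be run as a streaming pass over the edges file once the target id is known, so the edge list itself never has to be held in memory.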

The graphs are compact but are
- not on page-level,
- notoriously incomplete (the crawls only cover a sample of the web)
- and do not include link attributes or anchor texts

I agree that there are great use cases for a backlink index. However,
an average main crawl includes 500+ billion page-level links. That's
100 times as many rows/records as the index of the crawled pages.
Consequently, the full backlink index would be expensive to create and
download, and also not cheap to query.

For now, one way to extract backlinks is to process the WAT files.
Alternatively, you could look first for backlinking hosts/domains and
then pick pages by host/domain name.
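
To give an idea of the WAT route, here is a minimal sketch that pulls page-level links out of the JSON payload of a single WAT record. It assumes the usual WAT layout, with the page URL under `Envelope` / `WARC-Header-Metadata` / `WARC-Target-URI` and the links under `Envelope` / `Payload-Metadata` / `HTTP-Response-Metadata` / `HTML-Metadata` / `Links`. The function name and the sample record are made up, a `<base>` element is ignored, and reading records out of the gzipped WAT file itself (for example with a WARC parsing library) is left out.

```python
import json
from urllib.parse import urljoin, urlsplit


def page_links(wat_payload):
    """Yield (source_url, target_url, anchor_text) for the <a href> links
    in the JSON payload of one WAT record (layout is an assumption)."""
    envelope = json.loads(wat_payload)["Envelope"]
    source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    if not source:
        return
    html_meta = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    for link in html_meta.get("Links", []):
        # Keep anchor links only; skip images, scripts, stylesheets, ...
        if link.get("path") != "A@/href" or "url" not in link:
            continue
        target = urljoin(source, link["url"])  # resolve relative links
        if urlsplit(target).scheme in ("http", "https"):
            yield source, target, link.get("text", "")


# Hand-written sample record, not real crawl data.
sample = json.dumps({"Envelope": {
    "WARC-Header-Metadata": {"WARC-Target-URI": "https://a.example/dir/page.html"},
    "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {"Links": [
        {"path": "A@/href", "url": "../other.html", "text": "Other"},
        {"path": "IMG@/src", "url": "logo.png"},
        {"path": "A@/href", "url": "mailto:x@example.org"},
        {"path": "A@/href", "url": "https://b.example/"},
    ]}}},
}})
links = list(page_links(sample))
for row in links:
    print(row)
```

Swapping the first two fields of each tuple and grouping by target URL gives the backlink view of the same data.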

Best,
Sebastian

Bpm Tips

Aug 10, 2022, 11:41:46 AM
to Common Crawl
Thanks for the explanation. I plan to build such a backlink index using Spark and HTTPS downloads of the WAT files. I have decent processing power and a good Spark cluster. Any pointers before I embark on this? I don't plan to keep the WAT files, just download, process, and discard the raw data. I won't be using S3 storage, since I have existing computing resources and bandwidth and don't want to spend money on EC2.
I plan to use the following
