List of 1,000 websites

Julius Hamilton

May 22, 2022, 5:23:42 PM
to common...@googlegroups.com
Hey,

I am looking for a list of URLs that represents the web as evenly as possible.

It is meant to be a representative sample.

It should be diverse in terms of webpage type, content type, etc.

The reason is that I would like to test a web scraping tool on as diverse a set of web pages as possible - to ensure it works across all kinds of websites.

I haven’t been able to find any information about how to pull something like that off.

Is there any aspect of Common Crawl I could use to extract a diverse representation of URLs?

For example, are there URLs in Common Crawl that are classified by web page type? I mean the specific underlying web technologies - React, Ajax, JavaScript, PyScript, etc., any web framework whatsoever.

Or by category? Newspaper, social media, blog, etc?

If not, I'll just write my own Wikipedia scraper and hope to find some diverse URLs there.

Thanks very much,
Julius

Sebastian Nagel

May 23, 2022, 4:14:41 PM
to common...@googlegroups.com
Hi Julius,

the sampling of URLs for Common Crawl is based on random selection
weighted by hyperlink centrality - a higher-ranking site is allowed
to contribute more pages to the selected sample.

You can find the latest centrality ranks at the level of hosts and
registered domains here:

https://commoncrawl.org/2022/03/host-and-domain-level-web-graphs-oct-nov-jan-2021-2022/

The crawler clearly favors HTML over any other document format
and avoids link spam. I'm not sure whether this would violate
your requirement of an even and representative sample?

The crawler is operated in North America, which also adds some
bias. More details here (and in prior discussions on this list):

https://indico.cern.ch/event/1006978/contributions/4539477/attachments/2325769/3962907/ossym2021-sn-web-graphs-crawling.pdf

Metrics about the coverage of top-level domains and content languages
in Common Crawl are here:
https://commoncrawl.github.io/cc-crawl-statistics/

In case you want to sample URLs from CC, there's a ready-to-use
example query:

https://github.com/commoncrawl/cc-index-table/blob/main/src/sql/examples/cc-index/random-sample-urls.sql
for the columnar index:

https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

> If not, I'll just write my own Wikipedia scraper and hope to find
> some diverse URLs there.

Not a bad idea, especially because you could sample the URLs based on
article views:
https://dumps.wikimedia.org/other/analytics/
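
For instance, a rough Python sketch of view-weighted sampling from one
of those dumps - the file name is a placeholder, and I'm assuming the
documented pageviews line format (domain code, page title, view count,
response bytes):

import bz2
import random

# Placeholder: any hourly pageviews dump from the URL above,
# downloaded locally. Lines are assumed to look like:
#   en Main_Page 242332 0
PAGEVIEWS_FILE = "pageviews-20220601-000000.bz2"

titles, weights = [], []
with bz2.open(PAGEVIEWS_FILE, "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        parts = line.split()
        if len(parts) != 4 or parts[0] != "en":
            continue  # keep only English Wikipedia desktop entries
        titles.append(parts[1])
        weights.append(int(parts[2]))

# Draw 100 article URLs, weighted by view count.
for title in random.choices(titles, weights=weights, k=100):
    print("https://en.wikipedia.org/wiki/" + title)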

Best,
Sebastian

Tom Alby

May 23, 2022, 4:26:56 PM
to Common Crawl
Hey Julius,

I had the same challenge, and I'd refrain from using Wikipedia as it is not representative at all. Also, you don't need to scrape it, since you can simply download the data. Using a sample of Common Crawl data is probably the best choice, even though there are some caveats. I have just submitted a paper about this (and Sebastian gave me valuable input); please drop me a line if you want to know more.

Best

Tom

Julius Hamilton

Jun 13, 2022, 2:11:52 PM
to common...@googlegroups.com
Thank you very much.

I will look more into this soon.

So Common Crawl provides data about the centrality of webpages, based on how frequently they are linked to? Is CC able to know how often a page is visited?

Is it possible for a spider to crawl “outwards” by starting at one or a few central places and moving evenly toward more and more obscure websites?

I read that Common Crawl is petabytes in size and is downloaded via S3.

So it would be infeasible to download all of this data, and we should instead use the tools on GitHub to select subsets of it?

Thank you,
Julius

Julius Hamilton

Jul 14, 2022, 10:11:18 AM
to Common Crawl
Cool - with Sebastian's help I was finally able to get going with Athena to access Common Crawl - and sure enough, there was an extremely convenient SQL query to retrieve 100 random URLs. It worked perfectly. Thanks very much.

- Julius

Julius Hamilton

Jul 19, 2022, 10:05:11 AM
to common...@googlegroups.com
One question about this SQL query:

SELECT url
FROM "ccindex"."ccindex"
TABLESAMPLE BERNOULLI (.5)
WHERE crawl = 'CC-MAIN-2020-34'
  AND (subset = 'warc' OR subset = 'crawldiagnostics')

This returned only URLs to blogs, and some of the pages could not be found/were no longer accessible.

The latter I assume is because this crawl is two years old.

But why did it return only blogs?

Thank you,
Julius

Sebastian Nagel

Jul 19, 2022, 12:51:42 PM
to common...@googlegroups.com
Hi Julius,

> This returned only URLs to blogs, and some of the pages could not be
> found/were no longer accessible.

Did you browse through all result pages?

If I run the query and extract the host names from all sampled URLs,
there is a broad variety of host names. However, the first result page
shown in the Athena console includes only sites from blogspot.com.
But this is by accident - sampling does not necessarily also mean
shuffling.

Note: you may want to adjust the sampling rate -
0.5% of 3 billion captures (successful fetches *and* 404s, redirects,
etc.) => 15 million sampled rows.
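
For example, a sketch of a lower-rate variant, restricted to successful
fetches (the 'warc' subset) and explicitly shuffled - rand() is Presto's
random function, and the LIMIT is arbitrary:

SELECT url
FROM "ccindex"."ccindex"
TABLESAMPLE BERNOULLI (.01)
WHERE crawl = 'CC-MAIN-2022-27'
  AND subset = 'warc'
ORDER BY rand()
LIMIT 1000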


> The latter I assume is because this crawl is two years old.

Yes, of course. Just replace 'CC-MAIN-2020-34' with 'CC-MAIN-2022-27'
or the newest crawl. For a list of crawls, see
https://data.commoncrawl.org/crawl-data/index.html

or run the Athena query

SHOW PARTITIONS ccindex;

This will show all available partitions as <crawl, subset> combinations, e.g. crawl=CC-MAIN-2022-27/subset=warc.

Best,
Sebastian

Julius Hamilton

Jul 20, 2022, 10:05:36 AM
to Common Crawl
Thank you very much.

I am trying to incorporate the ability to retrieve N random URLs into a Python script.

I am considering either using the AWS SDK for Python "Boto3" or the AWS CLI.

Because I have already set up my instance and everything, I am pretty sure I would just need to use the "start_query_execution" command: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/athena.html#Athena.Client.start_query_execution
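
Something like this minimal sketch is what I have in mind (the result
bucket, region, and query are placeholders on my side):

import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # placeholder region

QUERY = """
SELECT url
FROM "ccindex"."ccindex"
TABLESAMPLE BERNOULLI (.01)
WHERE crawl = 'CC-MAIN-2022-27'
  AND subset = 'warc'
LIMIT 100
"""

# Start the query; Athena writes results to an S3 bucket I own.
qid = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "ccindex"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows[1:]:  # the first row is the column header
        print(row["Data"][0]["VarCharValue"])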

I'll look into this; if you know anything about it, perhaps you could let me know.

Thanks very much,
Julius

Sebastian Nagel

Jul 20, 2022, 11:37:56 AM
to common...@googlegroups.com
Hi Julius,

have a look at PyAthena (https://pypi.org/project/PyAthena/),
a quite comprehensive API for Athena that encapsulates the
remaining parts - waiting for the query to finish, checking
for errors, downloading and iterating over the results, etc.
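
A minimal sketch - the staging bucket and region are placeholders, and
the query is the sampling example from above:

from pyathena import connect

# s3_staging_dir is where Athena writes its result files;
# use a bucket you own.
cursor = connect(
    s3_staging_dir="s3://my-athena-results-bucket/",  # placeholder
    region_name="us-east-1",                          # placeholder
).cursor()

cursor.execute("""
    SELECT url
    FROM "ccindex"."ccindex"
    TABLESAMPLE BERNOULLI (.01)
    WHERE crawl = 'CC-MAIN-2022-27'
      AND subset = 'warc'
    LIMIT 100
""")

for (url,) in cursor:
    print(url)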

Best,
Sebastian
