Using PySpark

Julius Hamilton

Jul 1, 2022, 10:47:48 AM
to Common Crawl
Hey,

I am trying to work through this repository so I can use PySpark to get a list of 1,000 common websites: https://github.com/commoncrawl/cc-pyspark

These links were suggested to me in order to do this:



There's quite a lot to read through there.

I'm seeing some sample code like this:

$SPARK_HOME/bin/spark-submit ./server_count.py \
    --num_output_partitions 1 --log_level WARN \
    ./input/test_warc.txt servernames

So you use "Spark" to run some kind of Python script?

Could anyone provide sample code for me which downloads 1,000 diverse websites from Common Crawl? (Just the URLs, not the page text.) The sample code could help me return to the documentation and understand it better.

Thanks,
Julius

Sebastian Nagel

Jul 7, 2022, 11:16:59 AM
to common...@googlegroups.com
Hi Julius,

> Could anyone provide sample code for me which downloads 1,000 diverse
> websites from Common Crawl? (Just the URLs, not the page text.)

If you only need URLs or site names (domain names), there is no need
to run a Spark job - you should use the URL index. If it's about
domain names, you may have a look at the hyperlink graphs and graph-based
rankings, e.g.
https://commoncrawl.org/2022/03/host-and-domain-level-web-graphs-oct-nov-jan-2021-2022/

Best,
Sebastian
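
For illustration, a minimal sketch of what "use the URL index" can look like without Spark, assuming the Python requests library and the public CDX API at index.commoncrawl.org; the collection name CC-MAIN-2022-21, the example domain, and the query parameters (which follow the pywb CDX server conventions) are assumptions, not taken from the thread:

import json
import requests

# Assumption: the CDX index endpoint for one crawl; "CC-MAIN-2022-21" is
# just an example collection name (index.commoncrawl.org lists them all).
CDX_API = "https://index.commoncrawl.org/CC-MAIN-2022-21-index"

params = {
    "url": "commoncrawl.org",  # example domain to look up
    "matchType": "domain",     # include subdomains as well
    "output": "json",          # one JSON record per line
    "fl": "url",               # only return the URL field
    "limit": "1000",           # cap the number of records
}

resp = requests.get(CDX_API, params=params)
resp.raise_for_status()

# Each non-empty line is a small JSON object holding one captured URL.
urls = [json.loads(line)["url"] for line in resp.text.splitlines() if line]
print(len(urls), "URLs for this domain")

This lists URLs for a single domain; to cover many different sites, one could repeat the lookup per domain, e.g. over domain names taken from the graph-based rankings mentioned above.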

Julius Hamilton

Jul 11, 2022, 8:24:21 AM
to Common Crawl
Thank you - so I take it this is the guide page I should follow: https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

I should run an SQL query with Athena.

I'll give that a shot.

Thanks very much,
Julius
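
As a rough sketch of what such an Athena query might look like when driven from Python with boto3: this assumes the columnar index has been registered in Athena as the ccindex.ccindex table as described in the linked guide, that s3://my-athena-results/ is a results bucket you own, and that "one URL per registered domain, capped at 1,000" is a reasonable reading of "diverse websites"; the crawl label CC-MAIN-2022-21 is again just an example.

import time
import boto3

# Assumed setup: columnar URL index registered in Athena as database
# "ccindex", table "ccindex"; "s3://my-athena-results/" is a results
# bucket you own. The query picks one URL per registered domain.
QUERY = """
SELECT url_host_registered_domain,
       MIN(url) AS sample_url
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2022-21'
  AND subset = 'warc'
GROUP BY url_host_registered_domain
LIMIT 1000
"""

athena = boto3.client("athena", region_name="us-east-1")

# Submit the query and remember its execution id.
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "ccindex"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

# Print the first page of results (domain, sample URL), skipping the header row.
if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"][1:]:
        print([col.get("VarCharValue") for col in row["Data"]])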