Has anyone adapted the cc-pyspark examples to https?


Vittorio Rossi

Mar 3, 2024, 3:37:23 PM
to Common Crawl
Hello,
I've been testing the .py examples provided in the cc-pyspark GitHub repo <https://github.com/commoncrawl/cc-pyspark>.
I see the new https endpoint mentioned here and there, e.g. in connection with the --input_base_url parameter, but I haven't managed to turn examples such as the following into their https equivalents.

$SPARK_HOME/bin/spark-submit \
    --packages org.apache.hadoop:hadoop-aws:3.3.2 \
    ./cc_index_word_count.py \
    --input_base_url s3://commoncrawl/ \
    --query "SELECT url, warc_filename, warc_record_offset, warc_record_length, content_charset FROM ccindex WHERE crawl = 'CC-MAIN-2020-24' AND subset = 'warc' AND url_host_tld = 'is' LIMIT 10" \
    s3a://commoncrawl/cc-index/table/cc-main/warc/ \
    myccindexwordcountoutput \
    --num_output_partitions 1 \
    --output_format json

If anyone has already looked into it and has a solution, some advice would be really helpful.

Sebastian Nagel

Mar 4, 2024, 5:36:29 AM
to common...@googlegroups.com
Hi Vittorio,

to read the WARC records, just replace

--input_base_url s3://commoncrawl/

by

--input_base_url https://data.commoncrawl.org/

However, executing the SQL query with Spark over the data on S3 still
requires authenticated AWS access. That's because the S3A protocol
implements a Hadoop filesystem that supports "directory" listings, which
are necessary to find all the Parquet files holding the columnar index
(here, those of a specific partition: crawl=CC-MAIN-2020-24 and
subset=warc). There is no way to execute a Spark SQL query on the
columnar index remotely via HTTPS.
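
In other words, only the --input_base_url changes; the table path must
stay on s3a:// and still needs AWS credentials. As an untested sketch,
your command above would become:

$SPARK_HOME/bin/spark-submit \
    --packages org.apache.hadoop:hadoop-aws:3.3.2 \
    ./cc_index_word_count.py \
    --input_base_url https://data.commoncrawl.org/ \
    --query "SELECT url, warc_filename, warc_record_offset, warc_record_length, content_charset FROM ccindex WHERE crawl = 'CC-MAIN-2020-24' AND subset = 'warc' AND url_host_tld = 'is' LIMIT 10" \
    s3a://commoncrawl/cc-index/table/cc-main/warc/ \
    myccindexwordcountoutput \
    --num_output_partitions 1 \
    --output_format json

With this setup the index lookup runs via S3A, while the matched WARC
records are fetched over HTTPS.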

For additional information, see the README of cc-pyspark.

Best,
Sebastian


Vittorio Rossi

Mar 4, 2024, 5:55:52 AM
to Common Crawl
This makes sense.
If I understand correctly, the s3a protocol allows for schema discovery, which doesn't happen over https. By changing the input_base_url, the data.commoncrawl.org endpoint is used to retrieve documents but not to locate them.
Right?

Sebastian Nagel

Mar 4, 2024, 6:15:30 AM
to common...@googlegroups.com
Hi Vittorio,

the underlying problem is that https is an access scheme for files (or
file-like data) but not a file system. To figure out which files it
needs to read, Spark first requests a listing of the subdirectories
below

s3a://commoncrawl/cc-index/table/cc-main/warc/

and later lists all files under

s3a://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2020-24/subset=warc/

and does schema discovery on those files.

The HTTP(S) protocol simply does not provide a method to list all
paths below a prefix, while S3A does. Under the hood, the S3 REST API
[1] (used by S3A) runs over HTTP(S) with extra calls for authentication
and functionality such as directory/prefix listings.
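
As an illustration (a minimal sketch using boto3 directly, not part of
cc-pyspark, and assuming AWS credentials are configured), the prefix
listing that S3A relies on looks like this; plain HTTPS can only fetch
an object whose full path you already know:

import boto3

# Ask S3 for the partition "directories" below the columnar index prefix.
# HTTP(S) alone has no equivalent operation.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="commoncrawl",
    Prefix="cc-index/table/cc-main/warc/",
    Delimiter="/",
)
for cp in resp.get("CommonPrefixes", []):
    print(cp["Prefix"])  # e.g. cc-index/table/cc-main/warc/crawl=CC-MAIN-2020-24/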

Best,
Sebastian

[1] https://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html
