Hi Vittorio,
The underlying problem is that https is an access scheme for files (or
file-like data), not a file system. To figure out which files it needs
to read, Spark first requests a listing of the subdirectories below
s3a://commoncrawl/cc-index/table/cc-main/warc/
and later lists all files on
s3a://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2020-24/subset=warc/
and then performs schema discovery on those files.
The HTTP(S) protocol simply does not provide a method to list all
paths below a prefix, while S3A does. Under the hood, the S3 REST API
[1] (used by S3A) also runs over HTTP(S), but adds calls for
authentication and for functionality such as directory/prefix listings.
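
To make the difference concrete, here is a minimal Python sketch of
what such a prefix listing looks like at the REST level. It only builds
the request URL for the S3 ListObjectsV2 call and parses a response of
that shape; it does not send anything over the network. The endpoint
https://commoncrawl.s3.amazonaws.com, the helper names and the sample
response (including the file name part-00000.parquet) are my
assumptions for illustration, and this is not what S3A does internally.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace used by S3 listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def list_request_url(endpoint: str, prefix: str) -> str:
    """Build a ListObjectsV2 URL listing one "directory" level below prefix.

    `endpoint` is an assumed bucket endpoint; the query parameters
    (list-type, prefix, delimiter) belong to the S3 REST API.
    """
    query = urlencode({"list-type": "2", "prefix": prefix, "delimiter": "/"})
    return f"{endpoint}/?{query}"


def parse_listing(xml_text: str) -> tuple[list[str], list[str]]:
    """Return (sub-prefixes, object keys) from a ListBucketResult document."""
    root = ET.fromstring(xml_text)
    subdirs = [e.text for e in root.findall("s3:CommonPrefixes/s3:Prefix", S3_NS)]
    files = [e.text for e in root.findall("s3:Contents/s3:Key", S3_NS)]
    return subdirs, files


# Hypothetical, hand-written response; a real one carries more fields.
SAMPLE_RESPONSE = """
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Prefix>cc-index/table/cc-main/warc/</Prefix>
  <CommonPrefixes>
    <Prefix>cc-index/table/cc-main/warc/crawl=CC-MAIN-2020-24/</Prefix>
  </CommonPrefixes>
  <Contents>
    <Key>cc-index/table/cc-main/warc/part-00000.parquet</Key>
  </Contents>
</ListBucketResult>
""".strip()

url = list_request_url(
    "https://commoncrawl.s3.amazonaws.com",  # assumed endpoint
    "cc-index/table/cc-main/warc/",
)
subdirs, files = parse_listing(SAMPLE_RESPONSE)
print(url)
print(subdirs)
print(files)
```

A plain HTTPS download has no counterpart to this call: you can fetch a
file whose full path you already know, but you cannot ask the server
which paths exist below a prefix.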
Best,
Sebastian
[1]
https://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html
On 3/4/24 11:55, Vittorio Rossi wrote:
> This makes sense.
> If I understand correctly, the s3a protocol allows for schema discovery,
> which doesn't happen over https. By changing the input_base_url, the
> /data.commoncrawl /endpoint is used to retrieve documents but not to