Hello,
I am the developer of the following library:
https://github.com/hynky1999/CmonCrawl.
I have recently added support for AWS Athena, which should be as transparent as possible: the user provides the domains, date range, etc. they want to query, and the library builds a SQL query and runs it against Athena. The resulting query can look like this:
```
SELECT cc.url,
       cc.fetch_time,
       cc.warc_filename,
       cc.warc_record_offset,
       cc.warc_record_length
FROM "commoncrawl"."ccindex" AS cc
WHERE (cc.fetch_time BETWEEN CAST('2021-01-01 00:00:00' AS TIMESTAMP) AND CAST('2021-04-28 23:59:59' AS TIMESTAMP))
  AND (cc.crawl = 'CC-MAIN-2021-49' OR cc.crawl = 'CC-MAIN-2021-43' OR cc.crawl = 'CC-MAIN-2021-39' OR cc.crawl = 'CC-MAIN-2021-31' OR cc.crawl = 'CC-MAIN-2021-25' OR cc.crawl = 'CC-MAIN-2021-21' OR cc.crawl = 'CC-MAIN-2021-17' OR cc.crawl = 'CC-MAIN-2021-10' OR cc.crawl = 'CC-MAIN-2021-04')
  AND (cc.fetch_status = 200)
  AND (cc.subset = 'warc')
  AND ((cc.url_host_name = 'web2.co.uk' OR cc.url_host_name = 'www.web2.co.uk') OR (cc.url_host_name = 'web1.com' OR cc.url_host_name = 'www.web1.com'))
  AND (REGEXP_LIKE(cc.url_path, '^(/(en-uk|en-us|en-gb|en))?(/[^/]+){0,1}/?$'))
  AND cc.content_mime_type = 'text/html'
  AND LENGTH(cc.url_path) < 60
  AND (cc.url_query IS NULL OR cc.url_query = '')
ORDER BY url
```
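For context, a query like this is assembled from the user's input. The sketch below shows roughly how that can be done; the function and parameter names are illustrative, not CmonCrawl's actual API, and only a subset of the filters above is generated:

```python
# Hypothetical sketch of building an Athena query over the Common Crawl
# columnar index from user input (domains, date range, crawl labels).
# Names are illustrative, not CmonCrawl's real API.
from datetime import datetime


def build_athena_query(domains, since, to, crawls):
    """Assemble a SQL query against "commoncrawl"."ccindex"."""
    time_filter = (
        f"(cc.fetch_time BETWEEN CAST('{since:%Y-%m-%d %H:%M:%S}' AS TIMESTAMP) "
        f"AND CAST('{to:%Y-%m-%d %H:%M:%S}' AS TIMESTAMP))"
    )
    # One equality test per crawl label, OR-ed together.
    crawl_filter = "(" + " OR ".join(f"cc.crawl = '{c}'" for c in crawls) + ")"
    # Match each domain with and without the www. prefix.
    host_filter = "(" + " OR ".join(
        f"(cc.url_host_name = '{d}' OR cc.url_host_name = 'www.{d}')"
        for d in domains
    ) + ")"
    return (
        "SELECT cc.url, cc.fetch_time, cc.warc_filename, "
        "cc.warc_record_offset, cc.warc_record_length "
        'FROM "commoncrawl"."ccindex" AS cc '
        f"WHERE {time_filter} AND {crawl_filter} "
        "AND (cc.fetch_status = 200) AND (cc.subset = 'warc') "
        f"AND {host_filter} ORDER BY url"
    )


query = build_athena_query(
    domains=["web1.com"],
    since=datetime(2021, 1, 1),
    to=datetime(2021, 4, 28, 23, 59, 59),
    crawls=["CC-MAIN-2021-04"],
)
print(query)
```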
When I was playing with this about 3 weeks ago, I had no problem with it. But since about 2 weeks ago, all queries have been failing with:
`Error opening Hive split s3..... Slow down`
This error only shows up when certain cc.crawl values are included in the query. The question is how to avoid this problem. From my understanding, each S3 bucket prefix can sustain up to 5,500 GET/HEAD requests per second. I checked the problematic crawl, e.g. s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2021-39/subset=warc/, and there are 300 files in there. So a query over that whole cc.crawl with subset=warc should make at most 300 requests per second.
Is there anything I can do about this? I thought the huge request volume only hit the S3 prefixes containing the actual data, e.g. `s3://commoncrawl/crawl-data/`.
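In the meantime, the only client-side workaround I can think of is retrying throttled queries with exponential backoff. A minimal sketch, where `run_query` and the error type are stand-ins for whatever the real Athena call raises:

```python
import random
import time


def retry_on_slowdown(run_query, max_attempts=5, base_delay=1.0):
    """Retry a callable when it fails with S3's 'Slow down' throttling,
    sleeping exponentially longer (with jitter) between attempts."""
    for attempt in range(max_attempts):
        try:
            return run_query()
        except RuntimeError as exc:  # stand-in for the real Athena error type
            if "Slow down" not in str(exc) or attempt == max_attempts - 1:
                raise  # not a throttle, or out of attempts: re-raise
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)


# Demo with a stub that gets throttled twice, then succeeds.
calls = {"n": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Error opening Hive split s3://... Slow down")
    return "ok"

result = retry_on_slowdown(flaky_query, base_delay=0.01)
print(result)  # succeeds on the third attempt
```

This only papers over the problem, though; it doesn't explain why the throttling started appearing in the first place.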