Querying AWS Athena


Hynek Kydlíček

Oct 28, 2023, 5:59:55 PM
to Common Crawl
Hello,
I am the developer of the following library: https://github.com/hynky1999/CmonCrawl.
I have recently added support for AWS Athena, which should be as transparent as possible.
Basically, the user provides the domains, date range, etc. they want to query, and the library builds a SQL query for Athena. The resulting query can look like this:

```
SELECT cc.url,
        cc.fetch_time,
        cc.warc_filename,
        cc.warc_record_offset,
        cc.warc_record_length
FROM "commoncrawl"."ccindex" AS cc
WHERE cc.fetch_time BETWEEN CAST('2021-01-01 00:00:00' AS TIMESTAMP) AND CAST('2021-04-28 23:59:59' AS TIMESTAMP)
  AND (cc.crawl = 'CC-MAIN-2021-49' OR cc.crawl = 'CC-MAIN-2021-43' OR cc.crawl = 'CC-MAIN-2021-39'
       OR cc.crawl = 'CC-MAIN-2021-31' OR cc.crawl = 'CC-MAIN-2021-25' OR cc.crawl = 'CC-MAIN-2021-21'
       OR cc.crawl = 'CC-MAIN-2021-17' OR cc.crawl = 'CC-MAIN-2021-10' OR cc.crawl = 'CC-MAIN-2021-04')
  AND cc.fetch_status = 200
  AND cc.subset = 'warc'
  AND (cc.url_host_name = 'web2.co.uk' OR cc.url_host_name = 'www.web2.co.uk'
       OR cc.url_host_name = 'web1.com' OR cc.url_host_name = 'www.web1.com')
  AND REGEXP_LIKE(cc.url_path, '^(/(en-uk|en-us|en-gb|en))?(/[^/]+){0,1}/?$')
  AND cc.content_mime_type = 'text/html'
  AND LENGTH(cc.url_path) < 60
  AND (cc.url_query IS NULL OR cc.url_query = '')
ORDER BY url
```
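To illustrate the "library builds the SQL" step, here is a minimal sketch of how such a query string could be assembled from user input. The helper name and its parameters are hypothetical, not CmonCrawl's actual code, and it only covers a subset of the predicates above:

```python
# Hypothetical sketch: build an Athena query over the Common Crawl index
# from user-supplied domains, crawls, and a fetch-time range.
# This is NOT CmonCrawl's real implementation, just an illustration.

def build_ccindex_query(domains, crawls, since, until):
    """Return a SQL string selecting WARC record locations for the inputs."""
    # Match each domain both with and without the "www." prefix.
    host_terms = " OR ".join(
        f"cc.url_host_name = '{h}'"
        for d in domains
        for h in (d, f"www.{d}")
    )
    crawl_terms = " OR ".join(f"cc.crawl = '{c}'" for c in crawls)
    return (
        "SELECT cc.url, cc.fetch_time, cc.warc_filename, "
        "cc.warc_record_offset, cc.warc_record_length "
        'FROM "commoncrawl"."ccindex" AS cc '
        f"WHERE cc.fetch_time BETWEEN CAST('{since}' AS TIMESTAMP) "
        f"AND CAST('{until}' AS TIMESTAMP) "
        f"AND ({crawl_terms}) AND ({host_terms}) "
        "AND cc.fetch_status = 200 AND cc.subset = 'warc' "
        "ORDER BY url"
    )

sql = build_ccindex_query(
    ["web1.com"], ["CC-MAIN-2021-04"],
    "2021-01-01 00:00:00", "2021-04-28 23:59:59",
)
```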

When I was playing with this about 3 weeks ago, I had no problem with it. But since about 2 weeks ago, all queries have been failing with `Error opening Hive split s3..... Slow down`.

This error only shows up when certain cc.crawl values are included in the query. The question is how to avoid this problem. From my understanding, each AWS S3 bucket prefix can handle up to 5,500 GET/HEAD requests per second. I checked the problematic crawl, e.g. s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2021-39/subset=warc/, and there are 300 files in there. So a query over the whole cc.crawl with subset=warc should make at most 300 requests per second.

Is there anything I can do about this? I thought the insane number of requests only went to the s3 buckets containing the actual data, e.g. `s3://commoncrawl/crawl-data/`.




yo...@yossi.at

Oct 28, 2023, 6:41:15 PM
to common...@googlegroups.com

Hi Hynek,

In S3, only the first path element (commoncrawl) is the bucket. Everything else is the key within the bucket, so all of Common Crawl is in one bucket.

The limits that are causing the "slow down" errors are per bucket, not per user or IP, so it's not necessarily your requests that are causing the problem, but the aggregate of all users' requests.
Yossi.


Hynek Kydlíček

Oct 28, 2023, 7:56:13 PM
to Common Crawl
Yeah, but the S3 request limit should be per prefix, right? E.g. commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2021-39/subset=warc/ is a separate prefix. It must work this way, as the problems only occur for specific crawls.

My problem is that Athena is basically unusable if we can't scan whole crawls, and right now there is basically no way to query the index. If there were a way to bump up the number of retries Athena makes while accessing S3, it would solve the issue, but to my knowledge there is no such option :(.
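While Athena's internal retries toward S3 can't be tuned, client-side calls (e.g. re-submitting a failed query from a library like this one) can be wrapped in exponential backoff with jitter, the pattern AWS recommends for SlowDown (HTTP 503) responses. A stdlib-only sketch, where the callable passed in is a placeholder for whatever submits the query:

```python
import random
import time

def with_backoff(fn, max_attempts=8, base=0.5, cap=30.0):
    """Call fn(), retrying on SlowDown-style errors with capped
    exponential backoff plus full jitter between attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:  # real code would catch the SDK's throttling error
            if "Slow down" not in str(exc) or attempt == max_attempts - 1:
                raise
            # Sleep between 0 and min(cap, base * 2**attempt) seconds.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

For the AWS API calls you make yourself, boto3/botocore expose a similar policy via the client retry configuration (e.g. `Config(retries={"max_attempts": 10, "mode": "adaptive"})`), but note that this only governs your own requests, not the reads Athena performs against the commoncrawl bucket on your behalf.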

On Sunday, October 29, 2023 at 12:41:15 AM UTC+2, Yossi Tamari wrote: