503 error for Athena common crawl

Zhan Su

Oct 22, 2023, 5:51:24 PM
to Common Crawl
Hello, I am using Athena to run some SQL queries over the Common Crawl index. However, when I run the example from the website, I get an error:

SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY  url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY  count DESC

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-05/subset=warc/part-00128-248eba37-08f7-4a53-a4b4-d990640e4be4.c000.gz.parquet (offset=33554432, length=67108864): com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: 1N73JJWB0JBP7E8S; S3 Extended Request ID:

It seems that Common Crawl is limiting the request rate. Could you please help me with that?

Greg Lindahl

Oct 22, 2023, 6:49:57 PM
to common...@googlegroups.com
Zhan Su,

This is the same problem with 503 errors that we're already talking
about. It started affecting Athena very recently.

Sorry,

-- greg

Zhan Su

Oct 24, 2023, 9:58:45 AM
to common...@googlegroups.com
Thanks for your feedback. Do you know how long it will take to fix this problem?

Best, 
Zhan 

Amir Shukayev

Oct 25, 2023, 2:05:47 PM
to Common Crawl
Which AWS region are you running this from? Have you tried using us-east-1?

Zhan Su

Oct 25, 2023, 2:31:13 PM
to common...@googlegroups.com
Actually, I am already using us-east-1.

Henry S. Thompson

Oct 25, 2023, 6:41:45 PM
to common...@googlegroups.com
I can now report modest success running two simultaneous serial
threads: averaging 2 minutes per warc.gz file from CC-MAIN-2023-40,
they downloaded half a segment (456 files, 228 per thread) in just
under 8 hours.

Fastest single download was 50 seconds in one thread, 49 in the other.

Since no-one else is reporting any success, I'm guessing that what I'm
seeing depends crucially on the settings reported in my previous
message:

aws configure --profile [yourChoice] set s3.multipart_threshold 4GB
aws configure --profile [yourChoice] set s3.max_concurrent_requests 1
aws configure --profile [yourChoice] set s3.multipart_chunksize 32MB
aws configure --profile [yourChoice] set retry_mode adaptive
aws configure --profile [yourChoice] set max_attempts 100
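
These can equivalently live in ~/.aws/config under the chosen profile;
this is just the standard AWS CLI config-file form of the five commands
above, with [yourChoice] standing in for your profile name as before:

[profile yourChoice]
retry_mode = adaptive
max_attempts = 100
s3 =
  multipart_threshold = 4GB
  max_concurrent_requests = 1
  multipart_chunksize = 32MB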

Then, in your actual script for fetching:

aws s3 cp s3://commoncrawl/... ... --profile yourChoice --only-show-errors
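
If you'd rather drive the fetching from Python than from the CLI,
here's a minimal sketch of the same two-serial-thread setup using
boto3, with the retry and transfer settings above; the "warc.paths"
key list and the local file naming are placeholders, not part of the
setup:

import boto3
from botocore.config import Config
from boto3.s3.transfer import TransferConfig
from concurrent.futures import ThreadPoolExecutor

# Same retry settings as above: adaptive mode, up to 100 attempts.
s3 = boto3.client("s3", config=Config(
    retries={"mode": "adaptive", "max_attempts": 100}))

# Same transfer settings: one request at a time, 4GB multipart
# threshold, 32MB chunks.
xfer = TransferConfig(multipart_threshold=4 * 1024**3,
                      max_concurrency=1,
                      multipart_chunksize=32 * 1024**2)

def fetch(key):
    # One file at a time per thread, i.e. a serial thread.
    s3.download_file("commoncrawl", key, key.rsplit("/", 1)[-1],
                     Config=xfer)

# warc.paths: one S3 key per line (placeholder name).
with open("warc.paths") as f:
    keys = [line.strip() for line in f]

# Two simultaneous serial threads, as in the timings below.
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(fetch, keys))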

If you're not interested in more detailed stats, look away now...

Timing details in seconds, run between 0927 and 1713 GMT on 25 October.

            n    min   max   mean    std. dev.
Thread 1:   227   49   1106  120.0   123.6
Thread 2:   227   50    686  123.3   114.4

20-bucket histograms:

Thread 1
75.425 155 ****************************************************************
128.275 21 ********
181.125 14 *****
233.975 15 ******
286.825 9 ***
339.675 5 **
392.525 2 *
445.375 2 *
498.225 1 *
551.075 1 *
603.925 0
656.775 0
709.625 0
762.475 0
815.325 0
868.175 1 *
921.025 0
973.875 0
1026.725 0
1079.575 1 *

Thread 2
65.900 137 ****************************************************************
97.700 19 ********
129.500 17 *******
161.300 10 ****
193.100 6 **
224.900 7 ***
256.700 4 *
288.500 9 ****
320.300 5 **
352.100 1 *
383.900 3 *
415.700 1 *
447.500 2 *
479.300 0
511.100 1 *
542.900 1 *
574.700 3 *
606.500 0
638.300 0
670.100 1 *

According to my notes, running 12 parallel threads with no
parameterisation about 2 years ago, it took 4 days to download
CC-MAIN-2017-30, 72000 files, which suggests the average per file
download time was around a minute (6000 files per thread, 1500 each
day, 62 each hour).

So as the above histograms show, although the minimum and mode are
still close to that, the distribution, and consequently the mean and
total throughput per thread, are now much worse.

I'd like to get all of 2023-40, but I'll wait a while to see if the
contention gets managed better...

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]