I can now report modest success running two simultaneous serial
threads: averaging 2 minutes per warc.gz file from CC-MAIN-2023-40,
they downloaded half a segment (456 files, 228 by each thread) in just
under 8 hours.
Fastest single download was 50 seconds in one thread, 49 in the other.
I'm guessing, since no-one else is reporting any success, that what I'm
seeing depends crucially on the settings reported in my previous
message:
aws configure --profile [yourChoice] set s3.multipart_threshold 4GB
aws configure --profile [yourChoice] set s3.max_concurrent_requests 1
aws configure --profile [yourChoice] set s3.multipart_chunksize 32MB
aws configure --profile [yourChoice] set retry_mode adaptive
aws configure --profile [yourChoice] set max_attempts 100
Then, in your actual script for fetching:
aws s3 cp s3://commoncrawl/... ... --profile yourChoice --only-show-errors
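For anyone wanting to reproduce this, here is a minimal sketch of what one such serial thread might look like. It is not the script I actually ran: the list-file layout (one S3 key per line), the log format, the function names and the FETCH override for dry runs are all assumptions made for illustration. Only the aws s3 cp invocation comes from the line above.

```shell
#!/usr/bin/env bash
# Sketch only: one serial "thread" that fetches each key in a list, one at a
# time, and logs the elapsed wall-clock seconds per file.
# Hypothetical names: fetch_one, fetch_serial, FETCH, and the file layout.

# Default fetch: the aws s3 cp call from the message, profile "yourChoice".
fetch_one() {
  aws s3 cp "s3://commoncrawl/$1" "$2" --profile yourChoice --only-show-errors
}

# fetch_serial LIST DEST LOG
#   LIST: text file with one S3 key per line
#   DEST: local directory to download into
#   LOG:  tab-separated output, "key<TAB>seconds" or "key<TAB>FAILED"
fetch_serial() {
  local list=$1 dest=$2 log=$3 key start end
  mkdir -p "$dest"
  while IFS= read -r key; do
    [ -n "$key" ] || continue                 # skip blank lines
    start=$(date +%s)
    # </dev/null stops the fetch command from eating the rest of the list.
    if "${FETCH:-fetch_one}" "$key" "$dest/${key##*/}" </dev/null; then
      end=$(date +%s)
      printf '%s\t%d\n' "$key" "$((end - start))" >> "$log"
    else
      printf '%s\tFAILED\n' "$key" >> "$log"
    fi
  done < "$list"
}

# Two simultaneous serial threads would then be something like:
#   fetch_serial half1.txt data thread1.tsv &
#   fetch_serial half2.txt data thread2.tsv &
#   wait
```

Keeping each thread strictly serial matches the s3.max_concurrent_requests 1 setting above: at most two requests are in flight at any time.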
If you're not interested in more detailed stats, look away now...
Timing details in seconds, run between 0927 and 1713 GMT on 25 October.
             n   min    max   mean   std. dev.
Thread 1:  227    49   1106  120.0       123.6
Thread 2:  227    50    686  123.3       114.4
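A summary line of this shape can be computed from a plain list of per-file times with a few lines of awk. This is a sketch under assumptions: the input is one time in seconds per line, the function name summarise is made up, and it uses the sample (n-1) standard deviation, which may or may not be the variant behind the figures above.

```shell
# summarise: read one download time (seconds) per line on stdin and print
# "n min max mean sd" on one line. sd is the sample standard deviation
# (divides by n-1); that choice is an assumption.
summarise() {
  awk '
    NF == 0 { next }                          # ignore blank lines
    {
      x = $1 + 0
      if (n == 0 || x < min) min = x
      if (n == 0 || x > max) max = x
      n++; sum += x; sumsq += x * x
    }
    END {
      if (n == 0) { print "no data"; exit 1 }
      mean = sum / n
      var = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
      if (var < 0) var = 0                    # guard against rounding error
      printf "%d %d %d %.1f %.1f\n", n, min, max, mean, sqrt(var)
    }'
}

# Example: printf '2\n4\n4\n4\n5\n5\n7\n9\n' | summarise
```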
20-bucket histograms:
Thread 1
75.425 155 ****************************************************************
128.275 21 ********
181.125 14 *****
233.975 15 ******
286.825 9 ***
339.675 5 **
392.525 2 *
445.375 2 *
498.225 1 *
551.075 1 *
603.925 0
656.775 0
709.625 0
762.475 0
815.325 0
868.175 1 *
921.025 0
973.875 0
1026.725 0
1079.575 1 *
Thread 2
65.900 137 ****************************************************************
97.700 19 ********
129.500 17 *******
161.300 10 ****
193.100 6 **
224.900 7 ***
256.700 4 *
288.500 9 ****
320.300 5 **
352.100 1 *
383.900 3 *
415.700 1 *
447.500 2 *
479.300 0
511.100 1 *
542.900 1 *
574.700 3 *
606.500 0
638.300 0
670.100 1 *
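For reading the histograms: the left-hand labels are consistent with bucket midpoints, each bucket being (max - min)/20 wide. For Thread 1 that is (1106 - 49)/20 = 52.85, so the first midpoint is 49 + 26.425 = 75.425. The bars appear to be scaled so that the fullest bucket gets 64 stars, rounded down, with at least one star for any non-empty bucket. A sketch that produces the same layout from one time per line follows; the function name and its two optional arguments are assumptions, and the scaling rule is inferred from the output above rather than taken from the tool that produced it.

```shell
# histogram [BUCKETS] [MAXSTARS]: read one time (seconds) per line on stdin
# and print "midpoint count bar" for each equal-width bucket.
# Defaults (20 buckets, 64 stars for the fullest) mirror the output above.
histogram() {
  awk -v nb="${1:-20}" -v wmax="${2:-64}" '
    NF == 0 { next }
    {
      x[++n] = $1 + 0
      if (n == 1 || x[n] < min) min = x[n]
      if (n == 1 || x[n] > max) max = x[n]
    }
    END {
      if (n == 0) exit 1
      width = (max - min) / nb
      if (width == 0) width = 1               # all values identical
      for (i = 1; i <= n; i++) {
        b = int((x[i] - min) / width)
        if (b >= nb) b = nb - 1               # the maximum goes in the last bucket
        count[b]++
      }
      peak = 0
      for (b = 0; b < nb; b++) if (count[b] + 0 > peak) peak = count[b]
      for (b = 0; b < nb; b++) {
        c = count[b] + 0
        stars = int(c * wmax / peak)          # round down
        if (c > 0 && stars < 1) stars = 1     # but never hide a non-empty bucket
        bar = ""
        for (s = 0; s < stars; s++) bar = bar "*"
        if (bar == "") printf "%.3f %d\n", min + (b + 0.5) * width, c
        else           printf "%.3f %d %s\n", min + (b + 0.5) * width, c, bar
      }
    }'
}
```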
According to my notes, when I ran 12 parallel threads with no
parameterisation about 2 years ago, it took 4 days to download
CC-MAIN-2017-30 (72000 files), which suggests the average per-file
download time was around a minute (6000 files per thread, 1500 each
day, 62 each hour).
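Spelling that arithmetic out, as a sketch using only the figures quoted in this message (the variable names are just for illustration):

```shell
# Back-of-envelope throughput for the CC-MAIN-2017-30 download.
files=72000
threads=12
days=4

per_thread=$((files / threads))        # 6000 files per thread
per_day=$((per_thread / days))         # 1500 files per thread per day
per_hour=$((per_day / 24))             # 62 (integer part of 62.5)

# Seconds per file per thread: 4 days of wall-clock time over 6000 files.
secs_then=$(awk -v d="$days" -v n="$per_thread" \
  'BEGIN { printf "%.1f", d * 86400 / n }')

# For comparison: files per hour per thread at the Thread 1 mean of 120.0 s.
per_hour_now=$(awk 'BEGIN { printf "%d", 3600 / 120.0 }')

echo "then: $per_hour files/hour/thread, $secs_then s/file"
echo "now:  $per_hour_now files/hour/thread"
```

So the old run managed a little under a minute per file per thread, against a mean of about two minutes now.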
So, as the above histograms show, although the minimum, average and
mode are still close to that, the distribution, and consequently the
mean and total throughput per thread, are now much worse.
I'd like to get all of 2023-40, but I'll wait a while to see if the
contention gets managed better...
ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]