Cloudfront support - HTTP/2 download


brano199

Aug 31, 2017, 9:11:05 AM
to Common Crawl
Hello,

I have been trying to work out how to download a large amount of data from the S3 server, but it doesn't seem to support HTTP/2.

Since I need to request a lot of chunks, I have grouped the chunks into bigger ones, so there are at least as many requests as there are files to read from. However, I have trouble re-using the same connection for all of those requests so that I don't need to do a handshake every time.
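The grouping step described above can be sketched roughly like this. It is only an illustration, not the code actually used: the function names and the max_gap parameter are made up for the example, and ranges are assumed to be inclusive (first_byte, last_byte) pairs within one file, as in curl's -r option.

```python
from typing import List, Tuple

# Inclusive byte range within a single file: (first_byte, last_byte).
Range = Tuple[int, int]


def coalesce_ranges(ranges: List[Range], max_gap: int = 0) -> List[Range]:
    """Merge ranges that overlap, touch, or lie within max_gap bytes of each other.

    max_gap is a hypothetical tuning knob: a larger value means fewer requests
    at the cost of downloading some bytes that are not needed.
    """
    merged: List[Range] = []
    for first, last in sorted(ranges):
        if merged and first <= merged[-1][1] + 1 + max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate request.
            prev_first, prev_last = merged[-1]
            merged[-1] = (prev_first, max(prev_last, last))
        else:
            merged.append((first, last))
    return merged


def range_header(first: int, last: int) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={first}-{last}"
```

For example, coalesce_ranges([(0, 9), (10, 19), (100, 109)]) gives two ranges, so two requests instead of three for that file.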

I would like to use HTTP/2 multiplexing, because HTTP/1.1 blocks the connection until I get the last chunk. Amazon CloudFront says it supports HTTP/2, so is there some other way I could download Common Crawl data from there?

If I use http://, it works, but I can't re-use the connection, because the connection gets closed, and it is not using HTTP/2 anyway.

curl --verbose --http2 -r 822555329-822557378        -O        http://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-04/segments/1484560279657.18/warc/CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 54.231.72.99...
* TCP_NODELAY set
* Connected to commoncrawl.s3.amazonaws.com (54.231.72.99) port 80 (#0)
> GET /crawl-data/CC-MAIN-2017-04/segments/1484560279657.18/warc/CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz HTTP/1.1
> Host: commoncrawl.s3.amazonaws.com
> Range: bytes=822555329-822557378
> User-Agent: curl/7.55.1
> Accept: */*
> Connection: Upgrade, HTTP2-Settings
> Upgrade: h2c
> HTTP2-Settings: AAMAAABkAARAAAAAAAIAAAAA

< HTTP/1.1 206 Partial Content
< x-amz-id-2: 9lgvs/cWfLu/hRZqCE95N/K5Z2em2Y97H+e6lxgm7hiUNKRFRs0jpXvX7MPuQLmXKNqmX4SF2S8=
< x-amz-request-id: 5317AB8E0FEBF542
< Date: Thu, 31 Aug 2017 13:02:35 GMT
< Last-Modified: Wed, 25 Jan 2017 09:48:01 GMT
< ETag: "757d2cf5892fdfb5015d5aec97736484"
< Accept-Ranges: bytes
< Content-Range: bytes 822555329-822557378/1012443882
< Content-Type: application/octet-stream
< Content-Length: 2050
< Server: AmazonS3
< Connection: close

{ [2050 bytes data]
100  2050  100  2050    0     0   2050      0  0:00:01 --:--:--  0:00:01  4756
* Closing connection 0

The only way to get it working is to use https, but then it is not using HTTP/2 either; it falls back to HTTP/1.1.


curl --verbose --http2 -r 822555329-822557378        -O        https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-04/segments/1484560279657.18/warc/CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 52.216.81.232...
* TCP_NODELAY set
* Connected to commoncrawl.s3.amazonaws.com (52.216.81.232) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: none
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.2 (IN), TLS handshake, Server hello (2):
{ [87 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [2514 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [333 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
{ [1 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: C=US; ST=Washington; L=Seattle; O=Amazon.com Inc.; CN=*.s3.amazonaws.com
*  start date: Jul 29 00:00:00 2016 GMT
*  expire date: Nov 29 12:00:00 2017 GMT
*  subjectAltName: host "commoncrawl.s3.amazonaws.com" matched cert's "*.s3.amazonaws.com"
*  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert Baltimore CA-2 G2
*  SSL certificate verify ok.
} [5 bytes data]
> GET /crawl-data/CC-MAIN-2017-04/segments/1484560279657.18/warc/CC-MAIN-20170116095119-00156-ip-10-171-10-70.ec2.internal.warc.gz HTTP/1.1
> Host: commoncrawl.s3.amazonaws.com
> Range: bytes=822555329-822557378
> User-Agent: curl/7.55.1
> Accept: */*

{ [5 bytes data]
< HTTP/1.1 206 Partial Content
< x-amz-id-2: /kg+/+2HlaZf6CiJo/xuDynJ5erfJNnyJZ5g4ShaGn5PC1PWvTJ1GVhKLzILR5feGkbpl6Hho7Y=
< x-amz-request-id: F7C218500628B7BF
< Date: Thu, 31 Aug 2017 13:04:07 GMT
< Last-Modified: Wed, 25 Jan 2017 09:48:01 GMT
< ETag: "757d2cf5892fdfb5015d5aec97736484"
< Accept-Ranges: bytes
< Content-Range: bytes 822555329-822557378/1012443882
< Content-Type: application/octet-stream
< Content-Length: 2050
< Server: AmazonS3

  0  2050    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0{ [5 bytes data]
100  2050  100  2050    0     0   2050      0  0:00:01 --:--:--  0:00:01  3193
* Connection #0 to host commoncrawl.s3.amazonaws.com left intact

brano199

Sep 1, 2017, 6:39:44 AM
to Common Crawl
I have found out that AWS supports many concurrent requests, so following their guidelines (http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html), I have settled on 300 socket connections which I constantly re-use. This uses all of my download speed, ~30 MB/s.

I have found the solution after all :)
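For anyone wanting to try the same approach, here is a minimal sketch of a pool of re-used HTTP/1.1 connections, using only the Python standard library. It is not the actual client from this thread (the post does not say which one was used): RangeFetcher and its parameters are hypothetical names, the host and the figure of 300 workers are taken from the messages above, and each worker thread keeps one keep-alive HTTPS connection open across requests.

```python
import http.client
import threading
from concurrent.futures import ThreadPoolExecutor


class RangeFetcher:
    """Fetch byte ranges with a fixed number of re-used connections.

    Hypothetical helper for illustration. `connect` is a factory returning an
    object with the http.client.HTTPSConnection interface.
    """

    def __init__(self, host, workers=300, connect=None):
        self._connect = connect or (
            lambda: http.client.HTTPSConnection(host, timeout=60)
        )
        self._workers = workers
        self._local = threading.local()  # one connection per worker thread

    def _connection(self):
        conn = getattr(self._local, "conn", None)
        if conn is None:
            conn = self._connect()
            self._local.conn = conn
        return conn

    def _fetch(self, job):
        path, first, last = job  # inclusive byte range, as with curl -r
        headers = {"Range": f"bytes={first}-{last}"}
        for attempt in range(2):
            conn = self._connection()
            try:
                conn.request("GET", path, headers=headers)
                resp = conn.getresponse()
                body = resp.read()  # drain the body so the socket can be re-used
            except (http.client.HTTPException, OSError):
                # The server may have dropped an idle connection: reopen once.
                conn.close()
                self._local.conn = None
                if attempt == 1:
                    raise
                continue
            if resp.status != 206:
                raise RuntimeError(f"unexpected status {resp.status} for {path}")
            return body

    def fetch_all(self, jobs):
        """Return the bodies for all (path, first, last) jobs, in input order."""
        with ThreadPoolExecutor(max_workers=self._workers) as pool:
            return list(pool.map(self._fetch, jobs))
```

Usage would be something like RangeFetcher("commoncrawl.s3.amazonaws.com").fetch_all(jobs), where each job is a (path, first_byte, last_byte) tuple. With HTTP/1.1 each connection still handles one request at a time, so the parallelism comes from the number of connections rather than from multiplexing.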