Hi Colin, hi Gordon,
> I know it's possible, but is it
> acceptable to make tens of millions of requests?
> I wanted to verify I wasn't just shifting costs from
> myself onto CC before going down that path.
Thanks for asking. All costs for hosting the data and for
data transfer are paid by Amazon as part of the Open Data program [1,2].
Note that some other data set buckets use a "requester pays" policy.
Indeed, fetching 25 million WARC records would cost $10 according to [3] ("$0.0004 per 1,000
requests"). However, 25 million WARC records also occupy about 500 GB of storage (the March 2019
crawl comprises 50 TiB of WARC files containing 2.55 billion page captures); at $0.021 per GB
(S3 Standard Storage, Over 500 TB / Month) [3], hosting those 500 GB likewise costs about $10 per
month. Data that just sits unused is of no value, so I would definitely encourage you to use it.
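For reference, the arithmetic above can be sketched as follows (prices taken from the quoted S3 tiers; verify current numbers against [3]):

```python
# Back-of-the-envelope: request cost vs. monthly storage cost for
# 25 million WARC records, using the figures quoted above.

requests_n = 25_000_000
price_per_1000_requests = 0.0004            # USD, S3 GET requests
request_cost = requests_n / 1000 * price_per_1000_requests

warc_total_bytes = 50 * 2**40               # ~50 TiB WARC, March 2019 crawl
captures = 2_550_000_000                    # page captures in that crawl
bytes_per_record = warc_total_bytes / captures
storage_gb = requests_n * bytes_per_record / 10**9

price_per_gb_month = 0.021                  # USD, S3 Standard, >500 TB tier
storage_cost = storage_gb * price_per_gb_month

print(f"request cost: ${request_cost:.2f}")             # $10.00
print(f"~{storage_gb:.0f} GB, storage: ${storage_cost:.2f}/month")
```

The storage figure lands a little above $10/month; the point is that both numbers are the same order of magnitude.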
> This feels inefficient - 99% of URLs are removed by the URL/language filter.
> I could instead use the CDX or parquet indexes to retrieve the set of pages that meet
> the heuristic, and then make a request for each of them individually.
That's exactly why we provide the indexes, so please use them. It's even better if you can avoid
unnecessary data transfers by fetching only the data you need, especially when accessing the data
remotely.
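As a minimal sketch of the index-based approach: the CDX and columnar indexes give an offset and length for every capture, and fetching a single record is then one HTTP Range request against the WARC file. The key, offset, and length below are placeholders, not a real capture:

```python
# Fetch one WARC record by byte range, using the "offset" and "length"
# fields from the CDX or columnar index.
import urllib.request

def range_header(offset, length):
    """HTTP Range for one record; byte ranges are inclusive, hence -1."""
    return f"bytes={offset}-{offset + length - 1}"

def fetch_record(warc_key, offset, length,
                 base="https://commoncrawl.s3.amazonaws.com/"):
    """Return one raw gzip member (decompress with gzip or warcio)."""
    req = urllib.request.Request(base + warc_key)
    req.add_header("Range", range_header(offset, length))
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Because every record is a standalone gzip member, the returned bytes decompress to exactly one WARC record.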
> I haven't yet done the work to assess if this is a net win in speed/cost from my
> perspective (since, eg, the overhead of making 500x the # of requests may eat up the
> time saved from not transferring/decompressing 99% of the file).
Yes, there is definitely a break-even point where it becomes cheaper and faster
to process the data sequentially and skip over unneeded records. My guess would be
that the break-even is above 1% and may even reach 10% if the data is processed in the us-east-1
region. But I haven't run any experiments to prove this. Also note that an S3 client library with
support for multi-part downloads may send 5-10 million requests to fetch all WARC files of a single
month, given the default chunk size of 8 MB in boto3 and the AWS CLI [4,5].
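For scale, that request count follows from simple division over the March 2019 totals quoted earlier (a lower bound, since each file's last part rounds up separately):

```python
# Rough count of GET requests a multi-part-capable client issues to
# stream a whole month of WARC files at the default 8 MB part size of
# boto3 and the AWS CLI [4,5]. Larger parts mean fewer requests, e.g.
# via boto3's TransferConfig(multipart_chunksize=...).
import math

total_warc_bytes = 50 * 2**40      # ~50 TiB of WARC files (March 2019)
chunk_bytes = 8 * 2**20            # default 8 MB part size

requests = math.ceil(total_warc_bytes / chunk_bytes)
print(f"{requests:,} requests")    # 6,553,600 requests
```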
If I get any information on this I'll share it. If anybody has already run experiments
please let us know!
Best and thanks,
Sebastian
[1] https://aws.amazon.com/opendata/
[2] https://registry.opendata.aws/
[3] https://aws.amazon.com/s3/pricing/?nc=sn&loc=4
[4] https://boto3.amazonaws.com/v1/documentation/api/latest/_modules/boto3/s3/transfer.html
[5] https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#multipart-chunksize
On 4/27/19 1:23 PM, Colin Dellow wrote:
> Hello list,
>
> I have a non-technical question about request rates and limits: I know it's possible, but is it
> acceptable to make tens of millions of requests?
>
> For context, I have some code that classifies domains. To start with, I'm happy to only consider
> English-language home pages of each domain, as determined by a rough heuristic (eg root URL,
> /index.html, /index.htm, etc) plus the pre-existing language classifications. My code currently
> processes entire WARC files, filtering out URLs that don't match this filter before applying the
> more expensive classifier function.
>
> This feels inefficient - 99% of URLs are removed by the URL/language filter. I could instead use the
> CDX or parquet indexes to retrieve the set of pages that meet the heuristic, and then make a request
> for each of them individually.
>
> However, because S3 doesn't support multiple ranges in the Range header, I'd have to make exactly 1
> request per URL, so 20-30 million requests for a single crawl. That amount is almost 1% of the
> requests that were made globally in November 2018 (as detailed by Sebastian here
> <https://groups.google.com/d/msg/common-crawl/qLwByLGRxjQ/wiNU3xyZBwAJ>). At S3's posted rack rates,
> it'd be ~$10 in request costs alone. If I then did this for multiple crawls, or for different
> classifiers, I'd become a noticeable blip in the Common Crawl's usage data, which gives me pause. I
> think Amazon provides free (or subsidized?) hosting as part of the public datasets program, so the
> costs are somewhat irrelevant, but maybe Amazon has given some guidance?
>
> I haven't yet done the work to assess if this is a net win in speed/cost from my perspective (since,
> eg, the overhead of making 500x the # of requests may eat up the time saved from not
> transferring/decompressing 99% of the file). I wanted to verify I wasn't just shifting costs from
> myself onto CC before going down that path.
>
> thanks!
> Colin
>