How many requests is too many?

Colin Dellow

Apr 27, 2019, 7:23:24 AM
to Common Crawl
Hello list,

I have a non-technical question about request rates and limits: I know it's possible, but is it acceptable to make tens of millions of requests?

For context, I have some code that classifies domains. To start with, I'm happy to consider only the English-language home page of each domain, as determined by a rough heuristic (e.g. root URL, /index.html, /index.htm, etc.) plus the pre-existing language classifications. My code currently processes entire WARC files, discarding records that don't match this filter before applying the more expensive classifier function.
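
Roughly, the current approach looks like this (a minimal sketch, assuming the warcio library; classify() stands in for the real, expensive classifier):

from urllib.parse import urlparse
from warcio.archiveiterator import ArchiveIterator

HOME_PATHS = {"", "/", "/index.html", "/index.htm"}

def looks_like_home_page(url):
    return urlparse(url).path in HOME_PATHS

def process_warc(stream, classify):
    # Stream every record of the (gzipped) WARC file, drop ~99% of them via
    # the cheap URL heuristic, and only classify what's left.
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        if url and looks_like_home_page(url):
            classify(url, record.content_stream().read())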

This feels inefficient - 99% of URLs are removed by the URL/language filter. I could instead use the CDX or parquet indexes to retrieve the set of pages that meet the heuristic, and then make a request for each of them individually.

However, because S3 doesn't support multiple ranges in the Range header, I'd have to make exactly one request per URL, so 20-30 million requests for a single crawl. That's almost 1% of the requests made globally in November 2018 (as detailed by Sebastian here: https://groups.google.com/d/msg/common-crawl/qLwByLGRxjQ/wiNU3xyZBwAJ). At S3's posted rates, it'd be ~$10 in request costs alone. If I then did this for multiple crawls, or for different classifiers, I'd become a noticeable blip in Common Crawl's usage data, which gives me pause. I think Amazon provides free (or subsidized?) hosting as part of the public datasets program, so the costs are somewhat irrelevant, but maybe Amazon has given some guidance?
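
Concretely, each fetch would be something like this (a rough sketch using boto3 and the public CDX API at index.commoncrawl.org; field names are as I believe the CDX server returns them):

import gzip
import json

import boto3
import requests

CDX = "https://index.commoncrawl.org/CC-MAIN-2019-13-index"
s3 = boto3.client("s3")

def fetch_record(url):
    # One CDX lookup to locate the record...
    resp = requests.get(CDX, params={"url": url, "output": "json"})
    hit = json.loads(resp.text.splitlines()[0])
    offset, length = int(hit["offset"]), int(hit["length"])
    # ...and one ranged GET per record -- the request that multiplies
    # into 20-30 million per crawl.
    obj = s3.get_object(
        Bucket="commoncrawl",
        Key=hit["filename"],
        Range=f"bytes={offset}-{offset + length - 1}",
    )
    return gzip.decompress(obj["Body"].read())  # a single gzipped WARC record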

I haven't yet done the work to assess whether this is a net win in speed/cost from my perspective (since, e.g., the overhead of making 500x the number of requests may eat up the time saved by not transferring/decompressing 99% of the file). I wanted to verify I wasn't just shifting costs from myself onto CC before going down that path.

thanks!
Colin

Gordon V. Cormack

Apr 27, 2019, 7:44:54 AM
to Common Crawl
I have the same question. I have used 500 concurrent requests to AWS, each fetching 100 fragments of different WARCs (the max that AWS will serve over one HTTP keep-alive connection), for a total of 50,000 pages. That takes less than a minute.
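
In Python terms, the pattern is roughly the following (an illustrative sketch with boto3 and a thread pool, where connection reuse comes from the client's connection pool; the numbers are placeholders):

from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore.config import Config

# One shared client whose pool allows up to 500 concurrent, reusable connections.
s3 = boto3.client("s3", config=Config(max_pool_connections=500))

def fetch(job):
    key, offset, length = job  # (warc_filename, record_offset, record_length)
    obj = s3.get_object(
        Bucket="commoncrawl",
        Key=key,
        Range=f"bytes={offset}-{offset + length - 1}",
    )
    return obj["Body"].read()

def fetch_all(jobs):
    with ThreadPoolExecutor(max_workers=500) as pool:
        return list(pool.map(fetch, jobs))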

I have some fear that if I were to run this continuously, Amazon would be upset with me. But I haven't been able to find any sort of capacity limits or terms of service.

Sebastian Nagel

Apr 29, 2019, 7:19:41 AM
to Common Crawl
Hi Colin, hi Gordon,

> I know it's possible, but is it
> acceptable to make tens of million of requests?

> I wanted to verify I wasn't just shifting costs from
> myself onto CC before going down that path.

Thanks for asking. All costs for hosting the data and for data transfer are paid by Amazon as part of the Open Data program [1,2]. Note that some other dataset buckets use a "requester pays" policy instead.

Indeed, fetching 25 million WARC records would cost $10 according to [3] ("$0.0004 per 1,000
requests"). However, 25 million WARC records occupy about 500 GB of storage (the March 2019 crawl
has 50 TiB of WARC files containing 2.55 billion page captures), and at $0.021 per GB (S3 Standard
storage, over 500 TB / month tier) hosting those 500 GB also costs about $10 per month. Data that
just sits there unused is wasted, so I would definitely encourage you to use it.
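
Spelled out, the arithmetic is simply:

records = 25_000_000
request_cost = records / 1_000 * 0.0004   # GET requests: $10.00
storage_cost = 500 * 0.021                # 500 GB at $0.021/GB: $10.50 per month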

> This feels inefficient - 99% of URLs are removed by the URL/language filter.
> I could instead use the CDX or parquet indexes to retrieve the set of pages that meet
> the heuristic, and then make a request for each of them individually.

That's why we provide the indexes, so please use them. It's even better if you can avoid
unnecessary data transfers by fetching only the data you need, especially when accessing the data remotely.
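
For the home-page plus language use case, a single query against the columnar index (e.g. with Athena over the cc-index table) returns exactly the (warc_filename, warc_record_offset, warc_record_length) triples to fetch. A sketch -- please double-check table and column names against the cc-index schema, and replace the placeholder output bucket:

import boto3

QUERY = """
SELECT url, warc_filename, warc_record_offset, warc_record_length
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2019-13'
  AND subset = 'warc'
  AND url_path IN ('', '/', '/index.html', '/index.htm')
  AND content_languages = 'eng'
"""

athena = boto3.client("athena", region_name="us-east-1")
athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "ccindex"},
    ResultConfiguration={"OutputLocation": "s3://YOUR-BUCKET/athena-results/"},
)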

> I haven't yet done the work to assess if this is a net win in speed/cost from my
> perspective (since, eg, the overhead of making 500x the # of requests may eat up the
> time saved from not transferring/decompressing 99% of the file).

Yes, there is definitely a break-even point where it becomes cheaper and faster
to process the data sequentially and skip over unneeded records. My guess would be
that the break-even is above 1% and may even reach 10% if the data is processed in the us-east-1
region. But I haven't run any experiments to prove this. Also note that an S3 client library with
support for multi-part downloads may send 5-10 million requests to fetch all WARC files of a single
month, given the default chunk size of 8 MB for boto3 and the AWS CLI [4,5].
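
For those who do fetch whole WARC files, that overhead can be reduced by raising the chunk size in boto3's transfer configuration, e.g. (the object key below is a placeholder):

import boto3
from boto3.s3.transfer import TransferConfig

# 256 MB parts instead of the 8 MB default: a ~1 GB WARC file is then
# downloaded in a handful of ranged requests rather than ~130.
config = TransferConfig(multipart_chunksize=256 * 1024 * 1024)

s3 = boto3.client("s3")
s3.download_file(
    "commoncrawl",
    "crawl-data/CC-MAIN-2019-13/segments/.../warc/....warc.gz",  # placeholder key
    "local.warc.gz",
    Config=config,
)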

If I get any information on this I'll share it. If anybody has already run experiments
please let us know!

Best and thanks,
Sebastian


[1] https://aws.amazon.com/opendata/
[2] https://registry.opendata.aws/
[3] https://aws.amazon.com/s3/pricing/?nc=sn&loc=4
[4] https://boto3.amazonaws.com/v1/documentation/api/latest/_modules/boto3/s3/transfer.html
[5] https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#multipart-chunksize

Colin Dellow

Apr 29, 2019, 11:47:45 AM
to Common Crawl
Great, thank you!

I ran a crude test with bash and a bunch of forked curl processes (so no benefit from reusing connections), and it convinced me that making individual requests is definitely more efficient for my use case of fetching ~1% of the records.

It looks like that approach would hold up until you need about 5% of the file.

Sebastian Nagel

Apr 29, 2019, 11:53:03 AM
to Common Crawl
Thanks for the information, Colin!

Just for clarification: was the test run on an EC2 instance in the us-east-1 region?

Colin Dellow

Apr 29, 2019, 11:59:17 AM
to Common Crawl
Yes, on an a1.large in us-east-1, so it likely understates the benefit: instance sizes smaller than the largest in a family (in this case, an a1.4xlarge) often seem to underperform by more than a straight scaling factor, for whatever reason.

I'll likely implement this properly in my framework within the next month. If I do, I'll provide a more detailed report to the list.
