Bulk download


Bárbara Castro

Dec 6, 2022, 7:42:52 PM
to Common Crawl
Is there a way to bulk download all the pages of a website such as www.bcc.com, without having to make a request per page of the site? The latter approach takes a long time, and once I have downloaded 10,000 documents it throws a timeout error. I have tried many times this way and could never download all the documents I needed.

Sebastian Nagel

Dec 7, 2022, 4:03:35 AM
to common...@googlegroups.com
Hi Bárbara,

do you mean to download the WARC records or only query the URL index
to know which URLs from a site were visited by the crawler?

If a timeout happens (and it is reproducible), it's recommended to
split the request into smaller parts and repeat the parts that failed
on the first attempt.

The CDX server (index.commoncrawl.org) has a pagination API [1].
There are clients (e.g. cdx-toolkit [2]) which handle the iteration
over pages.
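
For illustration, a rough sketch of querying the pagination API directly,
splitting the lookup into pages and retrying the ones that failed (the
crawl ID and URL pattern are only placeholders, not a tested solution):

    import requests

    API = 'https://index.commoncrawl.org/CC-MAIN-2022-49-index'
    params = {'url': 'www.example.com/*', 'output': 'json'}

    # ask the CDX server how many result pages exist for this query
    info = requests.get(API, params={**params, 'showNumPages': 'true'},
                        timeout=60).json()

    lines, failed = [], []
    for page in range(info['pages']):
        try:
            resp = requests.get(API, params={**params, 'page': page}, timeout=60)
            resp.raise_for_status()
            lines.extend(resp.text.splitlines())  # one JSON record per line
        except requests.RequestException:
            failed.append(page)  # repeat these pages in a second pass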

If the bulk lookup includes more than a short list of URLs or host/domain
names, and/or complex filter patterns, using the columnar index may be
more efficient. See [3] for some examples.
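
As a sketch of using the columnar index, assuming the table has been set
up in Amazon Athena as "ccindex"."ccindex" following the cc-index-table
instructions (the host, crawl and output bucket are placeholders):

    import boto3

    athena = boto3.client('athena', region_name='us-east-1')

    # all captures of one host in one crawl, with the WARC coordinates
    # needed to fetch the records later
    query = """
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM "ccindex"."ccindex"
    WHERE crawl = 'CC-MAIN-2022-49'
      AND subset = 'warc'
      AND url_host_name = 'www.example.com'
    """

    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': 'ccindex'},
        ResultConfiguration={'OutputLocation': 's3://YOUR-BUCKET/athena-results/'})
    print(response['QueryExecutionId'])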

If the issue affects a specific method or tool, could you share some
more details about which commands are run? If possible, also the error
message or stack trace. Thanks!

Best,
Sebastian

[1] https://pywb.readthedocs.io/en/latest/manual/cdxserver_api.html#pagination-api
[2] https://pypi.org/project/cdx-toolkit/
[3] https://nbviewer.org/github/commoncrawl/cc-notebooks/blob/main/cc-index-table/bulk-url-lookups-by-table-joins.ipynb

Bárbara Castro

Dec 7, 2022, 9:20:40 AM
to common...@googlegroups.com
Hi Sebastian!

I mean to download all WARC documents associated with the URLs of a specific website at once. I am currently using these tools: cdx-index-client [1] to get the URLs of the WARC documents, and commoncrawl-warc-retrieval [2] to download them.

But this process takes a long time because it makes a request for each URL, and that is where the timeout problem appears. Even if there were no such problem, making one request per URL takes more time than expected. So my question was whether there is a way to download all WARC documents from a specific website at once, without having to make a request per site URL.

I will try to replicate my process one more time so I can send the stack trace for all of them.

Thank you very much, Bárbara

[1] https://github.com/ikreymer/cdx-index-client
[2] https://github.com/lxucs/commoncrawl-warc-retrieval




--
Bárbara Castro Jerez
Computer Engineering Student
Universidad de Chile

Sebastian Nagel

Dec 7, 2022, 10:12:25 AM
to common...@googlegroups.com
Hi Bárbara,

> if there is a way to download all WARC documents from
> a specific website at once without having to make a request per site
> URL.

No. One request is required per WARC record. The S3 API does not support
multiple ranges in range requests.
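
For a single record this boils down to one HTTP range request per
(warc_filename, offset, length) triple taken from the index, roughly
like this (a sketch using warcio, not a drop-in solution):

    import io
    import requests
    from warcio.archiveiterator import ArchiveIterator

    def fetch_warc_record(warc_filename, offset, length):
        """Fetch one WARC record from Common Crawl via a range request."""
        url = 'https://data.commoncrawl.org/' + warc_filename
        rng = 'bytes={}-{}'.format(offset, offset + length - 1)
        resp = requests.get(url, headers={'Range': rng}, timeout=60)
        resp.raise_for_status()
        # the range covers exactly one gzipped WARC record
        for record in ArchiveIterator(io.BytesIO(resp.content)):
            return record.content_stream().read()

    # warc_filename, offset and length come from the CDX or columnar index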

> I will try to replicate my process one more time so I can send the
> stack trace for all of them.

Ok.

Best,
Sebastian


Greg Lindahl

Dec 7, 2022, 5:39:50 PM
to common...@googlegroups.com
Hi Bárbara,

I'm the author of cdx_toolkit, and if you compare the code of
cdx_toolkit with commoncrawl-warc-retrieval, you'll notice that
cdx_toolkit is a much more robust tool. cdx_toolkit would retry that
timeout, for example. But it will still be slow to fetch the WARC
records, because (as Sebastian says) every request does have to be
sent separately.
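
Roughly like this, from memory, so please double-check against the
cdx_toolkit README (the host is just a placeholder):

    import cdx_toolkit

    cdx = cdx_toolkit.CDXFetcher(source='cc')

    # iterate the captures of a host and pull the record contents one
    # by one; retries are handled inside the library
    for obj in cdx.iter('www.example.com/*', limit=10):
        if obj['status'] == '200':
            body = obj.content  # fetches the WARC record for this capture
            print(obj['url'], len(body))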

-- greg