Hi Bárbara,
> if there is a way to download all WARC documents from
> a specific website at once without having to make a request per site
> URL.
No. One request is required per WARC record: the S3 API does not support
multiple byte ranges in a single range request.
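
Fetching a single record is still cheap, though. A minimal sketch in
Python (the filename, offset, and length values are placeholders for
the fields returned by a CDX index lookup):

    import requests

    # Placeholder values; in practice they come from a CDX index lookup.
    filename = "crawl-data/CC-MAIN-2022-49/segments/.../warc/....warc.gz"
    offset, length = 1234567, 4321

    # One record per request: the endpoint accepts a single byte
    # range per GET.
    resp = requests.get(
        "https://data.commoncrawl.org/" + filename,
        headers={"Range": "bytes=%d-%d" % (offset, offset + length - 1)},
        timeout=60)
    resp.raise_for_status()  # expect HTTP 206 Partial Content
    with open("record.warc.gz", "wb") as f:
        f.write(resp.content)  # one gzipped WARC record
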
> I will try to replicate my process one more time so I can send the
> stack trace for all of them.
Ok.
Best,
Sebastian
On 12/7/22 15:20, Bárbara Castro wrote:
> Hi Sebastian!
>
> I mean to download all WARC documents associated with the URLs of a
> specific website at once. I am currently using these tools:
> cdx-index-client [1] to get the URLs of the WARC documents, and
> commoncrawl-warc-retrieval [2] to download them.
>
> But this process takes a long time because it makes a request for each
> URL, and that is where the timeout problem appears. Even if there were
> no such problem, making one request per URL takes more time than
> expected. So my question was whether there is a way to download all
> WARC documents from a specific website at once, without having to make
> a request per site URL.
>
> I will try to replicate my process one more time so I can send the stack
> trace for all of them.
>
> Thank you very much, Bárbara
>
> [1] https://github.com/ikreymer/cdx-index-client
> [2] https://github.com/lxucs/commoncrawl-warc-retrieval
>
> On Wed, Dec 7, 2022 at 6:03, Sebastian Nagel
> (seba...@commoncrawl.org) wrote:
>
> Hi Bárbara,
>
> do you mean to download the WARC records or only query the URL index
> to know which URLs from a site were visited by the crawler?
>
> If a timeout happens (and this is reproducible), it's recommended to
> split the request into smaller parts and retry the parts that failed
> on the first attempt.
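>
> A minimal retry sketch (the function name and retry count are
> arbitrary; requests.get stands in for whatever single request is
> being made):
>
>     import requests
>
>     def fetch_with_retries(url, headers=None, tries=3):
>         # Retry a failing part a few times before giving up; parts
>         # that still fail can be collected and repeated later.
>         for attempt in range(tries):
>             try:
>                 resp = requests.get(url, headers=headers, timeout=60)
>                 resp.raise_for_status()
>                 return resp
>             except requests.exceptions.RequestException:
>                 if attempt == tries - 1:
>                     raise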
>
> The CDX server (index.commoncrawl.org) has a pagination API [1].
> There are clients (e.g. cdx-toolkit [2]) which handle the iteration
> over pages.
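>
> For example, a minimal sketch with cdx-toolkit (the domain pattern
> and limit are placeholders):
>
>     import cdx_toolkit
>
>     # Iterate over the Common Crawl index for one domain; the
>     # client follows the pagination API page by page.
>     cdx = cdx_toolkit.CDXFetcher(source='cc')
>     for obj in cdx.iter('example.com/*', limit=50):
>         print(obj['url'], obj['status'])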
>
> If the bulk lookup includes more than a short list of URLs or
> host/domain names, and/or complex filter patterns, using the columnar
> index may be more efficient. See [3] for some examples.
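>
> As a sketch, such a lookup could be run via Amazon Athena (the crawl
> ID, domain, and result bucket below are placeholders; the ccindex
> table must be registered in Athena first, as described in the
> cc-index-table documentation):
>
>     import boto3
>
>     # Select the WARC record locations for one domain from the
>     # columnar index; results are written to the given S3 bucket.
>     query = """
>     SELECT url, warc_filename, warc_record_offset, warc_record_length
>     FROM "ccindex"."ccindex"
>     WHERE crawl = 'CC-MAIN-2022-49'
>       AND subset = 'warc'
>       AND url_host_registered_domain = 'example.com'
>     """
>     athena = boto3.client('athena', region_name='us-east-1')
>     athena.start_query_execution(
>         QueryString=query,
>         ResultConfiguration={'OutputLocation': 's3://my-bucket/results/'})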
>
> If the issue affects a specific method or tool, could you share some
> more details about which commands are run? If possible, also the error
> message or stack trace. Thanks!
>
> Best,
> Sebastian
>
> [1] https://pywb.readthedocs.io/en/latest/manual/cdxserver_api.html#pagination-api
> [2] https://github.com/cocrawler/cdx_toolkit
> [3] https://nbviewer.org/github/commoncrawl/cc-notebooks/blob/main/cc-index-table/bulk-url-lookups-by-table-joins.ipynb
>
> On 12/7/22 01:42, Bárbara Castro wrote:
> > Is there a way to bulk download all the pages of a website such as
> > www.bcc.com, and not have to make a request per page of the site?
> > The latter takes a long time, and once I have downloaded 10,000
> > documents it throws a timeout error. I tried many times like this
> > and I could never download all the documents I needed.
> >
>
> --
> Bárbara Castro Jerez
> Computer Science Engineering student
> *Universidad de Chile*
>