Hi Davood,
the code to fetch the WARC records looks very reasonable.
A few ways to speed it up:
- try concurrent requests (this should be possible up to a
certain limit)
- maybe shuffle the list of WARC records beforehand, so that
different path prefixes are intermixed
- of course, most effective would be to move the computation
closer to the data, ideally into AWS us-east-1
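
The first two points could be sketched roughly as below. This is only a minimal sketch, not your actual code: it assumes each index record carries the fields "filename", "offset" and "length", that the data is reachable under https://data.commoncrawl.org/, and the names fetch_record, fetch_all and max_workers are made up for illustration.

```python
import random
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: HTTPS endpoint serving the Common Crawl data files.
BASE_URL = "https://data.commoncrawl.org/"


def byte_range(offset, length):
    """Build the HTTP Range header value covering exactly one WARC record."""
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(record):
    """Fetch one (still gzipped) WARC record via an HTTP range request."""
    offset, length = int(record["offset"]), int(record["length"])
    request = urllib.request.Request(
        BASE_URL + record["filename"],
        headers={"Range": byte_range(offset, length)},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()


def fetch_all(records, fetch=fetch_record, max_workers=8, seed=None):
    """Shuffle the records, then fetch them with a bounded thread pool.

    Shuffling intermixes the path prefixes; max_workers caps the number
    of requests in flight (hypothetical default, tune it carefully).
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        payloads = list(pool.map(fetch, shuffled))
    return list(zip(shuffled, payloads))
```

The result is a list of (record, payload) pairs in shuffled order. Please keep max_workers moderate and add retries with back-off if requests start to fail.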
One point: depending on whether there are many records for a domain,
you may need to iterate over result pages, see [1].
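
Iterating over the result pages might look like the sketch below. It assumes the index server at https://index.commoncrawl.org/ answers showNumPages=true with a JSON object containing a "pages" field and returns one JSON record per line when output=json is set; the helper names (index_url, iter_index_records, get_text) are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: public Common Crawl index server.
INDEX_SERVER = "https://index.commoncrawl.org/"


def http_get_text(url):
    """Download a URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read().decode("utf-8")


def index_url(crawl, domain, **extra):
    """Build a CDX query URL for one crawl (e.g. '2022-05') and one domain."""
    params = {"url": domain, "matchType": "domain", "output": "json", **extra}
    query = urllib.parse.urlencode(params)
    return f"{INDEX_SERVER}CC-MAIN-{crawl}-index?{query}"


def iter_index_records(crawl, domain, get_text=http_get_text):
    """Yield all index records for a domain, page by page."""
    # First ask how many result pages there are ...
    info = json.loads(get_text(index_url(crawl, domain, showNumPages="true")))
    # ... then request every page (pages are numbered from 0).
    for page in range(info["pages"]):
        body = get_text(index_url(crawl, domain, page=page))
        for line in body.splitlines():
            if line.strip():
                yield json.loads(line)
```

For example, list(iter_index_records("2022-05", "rtl.de")) would collect the records of all pages for one crawl.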
Best,
Sebastian
[1] https://pywb.readthedocs.io/en/latest/manual/cdxserver_api.html#pagination-api
On 8/26/22 16:38, Davood Hadiannejad wrote:
> Hi Sebastian,
> thank you for your reply. I have some domain lists like: domains =
> ["rtl.de", "bunte.de", "gala.de"]. First, I query the index
> corresponding to each crawl listed in index_list = ['2022-05',
> '2021-49', '2021-43', '2021-39', '2021-31'] for the domain like:
> image.png
>
> This returns, at the end, a list of records containing the URLs and
> metadata corresponding to the domain.
> Then I try to download the records from Common Crawl like:
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to common-crawl...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/common-crawl/CAGSSPFaKWS8Dve2oX7730Yw5x7%3D2yPBzE-C%3DUc0xwKpLb%2BrDxA%40mail.gmail.com.