Hi Rens,
> We're doing range requests and they are indeed more efficient than my intuition, nice :)
Thanks for the notice. Indeed, the performance of the S3 servers responding to many small requests is amazing.
> we're doing some parts in Spark
I hope you didn't miss that you can fire the range requests from Spark as well. Examples are available for Python/PySpark [1,2] and Java [3].
Of course, if you have a complex pipeline (or even several) to process the data afterwards, one that needs a lot of tweaking and iteration,
it can be better to store the extracted subset first. Then it doesn't
matter which tool you use.
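Outside Spark, a single record can be fetched the same way with plain Python. Just a sketch, not production code: the function and variable names are mine, the CDX values you pass in come from the index, and it assumes the https://data.commoncrawl.org download endpoint in front of the S3 bucket:

```python
import gzip
import json
import urllib.request

def parse_cdxj(line):
    # A CDX index line looks like: '<SURT key> <timestamp> <JSON payload>'.
    _key, _timestamp, payload = line.split(" ", 2)
    return json.loads(payload)

def record_range(offset, length):
    # Inclusive HTTP byte range covering exactly one gzipped WARC record.
    return "bytes={}-{}".format(offset, offset + length - 1)

def fetch_record(rec):
    # One range request per record; the response body is a small,
    # independently decompressible gzip member.
    req = urllib.request.Request(
        "https://data.commoncrawl.org/" + rec["filename"],
        headers={"Range": record_range(int(rec["offset"]), int(rec["length"]))})
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())
```

The same Range header works from any HTTP or S3 client, e.g. boto3's get_object with its Range parameter, if you prefer to talk to the bucket directly.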
Best,
Sebastian
[1] https://github.com/commoncrawl/cc-pyspark/blob/9b1fe7955caa9bea99d36829f75a5fcbd71f0d9a/sparkcc.py#L375
[2] https://github.com/commoncrawl/cc-pyspark/blob/master/cc_index_word_count.py
[3] https://github.com/commoncrawl/cc-index-table/#export-subsets-of-the-common-crawl-archives
On 2/12/21 12:15 AM, Rens Oliemans wrote:
> Hi Sebastian,
>
> Thanks for the helpful reply! We're doing range requests and they are indeed more efficient than my intuition, nice :)
> We looked into cdx-toolkit in the beginning, but since we're doing some parts in Spark this also came with some disadvantages; right now we're
> doing it more or less manually, but that isn't too complex really.
> We're now well on our way, thanks for CC, it's awesome.
>
> Best,
> Rens
>
> On Thursday, January 28, 2021 at 11:34:39 AM UTC+1 Sebastian Nagel wrote:
>
> Hi Rens,
>
> > it seems really inefficient to ask the s3 server to get me (a part of) a .warc.gz file
>
> No, sending range requests to S3 is actually quite efficient.
>
> Just in case you've missed it: using the WARC file name, record offset and length,
> you can send a range request to pick only the records selected from the index. See:
>
> https://groups.google.com/g/common-crawl/c/iZVW5ai9jQI/m/9RKQll_lAQAJ
>
> https://groups.google.com/g/common-crawl/c/7nVhY9D1qa0/m/QMAUukYsCAAJ
> Use Greg's cdx-toolkit (https://pypi.org/project/cdx-toolkit/) to query the CDX index and write
> all matched records into a WARC file - see the command "warc".
>
> You might also have a look at the columnar index to filter out irrelevant pages already in the query:
>
> https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
> If so, there's a Spark job which extracts records via the columnar index and stores them in WARCs:
>
> https://github.com/commoncrawl/cc-index-table/#export-subsets-of-the-common-crawl-archives
>
>
> Best,
> Sebastian
>
>
> On 1/27/21 9:47 PM, Rens Oliemans wrote:
> > Hi everyone! I have been trying to download WARC files for a lot of web pages from a specific TLD, but am not sure how to proceed.
> > Let's take .fr for example for 2020-50, then my approach was as follows:
> > grep '^fr' cluster-2020-50.idx | cut -f 2 | uniq
> > > cdx-00190.gz, cdx-00191.gz, cdx-00192.gz, cdx-00193.gz, cdx-00194.gz
> >
> > However, when I take a look at these .gz files, I see a lot of domains with the following structure:
> > "urlinfo, mime, url, length, offset, ..., *filename*"
> > where filename is something like "crawl-data/CC-MAIN-2020-50/segments/xxx/warc/CC-MAIN-2020...-...warc.gz"
> >
> > But I want to get /a lot/ of WARC files for a specific domain, so it seems really inefficient to ask the S3 server to get me (a part of) a
> > .warc.gz file for each domain.
> > Rather, I'd like to get a huge .warc.gz file and process it locally, filtering the irrelevant pages. How would I go about doing this?
> >
> >
>