> We're doing range requests and they are indeed more efficient than my intuition, nice :)
Thanks for the notice. Indeed, the performance of the S3 servers responding to many small requests is amazing.
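In case it helps others reading along: a single record can be fetched with one HTTP range request built from the filename, offset and length found in the index. A minimal sketch only (the filename, offset and length below are placeholders, and using the requests and warcio packages is just one possible way):

    import io
    import requests
    from warcio.archiveiterator import ArchiveIterator

    # filename, offset and length as found in the index (placeholders here)
    warc_filename = 'crawl-data/CC-MAIN-2020-50/segments/.../warc/CC-MAIN-....warc.gz'
    offset, length = 1234, 5678

    # one range request returns exactly one gzipped WARC record
    url = 'https://commoncrawl.s3.amazonaws.com/' + warc_filename
    resp = requests.get(url,
                        headers={'Range': 'bytes=%d-%d' % (offset, offset + length - 1)})

    # the response body is a complete gzip member, warcio can read it directly
    for record in ArchiveIterator(io.BytesIO(resp.content)):
        print(record.rec_headers.get_header('WARC-Target-URI'))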
> we're doing some parts in Spark
I hope you didn't miss that you can fire the range requests from Spark as well. Examples are available for Python/PySpark [1,2] and Java.
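Roughly, doing it from PySpark is just a map over the index rows; again only a sketch (the rows, the partition count and the helper below are illustrative, the examples in [1,2] are more complete):

    from pyspark.sql import SparkSession
    import requests

    spark = SparkSession.builder.appName('cc-range-requests').getOrCreate()
    sc = spark.sparkContext

    # (warc_filename, offset, length) tuples taken from the index -- placeholders here
    rows = [('crawl-data/CC-MAIN-2020-50/segments/.../warc/CC-MAIN-....warc.gz', 1234, 5678)]

    def fetch_record(row):
        # fetch one gzipped WARC record via an HTTP range request
        filename, offset, length = row
        url = 'https://commoncrawl.s3.amazonaws.com/' + filename
        r = requests.get(url,
                         headers={'Range': 'bytes=%d-%d' % (offset, offset + length - 1)})
        return r.content

    records = sc.parallelize(rows, 64).map(fetch_record)
    print(records.count())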
Of course, if you have a complex pipeline afterwards to process the data (or even multiple ones) which needs a lot of tweaking
or iteration, it can be better to store the extracted subset first. Then it doesn't matter which tool you use.
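And since the columnar index comes up again below: the (filename, offset, length) triples themselves can be selected with a few lines of PySpark over the Parquet table. A sketch only, assuming the S3A connector and AWS credentials are configured (the output path is arbitrary):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('cc-index-fr').getOrCreate()

    # columnar URL index (Parquet)
    df = spark.read.parquet('s3a://commoncrawl/cc-index/table/cc-main/warc/')

    fr = (df.filter((df.crawl == 'CC-MAIN-2020-50') &
                    (df.subset == 'warc') &
                    (df.url_host_tld == 'fr'))
            .select('url', 'warc_filename', 'warc_record_offset', 'warc_record_length'))

    fr.write.parquet('fr-2020-50-records')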
On 2/12/21 12:15 AM, Rens Oliemans wrote:
> Hi Sebastian,
> Thanks for the helpful reply! We're doing range requests and they are indeed more efficient than my intuition, nice :)
> We looked into CDX-toolkit in the beginning, but since we're doing some parts in Spark this also came with some disadvantages; right now we're
> doing it more or less manually, but that isn't too complex really.
> We're now well on our way, thanks for CC, it's awesome.
> On Thursday, January 28, 2021 at 11:34:39 AM UTC+1 Sebastian Nagel wrote:
> Hi Rens,
> > it seems really inefficient to ask the s3 server to get me (a part of) a .warc.gz file
> No, sending range requests to S3 is actually quite efficient.
> Just in case you've missed it: using the WARC file name, record offset and length,
> you can send a range request to pick only the records selected from the index. See:
> Use Greg's cdx-toolkit (https://pypi.org/project/cdx-toolkit/) to query the CDX index and write
> all matched records into a WARC file - see the command "warc".
> You might also have a look at the columnar index to filter out irrelevant pages already in the query:
> If yes, there's a Spark job which extracts records via the columnar index and stores them in WARCs:
> On 1/27/21 9:47 PM, Rens Oliemans wrote:
> > Hi everyone! I have been trying to download WARC files for a lot of web pages from a specific TLD, but am not sure how to proceed.
> > Let's take .fr for example for 2020-50, then my approach was as follows:
> > grep '^fr' cluster-2020-50.idx | cut -f 2 | uniq
> > > cdx-00190.gz, cdx-00191.gz, cdx-00192.gz, cdx-00193.gz, cdx-00194.gz
> > However, when I take a look at these .gz files, I see a lot of domains with the following structure:
> > "urlinfo, mime, url, length, offset, ..., *filename*"
> > where filename is something like "crawl-data/CC-MAIN-2020-50/segments/xxx/warc/CC-MAIN-2020...-...warc.gz"
> > But I want to get /a lot/ of WARC files for a specific domain, so it seems really inefficient to ask the s3 server to get me (a part of) a
> > .warc.gz file for each domain.
> > Rather, I'd like to get a huge .warc.gz file and process it locally, filtering the irrelevant pages. How would I go about doing this?