Compression error while copying Common Crawl data


Meghana Agrawal

Aug 13, 2022, 3:12:10 PM
to Common Crawl
Hi CC team,
First of all, thank you for building open-source projects like Common Crawl.
I have a question. Since we get Slow Down errors while downloading data for processing from the commoncrawl bucket, I tried using PySpark s3-dist-cp to copy crawl data to a personal bucket.
But it seems to corrupt the compressed WARC files: it introduces compression errors and asks to recompress the file. Recompressing does make the file usable, but it is a costly process. I tried parameters like disableMultipartUpload and increased the multipart size to 2 GB, but I still get the same error.
Do you have a good suggestion for copying crawl data to a personal bucket?
Meghana

Sebastian Nagel

Aug 14, 2022, 7:23:50 AM
to common...@googlegroups.com
Hi Meghana,

> PySpark s3-dist-cp

You mean EMR S3DistCp [1]?

> But it seems to corrupt the compressed WARC files: it introduces
> compression errors and asks to recompress the file.

Recompressing WARC files using common (not WARC-specific) utilities
is dangerous as it almost certainly breaks the per-record gzip
compression.
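
To illustrate what per-record compression buys you (just a sketch, assuming
the warcio package; the file name is a placeholder): each record is its own
gzip member, so a reader can step through or seek to individual records
without decompressing the whole file.

    # Sketch only: iterate a per-record compressed WARC with warcio.
    from warcio.archiveiterator import ArchiveIterator

    with open('example.warc.gz', 'rb') as stream:   # placeholder file name
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                print(record.rec_headers.get_header('WARC-Target-URI'))

If the file is rewritten as one single gzip stream, this record-wise access
is lost, which is why readers then ask for recompression.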

> Do you have a good suggestion for copying crawl data to a personal
> bucket?

Alternatively, you can always use the good old "hadoop distcp" [2]
to copy data between S3 buckets, or, at a smaller scale, just the AWS CLI
[3]. But maybe there is also an option to disable recompression in S3DistCp.
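
If you prefer to stay in Python, a server-side copy via boto3 is another
option (sketch only; bucket and key names below are placeholders). A
server-side copy moves the object without decompressing it, so the
per-record gzip structure is preserved.

    # Sketch, assuming boto3 is installed; names are placeholders.
    import boto3

    s3 = boto3.resource('s3')
    s3.Bucket('my-personal-bucket').copy(
        {'Bucket': 'commoncrawl', 'Key': 'crawl-data/.../example.warc.gz'},
        'cc-copy/example.warc.gz')   # bytes are copied unchanged within S3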

> we get Slow Down errors while downloading data

I do not know whether "buffering" the data in a personal bucket is the
right approach to avoid such errors. Depending on your use case: what
about splitting the processing pipeline into parts and buffering the
results of a first filter or extraction step? As most use cases
require only parts of the data (only text, links, or some metadata),
this might require significantly less data to buffer. And you can run
the first step steadily without overusing any resources.
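
As a rough sketch of that idea (assuming warcio; is_needed() below is a
hypothetical filter): a first pass keeps only the records you actually
need and writes them, still per-record compressed, into a much smaller
intermediate WARC file.

    # Sketch only: first-pass filter that buffers a reduced WARC file.
    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    def is_needed(record):
        # hypothetical filter, e.g. by URL pattern or a cheap payload check
        return record.rec_type == 'response'

    with open('input.warc.gz', 'rb') as inp, open('filtered.warc.gz', 'wb') as out:
        writer = WARCWriter(out, gzip=True)   # keeps per-record compression
        for record in ArchiveIterator(inp):
            if is_needed(record):
                writer.write_record(record)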

If necessary, please contact me off list with any details you do not
want to share publicly.

A final question: is the processing done in the AWS us-east-1 region,
where the data is located?


Best,
Sebastian


[1] https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
[2] https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html
[3] https://docs.aws.amazon.com/cli/

Meghana Agrawal

Aug 14, 2022, 2:29:47 PM
to Common Crawl
Yes Sebastian, I tried using EMR S3DistCp. I get an error saying that I might need to use warcio recompress <file> to use this file, so it might not be fully corrupted.
I was recompressing using warcio's ArchiveIterator-based recompressor: https://github.com/webrecorder/warcio/blob/master/warcio/recompressor.py

Regarding hadoop distcp: will it give better results if I want to transfer to a personal S3 bucket? And how can it be used from the EMR cluster itself?

Our use case needs the HTML DOM structure to be built, and we are looking for specific text as well, so we need the WARC files only (WAT and WET won't suffice).

Yes, processing is being done in the us-east-1 region only.

Thanks
Meghana

Sebastian Nagel

Aug 16, 2022, 3:03:29 AM
to common...@googlegroups.com
Hi Meghana,

> I get an error saying that I might need to use warcio recompress
> <file> to use this file

Yes, the error is shown because the per-record compression of the WARC
file was lost.

> I was recompressing using warcio's ArchiveIterator-based recompressor.

Again, that sounds very inefficient, and I expect it's much slower than
a plain copy that leaves the compression untouched.
> Our use case needs the HTML DOM structure to be built, and we are
> looking for specific text as well, so we need the WARC files only
> (WAT and WET won't suffice).

Ok, understood. I also assume that building the HTML DOM is a relatively
CPU-intensive task. What is your "throughput", e.g. how many HTML pages
per minute? Is the DOM built from all WARC records or only from a selection?

> Yes, processing is being done in the us-east-1 region only.

Ok. Are you focusing on WARC files of a single crawl only?

Best,
Sebastian

Meghana Agrawal

Aug 16, 2022, 9:05:46 AM
to Common Crawl
Hi Sebastian, 
Yes, you are correct: warcio recompress is very slow. But following your suggestion, I used hadoop distcp and it did not give any compression errors.
I am building the DOM only for selected pages, after checking the content of the HTML, and end up building it for about a third of all pages.
I have a 4-second timeout on the processing of each page.
Mostly, the computation for a page completes within 0.5 seconds, but the timeout avoids runaway computation.
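(Conceptually the timeout works like the sketch below; this is a simplified illustration rather than the actual code, and parse_dom() is just a placeholder for the real parser.)

    # Simplified sketch of a per-page timeout (Unix, main thread only).
    import signal

    class PageTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise PageTimeout()

    signal.signal(signal.SIGALRM, _on_alarm)

    def parse_with_timeout(html, seconds=4):
        signal.alarm(seconds)          # abort pages running longer than 4 s
        try:
            return parse_dom(html)     # parse_dom() is a placeholder
        except PageTimeout:
            return None                # skip the runaway page
        finally:
            signal.alarm(0)            # clear any pending alarm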
Yes, currently focusing on processing pages of one crawl only.
Meghana

Sebastian Nagel

Aug 19, 2022, 3:38:56 AM
to common...@googlegroups.com
Hi Meghana,

> I used hadoop distcp and it did not give any compression errors.

Ok. Great!

> I have a 4-second timeout on the processing of each page.
> Mostly, the computation for a page completes within 0.5 seconds, but
> the timeout avoids runaway computation.

Is a single WARC file processed as a "stream"? If yes, buffering every
WARC file locally (and temporarily) might improve the situation, given
that the processing of one WARC file takes some time and the limiting
factor is the number of open requests to S3. Downloading a WARC file
(about 1 GiB) within us-east-1 to a local file should usually take only
several seconds.
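
Something along these lines (just a sketch, assuming boto3 and warcio;
the key and handle_record() are placeholders):

    # Sketch: buffer one WARC file in a temporary file, then process it locally.
    import tempfile

    import boto3
    from warcio.archiveiterator import ArchiveIterator

    s3 = boto3.client('s3')

    def process_warc(key, handle_record):
        with tempfile.TemporaryFile() as buf:
            s3.download_fileobj('commoncrawl', key, buf)  # one download, up front
            buf.seek(0)
            for record in ArchiveIterator(buf):
                if record.rec_type == 'response':
                    handle_record(record)   # placeholder for the DOM step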

> Yes, currently focusing on processing pages of one crawl only.

Shuffling the input (if possible) might also be worth a try. Then
concurrently processed WARC files share fewer path prefixes.
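
For example (sketch only; warc.paths is the per-crawl listing of WARC paths):

    # Shuffle the WARC path listing before distributing it to the workers.
    import random

    with open('warc.paths') as f:
        warc_paths = [line.strip() for line in f]
    random.shuffle(warc_paths)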

Best,
Sebastian