Hi Meghana,
> Pyspark s3 dist cp
You mean EMR S3DistCp [1]?
> But seems like it makes the compressed warc file corrupt and
> introduces compression errors and asks to recompress the file.
Recompressing WARC files using common (not WARC-specific) utilities
is dangerous as it almost certainly breaks the per-record gzip
compression.
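For context: a .warc.gz file is a concatenation of gzip "members", one
per record, and WARC readers rely on that structure to iterate or seek
individual records. As a minimal illustration (assuming the warcio
package and a locally downloaded WARC file; the filename is a
placeholder):

    from warcio.archiveiterator import ArchiveIterator

    # Iterate a WARC file record by record. This relies on every record
    # being its own gzip member; a file that was decompressed and then
    # recompressed as one single gzip stream loses that structure, and
    # warcio will typically reject it as a non-chunked gzip file.
    with open("example.warc.gz", "rb") as stream:  # placeholder name
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Target-URI"))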
> Do you have some good suggestion to copy crawl data to personal
> bucket.
Alternatively, you can always use the good old "hadoop distcp" [2]
to copy data between S3 buckets, or at a smaller scale just the AWS
CLI [3]. But maybe there is also an option to disable recompression
in S3DistCp.
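If it is only about a handful of files, a plain server-side copy with
boto3 (as an alternative to the CLI) would also work. Just a sketch;
bucket and key names are placeholders:

    import boto3

    s3 = boto3.client("s3")

    # Server-side copy of a single WARC file into a personal bucket:
    # the object is copied unchanged, so no recompression can happen.
    s3.copy(
        CopySource={"Bucket": "commoncrawl",
                    "Key": "crawl-data/.../example.warc.gz"},
        Bucket="my-personal-bucket",
        Key="crawl-data/example.warc.gz",
    )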
> we get Slow Down errors while downloading data
I do not know whether "buffering" the data in a personal bucket is the
right approach to avoid such errors. Depending on your use case: what
about splitting the processing pipeline into parts and buffering only
the results of a first filter or extraction step? Most use cases
require only parts of the data (only text, links or some metadata), so
significantly less data would need to be buffered, and you can run the
first step steadily without overusing any resources.
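To sketch what such a first step could look like (assuming PySpark
with boto3 and warcio available on the executors; bucket names, the
path listing and the extracted fields are placeholders):

    import boto3
    from warcio.archiveiterator import ArchiveIterator
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("warc-extract").getOrCreate()

    def extract_urls(warc_key):
        """Stream one WARC file from S3 and yield only the target URIs."""
        client = boto3.client("s3")
        body = client.get_object(Bucket="commoncrawl", Key=warc_key)["Body"]
        for record in ArchiveIterator(body):
            if record.rec_type == "response":
                yield (warc_key,
                       record.rec_headers.get_header("WARC-Target-URI"))

    # warc_paths.txt: a list of WARC keys to process (placeholder path)
    paths = spark.sparkContext.textFile(
        "s3://my-personal-bucket/warc_paths.txt")
    urls = paths.flatMap(extract_urls)

    # The buffered result (URLs only) is orders of magnitude smaller
    # than the raw WARC data.
    urls.toDF(["warc_file", "url"]) \
        .write.parquet("s3://my-personal-bucket/extracted-urls/")

Any further steps can then read the much smaller Parquet output
instead of going back to the WARC files.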
If necessary, please contact me off list with any details you do not
want to share publicly.
A final question: is the processing done in the AWS us-east-1 region,
where the data is located?
Best,
Sebastian
[1]
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
[2]
https://hadoop.apache.org/docs/stable/hadoop-distcp/DistCp.html
[3]
https://docs.aws.amazon.com/cli/