Hi Meghana, hi Jay,
> I am using minPartitions as 1000 to process only 72 warc files per
> task.
> I have a feeling that spark context is shutting down because of memory
> error.
The reason for the error should be logged in the (driver) logs, see [1,2].
> I was getting error while keeping minPartitions as 400 as well.
> Is there some recommended number?
To avoid compute resources sitting idle, you would typically choose
minPartitions as a multiple of the maximum number of concurrent tasks
your cluster is able to run. Basically, that's
num_executors * cores_per_executor.
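
As a rough illustration of that rule of thumb, here is a minimal sketch that rounds a desired partition count up to the next multiple of the cluster's task slots. The function name and the example cluster size (16 executors with 4 cores each) are made up for illustration and are not taken from your setup.

```python
import math


def round_up_to_task_slots(desired_partitions, num_executors, cores_per_executor):
    """Round desired_partitions up to a multiple of the concurrent task slots.

    Hypothetical helper: the number of task slots is assumed to be
    num_executors * cores_per_executor, as described above.
    """
    slots = num_executors * cores_per_executor
    # Number of full "waves" of tasks needed to cover the desired partitions.
    waves = max(1, math.ceil(desired_partitions / slots))
    return waves * slots


# Example with an assumed cluster of 16 executors with 4 cores each (64 slots).
min_partitions = round_up_to_task_slots(1000, num_executors=16, cores_per_executor=4)
print(min_partitions)
```

The result could then be passed as minPartitions to whatever reads your input listing, for example sc.textFile(path, minPartitions=min_partitions) in PySpark.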
However, I'd recommend not trying to process the WARC files of an
entire monthly crawl in a single Spark job. Instead, split the data
into 10 parts and write the result of each part as an intermediate
output to S3. Later, combine the parts into your final result. This
has two advantages:
1. Web data is unpredictable and may trigger rare exceptions.
   In my own experience, you think all exceptions are handled, but
   then after a billion records you hit an unexpected one.
   In that case only the current part is lost, not the entire job.
2. This would allow you to use cheap EC2 spot instances.
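
One possible way to do the split, sketched under the assumption that you have the WARC paths of the crawl as a plain Python list: the helper below cuts the list into 10 contiguous parts of nearly equal size. The function name and the placeholder paths are hypothetical. Each part would then be the input of its own Spark job.

```python
def split_into_parts(paths, num_parts=10):
    """Split paths into num_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(paths), num_parts)
    parts = []
    start = 0
    for i in range(num_parts):
        # The first `extra` parts get one additional path each.
        size = base + (1 if i < extra else 0)
        parts.append(paths[start:start + size])
        start += size
    return parts


# Placeholder paths for illustration only, not real WARC locations.
example_paths = ["warc-path-%05d" % i for i in range(72000)]
parts = split_into_parts(example_paths, num_parts=10)
for number, part in enumerate(parts):
    print("part", number, "contains", len(part), "WARC files")
```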
> the maximum limit of a S3 bucket (is it still 5500
> requests/second?).
The limit applies per prefix in a bucket, see [3].
Best,
Sebastian
[1] https://aws.amazon.com/premiumsupport/knowledge-center/spark-driver-logs-emr-cluster/
[2] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html
[3] https://aws.amazon.com/premiumsupport/knowledge-center/s3-503-within-request-rate-prefix/