> If I may suggest, a possibly better (general) approach would be to add
> retries around all S3 operations.
Spark uses the Hadoop file system APIs and, AFAIK, the NativeS3FileSystem
already has retry logic built into it: it reads the "fs.s3.maxRetries" conf
parameter (default of 4) and then wraps its store in a "RetryProxy" class
that retries on IOExceptions and S3 exceptions.
I was looking at the Hadoop 1.0.3 source code; no idea if those semantics
change across versions of Hadoop.
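For what it's worth, here's a rough Scala sketch of that wiring, paraphrased
from memory rather than copied from the Hadoop source (the real
NativeFileSystemStore interface is package-private, so I'm using a stand-in):

  import java.io.IOException
  import java.util.concurrent.TimeUnit
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.retry.{RetryPolicies, RetryProxy}

  object S3RetryExample {
    // Stand-in for the package-private store interface that
    // NativeS3FileSystem actually wraps.
    trait S3Store {
      @throws[IOException]
      def retrieve(key: String): Array[Byte]
    }

    def wrapWithRetries(conf: Configuration, store: S3Store): S3Store = {
      // Same knobs the 1.0.3 code reads: retry up to fs.s3.maxRetries times,
      // sleeping fs.s3.sleepTimeSeconds between attempts.
      val policy = RetryPolicies.retryUpToMaximumCountWithFixedSleep(
        conf.getInt("fs.s3.maxRetries", 4),
        conf.getLong("fs.s3.sleepTimeSeconds", 10),
        TimeUnit.SECONDS)
      // RetryProxy hands back a dynamic proxy that re-invokes the method when
      // it throws; the real code narrows this to IOException / the jets3t
      // S3 exception via RetryPolicies.retryByException.
      RetryProxy.create(classOf[S3Store], store, policy).asInstanceOf[S3Store]
    }
  }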
So you could potentially try increasing fs.s3.maxRetries, though it does seem
odd that you'd hit a timeout even after 4 retries, assuming you're running
in EC2; if you're running outside of AWS I suppose it would be more likely.
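If you want to try bumping it per job rather than in core-site.xml, something
like this should work (the values are just illustrative):

  // Assuming `sc` is your SparkContext; these go into the Hadoop conf that
  // the S3 file system picks up.
  sc.hadoopConfiguration.set("fs.s3.maxRetries", "10")        // default is 4
  sc.hadoopConfiguration.set("fs.s3.sleepTimeSeconds", "10")  // pause between retries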
- Stephen