AWS Batch docker timeouts


stephen mclaughlin

Jul 9, 2018, 12:04:27 PM
to Nextflow
Hi Paolo,

   I ran into an issue with my Nextflow AWS Batch workflow where I was getting errors like this for certain AWS Batch tasks:

 CannotInspectContainerError: Could not transition to inspecting; timed out after waiting 30s

   I was logged onto the instance the jobs were running on when this happened, and the only things running on the system were the aws cli calls (about 17 of them, pulling fairly large files, around 5 GB each, into the Docker containers).  We have 1000 GB of EBS storage mounted, set up as described in the docs here:


  The aws cli is installed on the AMI (as opposed to in the Docker containers).  
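  For reference, with the CLI on the AMI rather than in the containers, Nextflow has to be pointed at it explicitly; a minimal sketch of the relevant nextflow.config, where the queue name, bucket, region and install path are placeholders for our actual values:

    // nextflow.config -- sketch only; queue, bucket, region and CLI path are placeholders
    process.executor  = 'awsbatch'
    process.queue     = 'my-batch-queue'
    workDir           = 's3://my-bucket/work'
    aws.region        = 'us-east-1'
    aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'   // where the aws cli lives on the custom AMI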

  Sure enough, while these 17 aws cli calls were running on the system I could not get basic docker commands to register (e.g. docker ps just stalled), so it seems that AWS Batch is issuing basic docker commands itself and they are timing out.

  I am wondering if anyone has ever run into this issue using AWS Batch with Nextflow and, if so, how the problem was resolved.  There are some environment variables that can tweak the timeout behavior, documented here:


  Specifically, I thought I would adjust ECS_CONTAINER_START_TIMEOUT (default = 3 minutes) and ECS_CONTAINER_STOP_TIMEOUT (default = 30 seconds).  But I'm also wondering whether the aws cli calls really are what is freezing things, and how to prevent docker from grinding to a halt.
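  If I understand the ECS agent docs correctly, these go into /etc/ecs/ecs.config on the custom AMI; a rough sketch, with the values picked arbitrarily, would be something like:

    # /etc/ecs/ecs.config on the custom AMI -- the values here are just examples
    ECS_CONTAINER_START_TIMEOUT=10m   # default 3m: how long the agent waits for a container to start
    ECS_CONTAINER_STOP_TIMEOUT=2m     # default 30s: how long the agent waits for a container to stop

  The agent (or the instance) would then need to be restarted so the new values are picked up.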

Thanks,
Stephen

Francesco Strozzi

Jul 9, 2018, 12:22:59 PM
to next...@googlegroups.com
Hi Stephen,
we are using AWS Batch and Nextflow extensively and I have never encountered this error before. I think you are right: in this case the Batch jobs downloading data from S3 onto the ECS instance are saturating the connection, and the Docker command (inspect) is timing out.
So raising ECS_CONTAINER_STOP_TIMEOUT could be a good idea in this particular case. Are you using the latest ECS AMI version to build the AMI for Batch?
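A quick way to check what the instance is actually running is the ECS agent's local introspection endpoint; a small sketch, assuming you can log onto the ECS instance:

    # run on the ECS instance itself
    curl -s http://localhost:51678/v1/metadata   # reports the agent version, cluster and container instance ARN
    cat /etc/ecs/ecs.config                      # shows any agent configuration overrides currently in place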

Cheers
--
Francesco





Paolo Di Tommaso

Jul 9, 2018, 12:30:29 PM
to nextflow
There's an issue suggesting it's a problem with the ECS agent 



p


stephen mclaughlin

Jul 9, 2018, 2:15:03 PM
to Nextflow
> Are you using the latest ECS AMI version to build the AMI for Batch?

Yes, I updated to the latest, but it broke in roughly the same spot with the same error message; it's fairly reproducible.  I was able to "fix" it by upping the CPUs requested by the processes, and I suspect this works because fewer aws cli processes kick off simultaneously on the same instance.  Still trying to figure out the best long-term fix; I'll try increasing the timeouts next.
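Concretely, the "fix" is just requesting more vCPUs per task so that Batch packs fewer of them onto one instance; a sketch, with the process name, resources, channel and script all made up:

    // sketch only -- process name, resources, channel and script are made up
    process heavy_task {
        // a larger vCPU request means Batch packs fewer of these jobs per instance,
        // so fewer simultaneous aws s3 cp staging calls compete for the same disk
        cpus 8
        memory '16 GB'

        input:
        file big_input from big_inputs_ch

        script:
        """
        run_analysis.sh ${big_input}
        """
    }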

Francesco Strozzi

Jul 9, 2018, 2:34:33 PM
to next...@googlegroups.com
Yes, that can be the quick fix: fewer jobs per instance.

It is interesting that on the AWS side this issue has been closed for two years now. I wonder whether it would be useful to re-open it on the GitHub repository Paolo linked.

Just to give you a rough idea, today I was running a workflow where each job downloaded ~50 GB, with at most 4 jobs on a single large instance (m4.16xlarge). I saw no errors of this type, so I guess it also depends on the network performance of the instance.

Cheers 
Francesco 


stephen mclaughlin

Jul 13, 2018, 3:57:26 PM
to Nextflow
I did end up opening a ticket for this:


The problem is very high volume IO utilization (close to 100%) on the instances where these tasks run.  I was using gp2 volumes.
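You can see it directly on the instance while the staging is going on; for example, assuming the sysstat tools are installed on the AMI:

    # on the ECS instance, while the aws s3 cp staging calls are running
    iostat -xdm 5       # %util on the gp2 volume sits close to 100% during staging
    docker ps           # stalls, or responds very slowly, while the device is saturated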

But I think it's worth mentioning that the jobs never actually got to the point of officially starting.  The high IO was caused by the aws s3 cp commands staging input data into the Docker containers as a precursor to the actual jobs running.  It's even possible that this staging of the input files is more IO intensive than the jobs themselves.

One thing that's kind of annoying about this is that the jobs copy data redundantly; most of them use the same files as inputs.

patrick....@personalis.com

Jan 28, 2019, 5:44:21 PM
to Nextflow
Hey Stephen, did anything ever come of this? I had the exact same issue a few months ago and sort of shelved the project, because there didn't seem to be a solution to launching a large number of containers on a single instance; the IO seemed to cause those timeouts.

rspreafico

Jun 17, 2019, 8:32:41 PM
to Nextflow
Same here (June 2019). Did switching to provisioned IOPS SSD solve it?

Paolo Di Tommaso

Jun 20, 2019, 10:01:52 AM
to nextflow
The latest edge release provides a setting to limit the number of parallel downloads 

it *may* help to prevent this problem 
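Something along these lines in the nextflow.config, assuming the option in question is aws.batch.maxParallelTransfers (the value below is arbitrary; lower means fewer simultaneous aws s3 cp transfers per job):

    // nextflow.config -- assuming the new option is aws.batch.maxParallelTransfers
    aws.batch.maxParallelTransfers = 2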



p
