Hi Paolo,
I ran into an issue with my Nextflow AWS Batch workflow where I was getting errors like this for certain AWS Batch tasks:
CannotInspectContainerError: Could not transition to inspecting; timed out after waiting 30s
I was logged onto the instance that the jobs were running on when this happened, and the only activity on the system was the aws cli calls (about 17 of them, each pulling a fairly large file of ~5 GB into the Docker containers). We have 1000 GB of EBS-mounted storage set up as described in the docs here:
The aws cli is installed on the AMI (as opposed to in the Docker containers).
Sure enough, while these 17 aws cli calls were running I could not get basic docker commands to respond (e.g. docker ps just stalled). So it seems that AWS Batch is issuing basic docker commands of its own (such as inspecting the container) and they are timing out.
I am wondering if anyone has run into this issue using AWS Batch with Nextflow and, if so, how the problem was resolved. There are some environment variables that can tweak the timeout behavior here:
Specifically, I thought I would adjust ECS_CONTAINER_START_TIMEOUT (default = 3 minutes) and ECS_CONTAINER_STOP_TIMEOUT (default = 30 seconds). But I'm also wondering whether the aws cli calls are really what is freezing things and, if so, how to prevent docker from grinding to a halt.
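For concreteness, this is roughly the change I have in mind. It is only a sketch: I am assuming the ECS agent on our AMI reads its settings from /etc/ecs/ecs.config and has to be restarted to pick them up, and the values 10m and 2m are placeholders I have not tested.

```shell
# Sketch of lines to add to /etc/ecs/ecs.config (assumed location of the agent config).
# Both values are untested placeholders, not recommendations.
ECS_CONTAINER_START_TIMEOUT=10m   # default is 3 minutes
ECS_CONTAINER_STOP_TIMEOUT=2m     # default is 30 seconds
```

I am not sure either of these governs the 30s inspect timeout in the error above, so this may not help on its own.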
Thanks,
Stephen