download failed error(104, 'Connection reset by peer')

Doron Shem-Tov

Apr 14, 2019, 9:43:03 AM
to Nextflow
Hello nextflow community,

When running my variant-calling workflow on multiple samples (>100) on AWS Batch, I quite frequently get errors of the form:
download failed: s3://<my-file>.fq.gz to ./<my-file>.fq.gz ("Connection broken: error(104, 'Connection reset by peer')", error(104, 'Connection reset by peer'))

Usually the process succeeds on one of the retries.
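
For context, task-level retries like these can be requested with Nextflow's standard error-handling directives; a minimal nextflow.config sketch, not necessarily the configuration used here:

process {
    errorStrategy = 'retry'   // re-run a task when it fails
    maxRetries    = 3         // give up after three attempts
}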

If I understand correctly, this happens due to some limit on the number of concurrent reads from S3.
What is the correct way to handle this issue?

Thanks in advance!

Doron Shem-Tov

Apr 15, 2019, 11:59:22 AM
to Nextflow
Seems like increasing aws.maxConnections solves the problem.
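
For reference, a minimal nextflow.config sketch of that setting (assuming the aws.client scope; the value here is arbitrary):

aws {
    client {
        maxConnections = 100   // max allowed open HTTP connections (assumed scope and value)
    }
}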

Doron Shem-Tov

Apr 21, 2019, 7:19:15 AM
to Nextflow
The problem still occurs occasionally, even after increasing maxConnections to 10,000.
Does anyone have an idea what the source of this problem is, and how to fix it?
Thanks!

Doron Shem-Tov

May 1, 2019, 2:59:44 AM
to Nextflow
Update on this issue:
1. Increasing maxConnections with the awsbatch backend has no real effect.
2. I contacted AWS support, and they suggested that this is related to network congestion and that we should limit the number of concurrent S3 downloads per machine instance.
    We followed this suggestion (configuring the queue with small machines, so that at most 8 processes run per machine, as sketched below), and it has improved the situation markedly.
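
A sketch of how that per-machine cap can be expressed (the queue name is a placeholder, and the arithmetic assumes 8-vCPU instances in the Batch compute environment):

process {
    executor = 'awsbatch'
    queue    = 'small-instances'   // placeholder: a Batch queue backed by 8-vCPU machines
    cpus     = 1                   // 8 vCPUs / 1 cpu per task => at most 8 concurrent tasks per machine
}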

Paolo Di Tommaso

May 2, 2019, 4:32:12 AM
to nextflow
Interesting, we could add a config setting to limit the max number of downloads. 

p


Doron Shem-Tov

May 2, 2019, 4:54:18 AM
to Nextflow
That could be very helpful, Paolo.
I was thinking that changing the maxCpus value in the nxf_parallel function in the .command.run scripts could help reduce the load.

nxf_parallel() {
    local cmd=("$@")
    # detect the number of CPUs; fall back to /proc/cpuinfo when nproc is unavailable
    local cpus=$(nproc 2>/dev/null || < /proc/cpuinfo grep '^process' -c)
    # cap concurrency at 16 jobs, or at the CPU count if that is lower
    local max=$(if (( cpus>16 )); then echo 16; else echo $cpus; fi)
    local i=0
    local pid=()
    (
    set +u
    while ((i<${#cmd[@]})); do
        # prune PIDs of background jobs that have already finished
        local copy=()
        for x in "${pid[@]}"; do
          [[ -e /proc/$x ]] && copy+=($x)
        done
        pid=("${copy[@]}")

        if ((${#pid[@]}>=$max)); then
          # at capacity: poll again in a second
          sleep 1
        else
          # launch the next command in the background and record its PID
          eval "${cmd[$i]}" &
          pid+=($!)
          ((i+=1))
        fi
    done
    # wait for any remaining background jobs
    ((${#pid[@]}>0)) && wait ${pid[@]}
    )
}

Is this the value you said you could make configurable?

Thanks!
Doron


Paolo Di Tommaso

May 2, 2019, 4:55:45 AM
to nextflow
Yes. 
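
A sketch of what such a per-job transfer cap can look like in nextflow.config (in later Nextflow releases this surfaced as aws.batch.maxParallelTransfers; verify the option name against the docs for your version):

aws {
    batch {
        maxParallelTransfers = 4   // limit concurrent S3 upload/download operations per job
    }
}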


