download failed error(104, 'Connection reset by peer')

Doron Shem-Tov

Apr 14, 2019, 9:43:03 AM
to Nextflow
Hello nextflow community,

When running my variant-calling workflow on multiple samples (>100) on AWS Batch, I quite frequently get errors of the form:
download failed: s3://<my-file>.fq.gz to ./<my-file>.fq.gz ("Connection broken: error(104, 'Connection reset by peer')", error(104, 'Connection reset by peer'))

Usually the process succeeds on one of the retries.
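
For context, task-level retries like these can be requested with Nextflow's standard error-handling directives; a minimal nextflow.config sketch, not necessarily the configuration used here:

process {
    errorStrategy = 'retry'   // re-run a task when it fails
    maxRetries    = 3         // give up after three attempts
}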

If I understand correctly, this happens due to some limit on the number of concurrent reads from S3.
What is the correct way to handle this issue?

Thanks in advance!

Doron Shem-Tov

Apr 15, 2019, 11:59:22 AM
to Nextflow
Seems like increasing aws.maxConnections solves the problem.
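
For reference, a minimal nextflow.config sketch of that setting (assuming the aws.client scope; the value here is arbitrary):

aws {
    client {
        maxConnections = 100   // max allowed open HTTP connections (assumed scope and value)
    }
}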

Doron Shem-Tov

Apr 21, 2019, 7:19:15 AM
to Nextflow
The problem still occurs occasionally, even after increasing maxConnections to 10,000.
Does anyone have an idea what the source of this problem is, and how to fix it?
Thanks!

Doron Shem-Tov

May 1, 2019, 2:59:44 AM
to Nextflow
Update on this issue:
1. Increasing maxConnections with the awsbatch backend has no real effect.
2. I contacted AWS support, and they suggested that this is related to network congestion and that we should limit the number of concurrent S3 downloads per machine instance.
    We followed this suggestion (configuring the queue with small machines, so that at most 8 processes run per machine, as sketched below), and it has improved the situation markedly.
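
A sketch of how that per-machine cap can be expressed (the queue name is a placeholder, and the arithmetic assumes 8-vCPU instances in the Batch compute environment):

process {
    executor = 'awsbatch'
    queue    = 'small-instances'   // placeholder: a Batch queue backed by 8-vCPU machines
    cpus     = 1                   // 8 vCPUs / 1 cpu per task => at most 8 concurrent tasks per machine
}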

Paolo Di Tommaso

May 2, 2019, 4:32:12 AM
to nextflow
Interesting, we could add a config setting to limit the max number of downloads. 

p


Doron Shem-Tov

May 2, 2019, 4:54:18 AM
to Nextflow
That could be very helpful, Paolo.
I was thinking that changing the maxCpus value in the nxf_parallel function in the .command.run scripts could help reduce the load.

nxf_parallel() {
    local cmd=("$@")
    # detect the number of CPUs; fall back to /proc/cpuinfo when nproc is unavailable
    local cpus=$(nproc 2>/dev/null || < /proc/cpuinfo grep '^process' -c)
    # cap concurrency at 16 jobs, or at the CPU count if that is lower
    local max=$(if (( cpus>16 )); then echo 16; else echo $cpus; fi)
    local i=0
    local pid=()
    (
    set +u
    while ((i<${#cmd[@]})); do
        # prune PIDs of background jobs that have already finished
        local copy=()
        for x in "${pid[@]}"; do
          [[ -e /proc/$x ]] && copy+=($x)
        done
        pid=("${copy[@]}")

        if ((${#pid[@]}>=$max)); then
          # at capacity: poll again in a second
          sleep 1
        else
          # launch the next command in the background and record its PID
          eval "${cmd[$i]}" &
          pid+=($!)
          ((i+=1))
        fi
    done
    # wait for any remaining background jobs
    ((${#pid[@]}>0)) && wait ${pid[@]}
    )
}

Is this the value you said you could make configurable?

Thanks!
Doron


Paolo Di Tommaso

May 2, 2019, 4:55:45 AM
to nextflow
Yes. 
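
A sketch of what such a per-job transfer cap can look like in nextflow.config (in later Nextflow releases this surfaced as aws.batch.maxParallelTransfers; verify the option name against the docs for your version):

aws {
    batch {
        maxParallelTransfers = 4   // limit concurrent S3 upload/download operations per job
    }
}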


