error (409) copying large file (1.7TB) to azure blob storage using publishDir

Kobe Lavaerts

Jan 28, 2022, 9:53:03 AM
to Nextflow
Hi all,

I was wondering if someone here could help me with a problem I'm having.

We currently run the pipeline locally, but we want to write some of our output files, together with the process .command.* log files and the report, timeline and trace files, to an Azure Blob Storage container.
This works for the most part, except for one file.
We have one process that creates a compressed tar.gz archive of intermediate output files. This is a relatively big file of 1.7 TB. When trying to publish this file to the blob storage, Nextflow throws the following error (found in .nextflow.log):

error:
Jan-27 19:01:49.779 [FileTransfer-thread-67] ERROR c.a.s.common.StorageOutputStream - com.azure.storage.blob.models.BlobStorageException: Status code 409, "<?xml version="1.0" encoding="utf-8"?><Error><Code>BlockCountExceedsLimit</Code><Message>The uncommitted block count cannot exceed the maximum limit of 100,000 blocks.
RequestId:1dbd201e-a01e-0035-6db0-132d0d000000
Time:2022-01-27T19:01:49.7131543Z</Message></Error>"
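
If I read the error correctly, the copy is uploaded in fixed-size blocks and a block blob can only hold so many of them, so the 1.7 TB file may simply need more blocks than allowed. A rough calculation (assuming the client uploads in 4 MB blocks, which I believe is a common SDK default but haven't verified) seems consistent with that:

rough calculation:
# 1.7 TB ≈ 1,700,000 MB; assuming 4 MB per uploaded block (not verified)
echo $(( 1700000 / 4 ))        # ≈ 425,000 blocks needed, far above the 100,000 limit
echo $(( 1700000 / 100000 ))   # 17 MB per block would be the minimum to stay under it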

setup:
- the script is executed locally
- the Azure storage account is set up in nextflow.config
- the publishDir directive is used to send files to the blob container

nextflow.config file:
profiles {
    azure {
        azure {
            storage {
                accountName = params.azstorage
                sasToken = params.azsastoken
            }
        }
        // specify the output folders
        params.tostoredir = "az://" + params.azblob + "/ToStore"
        params.tosenddir = "az://" + params.azblob + "/ToSend"
        params.logdir = params.tostoredir + "/log"
        // Save the run log files in azure blob storage
        report {
            enabled = true
            file = "az://${params.azblob}/ToStore/log/${params.runname}-report.html"
        }
        trace {
            enabled = true
            file = "az://${params.azblob}/ToStore/log/${params.runname}-trace.txt"
        }
        timeline {
            enabled = true
            file = "az://${params.azblob}/ToStore/log/${params.runname}-timeline.html"
        }
    }
}
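
For completeness, we launch the run roughly like this (all values are placeholders and the main script name is just illustrative):

launch command:
nextflow run main.nf -profile azure \
    --azstorage <storage-account-name> \
    --azsastoken '<sas-token>' \
    --azblob <container-name> \
    --runname <run-name> \
    --rawdata </path/to/rawdata>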

nextflow process:
process create_archive {

    publishDir "${params.tosenddir}", pattern: "*.{gz,txt}", mode: "copy"
    publishDir "${params.logdir}/${task.process}/${task.hash}", pattern: ".*", mode: "copy"

    errorStrategy "retry"

    input:
    val(project_name)
    val(demuxdone)

    output:
    tuple val(project_name), file("**")
    path(".*")

    script:
    """
    tar --use-compress-program="pigz -p ${task.cpus}" -cf ${params.runname}-RawData-${project_name}.tar.gz -C ${params.rawdata} ${project_name}
    md5sum ${params.runname}-RawData-${project_name}.tar.gz > ${params.runname}-RawData-${project_name}.md5sum.txt
    """
}

All other output files and log files from other processes are published successfully to the Azure blob container.

We thought the file might simply be too big, but when we upload it manually to the same Azure blob container using azcopy, we don't get this error and the file is uploaded successfully.
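
For reference, the manual upload was done with something like this (from memory, with account, container and SAS token redacted):

azcopy command:
azcopy copy "./<runname>-RawData-<project>.tar.gz" \
    "https://<storage-account>.blob.core.windows.net/<container>/ToSend/<runname>-RawData-<project>.tar.gz?<sas-token>"

As far as I understand, azcopy picks a larger block size for big files on its own (and it can be set explicitly with --block-size-mb), which may be why it doesn't run into the block limit.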

As I'm not clear on the reason for the error, I was hoping someone could help.
