error (409) copying large file (1.7TB) to azure blob storage using publishDir

Kobe Lavaerts

Jan 28, 2022, 9:53:03 AM
to Nextflow
Hi all,

I was wondering if someone here could help me with a problem I'm having.

We currently run the pipeline locally, but we want to write some of our output files, together with the process .command.* log files and the report, timeline and trace files, to an Azure Blob Storage container.
This works for the most part, except for one file.
We have one process that creates a compressed tar.gz archive of intermediate output files. This is a relatively big file of 1.7 TB. When trying to publish this file to the blob storage, Nextflow throws the following error (found in .nextflow.log):

error:
Jan-27 19:01:49.779 [FileTransfer-thread-67] ERROR c.a.s.common.StorageOutputStream - com.azure.storage.blob.models.BlobStorageException: Status code 409, "<?xml version="1.0" encoding="utf-8"?><Error><Code>BlockCountExceedsLimit</Code><Message>The uncommitted block count cannot exceed the maximum limit of 100,000 blocks.
RequestId:1dbd201e-a01e-0035-6db0-132d0d000000
Time:2022-01-27T19:01:49.7131543Z</Message></Error>"
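
If I read the error correctly, the copy is uploaded in fixed-size blocks and a block blob can only hold so many of them, so the 1.7 TB file may simply need more blocks than allowed. A rough calculation (assuming the client uploads in 4 MB blocks, which I believe is a common SDK default but haven't verified) seems consistent with that:

rough calculation:
# 1.7 TB ≈ 1,700,000 MB; assuming 4 MB per uploaded block (not verified)
echo $(( 1700000 / 4 ))        # ≈ 425,000 blocks needed, far above the 100,000 limit
echo $(( 1700000 / 100000 ))   # 17 MB per block would be the minimum to stay under it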

setup:
- the script is executed locally
- the Azure storage account is set up in nextflow.config
- the publishDir directive is used to send files to the blob container

nextflow.config file:
profiles {
    azure {
        azure {
            storage {
                accountName = params.azstorage
                sasToken = params.azsastoken
            }
        }
        // specify the output folders
        params.tostoredir = "az://" + params.azblob + "/ToStore"
        params.tosenddir = "az://" + params.azblob + "/ToSend"
        params.logdir = params.tostoredir + "/log"
        // Save the run log files in azure blob storage
        report {
            enabled = true
            file = "az://${params.azblob}/ToStore/log/${params.runname}-report.html"
        }
        trace {
            enabled = true
            file = "az://${params.azblob}/ToStore/log/${params.runname}-trace.txt"
        }
        timeline {
            enabled = true
            file = "az://${params.azblob}/ToStore/log/${params.runname}-timeline.html"
        }
    }
}
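
For completeness, we launch the run roughly like this (all values are placeholders and the main script name is just illustrative):

launch command:
nextflow run main.nf -profile azure \
    --azstorage <storage-account-name> \
    --azsastoken '<sas-token>' \
    --azblob <container-name> \
    --runname <run-name> \
    --rawdata </path/to/rawdata>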

nextflow process:
process create_archive {

    publishDir "${params.tosenddir}", pattern: "*.{gz,txt}", mode: "copy"
    publishDir "${params.logdir}/${task.process}/${task.hash}", pattern: ".*", mode: "copy"

    errorStrategy "retry"

    input:
    val(project_name)
    val(demuxdone)

    output:
    tuple val(project_name), file("**")
    path(".*")

    script:
    """
    tar --use-compress-program="pigz -p ${task.cpus}" -cf ${params.runname}-RawData-${project_name}.tar.gz -C ${params.rawdata} ${project_name}
    md5sum ${params.runname}-RawData-${project_name}.tar.gz > ${params.runname}-RawData-${project_name}.md5sum.txt
    """
}

All other output files and log files from other processes are published successfully to the Azure blob container.

We thought the file might simply be too big, but when we upload it manually to the same Azure blob container using azcopy, we don't get this error and the file is uploaded successfully.
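
For reference, the manual upload was done with something like this (from memory, with account, container and SAS token redacted):

azcopy command:
azcopy copy "./<runname>-RawData-<project>.tar.gz" \
    "https://<storage-account>.blob.core.windows.net/<container>/ToSend/<runname>-RawData-<project>.tar.gz?<sas-token>"

As far as I understand, azcopy picks a larger block size for big files on its own (and it can be set explicitly with --block-size-mb), which may be why it doesn't run into the block limit.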

As I'm not clear on the reason for the error, I was hoping someone could help.
