Split by chromosome analysis


Lavi Bharath

Apr 17, 2018, 8:07:44 AM
to Nextflow
Hi,
Can anyone share their experience if they have already tried this?
We use AWS Batch with a Docker configuration for this analysis.

I am trying to run GATK HaplotypeCaller per chromosome. The workflow works if I launch one process per chromosome, as below.
sample_bam = Channel
    .from ( samples_keys )
    .map {
    [it, file(params.sample_bam_map[it]) ] }

sample_bai = Channel
    .from ( samples_keys )
    .map {
    [it, file(params.sample_bam_map[it]) + ".bai" ] }


process gatk_hc {
    tag "region $region_no for sample $sample_key"
    cpus 4
    memory '8 GB'
    errorStrategy 'finish'
    input:
        set sample_key, file("${sample_key}.bqsr.bam") from sample_bam
        set sample_key, file("${sample_key}.bqsr.bam.bai") from sample_bai
        each region_list from params['references']['region_clusters']
        file(genome)
        file(genome_index)
        file(genome_dict)
        file(dbsnp)
        file(dbsnp_index)
    output:
        set sample_key, region_no, file("reg-${region_no}.bed"), file("reg-${region_no}.g.vcf.gz"), file("reg-${region_no}.g.vcf.gz.tbi") into region_gvcf_ch1
    script:
        bed_str = region_list.join("\n").replace(":", "\t").replace("-", "\t")
        region_no = generateMD5_A(region_list.toString())[0..8]
        """
        echo "${bed_str}" > reg-${region_no}.bed;
        gatk-launch --java-options "-Xmx8G -XX:ConcGCThreads=${task.cpus} -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=${task.cpus}" HaplotypeCaller -R ${genome} --dbsnp ${dbsnp} -I ${sample_key}.bqsr.bam -L reg-${region_no}.bed --emit-ref-confidence GVCF -O reg-${region_no}.g.vcf.gz
        """
}


In this case every per-chromosome task downloads the whole-genome BAM, and it works perfectly.
To reduce the time spent downloading the entire BAM file from S3 (once per chromosome, so many times over), we tried pulling only the relevant part of the BAM for each chromosome, as below.

sample_bam = Channel
    .from ( samples_keys )
    .map {
    [it, (params.sample_bam_map[it]) ] }

sample_bai = Channel
    .from ( samples_keys )
    .map {
    [it, (params.sample_bam_map[it]) + ".bai" ] }

process gatk_hc {
    tag "region $region_no for sample $sample_key"
    cpus 4
    memory '4 GB'
    errorStrategy 'finish'
    cache 'deep'
    input:
        set sample_key, val(sample) from sample_bam
        set sample_key, val(sample_bai) from sample_bai
        each region_list from params['references']['region_clusters']
        file(genome)
        file(genome_index)
        file(genome_dict)
        file(dbsnp)
        file(dbsnp_index)
    output:
        set sample_key, region_no, file("reg-${region_no}.bed"), file("reg-${region_no}.g.vcf.gz"), file("reg-${region_no}.g.vcf.gz.tbi") into region_gvcf_ch1
    script:
        bed_str = region_list.join("\n").replace(":", "\t").replace("-", "\t")
        region_no = generateMD5_A(region_list.toString())[0..8]
        """
        echo "${bed_str}" > reg-${region_no}.bed;
        samtools view -b -L reg-${region_no}.bed ${sample} > ${sample_key}.bqsr.reg-${region_no}.bam;
        samtools index ${sample_key}.bqsr.reg-${region_no}.bam;
        gatk-launch --java-options "-Xmx8G -XX:ConcGCThreads=${task.cpus} -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=${task.cpus}" HaplotypeCaller -R ${genome} --dbsnp ${dbsnp} -I ${sample_key}.bqsr.reg-${region_no}.bam -L reg-${region_no}.bed --emit-ref-confidence GVCF -O reg-${region_no}.g.vcf.gz
        """
}

In the Nextflow config we tried the following:
docker.enabled = true
docker.runOptions = '-v ~/.aws/credentials:/root/.aws/credentials'

But I get the following error. The AWS credentials are not being passed to the container in the Batch setup.
Failed to open bam file s3://****   permission denied'

The second setup works perfectly fine with the 'local' executor.
Is there a way to make it work with the Batch executor? Many thanks in advance.

Regards
Lavanya



Paolo Di Tommaso

Apr 17, 2018, 12:17:48 PM
to nextflow
It sounds like an IAM configuration problem for the Batch service.

Make sure you are able to download those files using a plain Batch job.


p
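
One way to run the check suggested above is sketched here. It is only an illustration, not part of the original reply: the function name `check_s3_access` and the example S3 path are hypothetical, and it assumes the AWS CLI is available inside the container used by the Batch job. The idea is to submit a plain Batch job (no Nextflow involved) whose command calls this helper on the small `.bai` index file, so that a permission failure can be traced to the job's IAM role rather than to the workflow.

```shell
# Hypothetical helper for a plain AWS Batch job (not from the original thread).
# Usage: check_s3_access <s3-uri> [destination]
check_s3_access() {
    local uri="$1"
    local dest="${2:-/tmp/s3_probe}"

    # Print the IAM identity the container is actually running as.
    aws sts get-caller-identity || return 1

    # Try to download the object; a failure here reproduces the
    # permission problem without Nextflow or samtools in the picture.
    aws s3 cp "$uri" "$dest" || return 1

    echo "OK: $uri is readable from this job"
}
```

For example, the job command could be `check_s3_access s3://my-bucket/sample.bqsr.bam.bai`, where the bucket and key are placeholders. If the download fails here too, the role attached to the Batch compute environment or job most likely lacks read access to the bucket.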
