Split by chromosome analysis

Lavi Bharath

Apr 17, 2018, 8:07:44 AM

to Nextflow

Hi,

Could someone share their experience if they have already tried this?

We use AWS Batch with a Docker configuration for this analysis.

I am trying to run GATK HaplotypeCaller per chromosome. The workflow works if I launch one process per chromosome, as below.

sample_bam = Channel
    .from ( samples_keys )
    .map { [it, file(params.sample_bam_map[it]) ] }

sample_bai = Channel
    .from ( samples_keys )
    .map { [it, file(params.sample_bam_map[it]) + ".bai" ] }

process gatk_hc {
    tag "region $region_no for sample $sample_key"
    cpus 4
    memory '8 GB'
    errorStrategy 'finish'

    input:
    set sample_key, file("${sample_key}.bqsr.bam") from sample_bam
    set sample_key, file("${sample_key}.bqsr.bam.bai") from sample_bai
    each region_list from params['references']['region_clusters']
    file(genome)
    file(genome_index)
    file(genome_dict)
    file(dbsnp)
    file(dbsnp_index)

    output:
    set sample_key, region_no, file("reg-${region_no}.bed"), file("reg-${region_no}.g.vcf.gz"), file("reg-${region_no}.g.vcf.gz.tbi") into region_gvcf_ch1

    script:
    bed_str = region_list.join("\n").replace(":", "\t").replace("-", "\t")
    region_no = generateMD5_A(region_list.toString())[0..8]
    """
    echo "${bed_str}" > reg-${region_no}.bed;
    gatk-launch --java-options "-Xmx8G -XX:ConcGCThreads=${task.cpus} -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=${task.cpus}" HaplotypeCaller -R ${genome} --dbsnp ${dbsnp} -I ${sample_key}.bqsr.bam -L reg-${region_no}.bed --emit-ref-confidence GVCF -O reg-${region_no}.g.vcf.gz
    """
}

In this case the task for every chromosome downloads the whole-genome BAM, and the workflow works perfectly.
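The `bed_str` line in the process above builds the BED file by plain character replacement on each region string. Below is a minimal shell sketch of the same conversion, assuming each region has the form `contig:start-end` with 1-based inclusive coordinates (the usual GATK interval-string convention). The function name `regions_to_bed` is hypothetical. Unlike the replace-based version, it splits only on the last `:` and the `-` that follows it, so contig names containing those characters survive, and it subtracts 1 from the start because BED coordinates are 0-based and half-open.

```shell
# Hypothetical helper: read "contig:start-end" lines on stdin, write BED rows.
# Assumes 1-based inclusive input coordinates.
regions_to_bed() {
    while IFS= read -r region; do
        [ -z "$region" ] && continue
        contig=${region%:*}     # everything before the last ':'
        range=${region##*:}     # "start-end" after the last ':'
        start=${range%-*}
        end=${range#*-}
        # BED start is 0-based, BED end is exclusive (equals the 1-based end).
        printf '%s\t%d\t%d\n' "$contig" "$((start - 1))" "$end"
    done
}

printf 'chr1:1-1000\nchr2:501-2000\n' | regions_to_bed
```

If the region lists already follow BED conventions, the offset should be dropped. The point is only that the two coordinate systems differ by one at the start position.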

To reduce the time spent downloading the entire BAM file from S3 (once for every chromosome), we tried pulling only the relevant part of the BAM for each chromosome, as below.

sample_bam = Channel
    .from ( samples_keys )
    .map { [it, (params.sample_bam_map[it]) ] }

sample_bai = Channel
    .from ( samples_keys )
    .map { [it, (params.sample_bam_map[it]) + ".bai" ] }

process gatk_hc {
    tag "region $region_no for sample $sample_key"
    cpus 4
    memory '4 GB'
    errorStrategy 'finish'
    cache 'deep'

    input:
    set sample_key, val(sample) from sample_bam
    set sample_key, val(sample_bai) from sample_bai
    each region_list from params['references']['region_clusters']
    file(genome)
    file(genome_index)
    file(genome_dict)
    file(dbsnp)
    file(dbsnp_index)

    output:
    set sample_key, region_no, file("reg-${region_no}.bed"), file("reg-${region_no}.g.vcf.gz"), file("reg-${region_no}.g.vcf.gz.tbi") into region_gvcf_ch1

    script:
    bed_str = region_list.join("\n").replace(":", "\t").replace("-", "\t")
    region_no = generateMD5_A(region_list.toString())[0..8]
    """
    echo "${bed_str}" > reg-${region_no}.bed;
    samtools view -b -L reg-${region_no}.bed ${sample} > ${sample_key}.bqsr.reg-${region_no}.bam;
    samtools index ${sample_key}.bqsr.reg-${region_no}.bam;
    gatk-launch --java-options "-Xmx8G -XX:ConcGCThreads=${task.cpus} -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=${task.cpus}" HaplotypeCaller -R ${genome} --dbsnp ${dbsnp} -I ${sample_key}.bqsr.reg-${region_no}.bam -L reg-${region_no}.bed --emit-ref-confidence GVCF -O reg-${region_no}.g.vcf.gz
    """
}

In the Nextflow config, we tried the following.

docker.enabled = true

docker.runOptions = '-v ~/.aws/credentials:/root/.aws/credentials'

But I get the following error. The AWS credentials are not being passed to the container in the Batch setup.

Failed to open bam file s3://**** permission denied'

The second setup works perfectly well with the 'local' executor.

Is there a way to make it work with the Batch executor? Many thanks in advance.

Regards

Lavanya

Paolo Di Tommaso

Apr 17, 2018, 12:17:48 PM

to nextflow

It sounds like an IAM configuration problem for the Batch service.
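One way to narrow this down is to print, from inside the container, which credential source is visible at all. The sketch below is an assumption-laden diagnostic, not part of Nextflow: the function name `credential_source` is hypothetical, and the checks only roughly follow the order in which AWS tooling usually looks for credentials. On Batch the container runs on an EC2 instance managed by the compute environment, so a `~/.aws/credentials` file that exists on a workstation is likely absent there, and access would have to come from an IAM role instead.

```shell
# Hypothetical diagnostic: report the first credential source that looks available.
credential_source() {
    if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        echo "environment variables"
    elif [ -f "${HOME:-/root}/.aws/credentials" ]; then
        echo "shared credentials file"
    elif [ -n "${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI:-}" ]; then
        # Set by the container agent when the job has an IAM role attached.
        echo "container (job) role"
    else
        echo "instance role or none"
    fi
}

echo "credential source: $(credential_source)"
```

Adding the `echo` line at the top of the task script and comparing the output between the 'local' and Batch runs should show whether the mounted credentials file ever reaches the container.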

Make sure you are able to download those files using a plain Batch job.
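A plain Batch job for this check could be submitted with the AWS CLI roughly as follows. This is a sketch under stated assumptions: the queue name, job definition name, and S3 path are placeholders, the container image in the job definition is assumed to include the `aws` command, and the submission only runs when the hypothetical `RUN_BATCH_CHECK` variable is set to 1.

```shell
# Placeholders (assumptions): replace with your own queue, job definition, and object.
JOB_QUEUE="my-queue"
JOB_DEFINITION="my-job-def"
S3_OBJECT="s3://my-bucket/path/to/sample.bqsr.bam"

# Override the container command so the job just lists the object from inside Batch.
overrides=$(printf '{"command":["aws","s3","ls","%s"]}' "$S3_OBJECT")
echo "container overrides: $overrides"

# Submit only on request, since this needs a real AWS account and Batch setup.
if [ "${RUN_BATCH_CHECK:-0}" = 1 ]; then
    aws batch submit-job \
        --job-name s3-access-check \
        --job-queue "$JOB_QUEUE" \
        --job-definition "$JOB_DEFINITION" \
        --container-overrides "$overrides"
fi
```

If this job also fails with a permission error, the problem lies in the IAM roles attached to the Batch compute environment or job definition rather than in the Nextflow pipeline.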
