Options to speed up sequenza analysis on large bam file

893 views
Skip to first unread message

Marc Williams

unread,
Dec 24, 2014, 7:15:36 AM12/24/14
to sequenza-...@googlegroups.com
Sequenza looks like a great package and I'm looking forward to using it. I was wondering whether any of you have suggestions for using sequenza with large data sets.

I have a large number of tumour - normal samples that have been whole genome sequenced to high depth (~100X) so the resulting bam files are about ~250G each. I figured the best way would be to split them up by chromosome and combine in the end. I'm also using the bam2seqz option as  I don't need the pileup files. Even for one chromosome the processing is still yet to finish after 3 days.

here's my command:


python /data/home/mpx155/R/packages/3.1.0/sequenza/exec/sequenza-utils.py bam2seqz \
-binning -w 50 --fasta /data/BCI-EvoCa/william/referenceHG19/ucsc.hg19.fasta \
-n samtools mpileup -f /data/BCI-EvoCa/william/referenceHG19/ucsc.hg19.fasta -q 20 \ 
/data/BCI-EvoCa/marc/gastric_cancers/bamfiles/chr1/original_align/normal.bam > normal.fifo \
-t samtools mpileup -f /data/BCI-EvoCa/william/referenceHG19/ucsc.hg19.fasta -q20 \
/data/BCI-EvoCa/marc/gastric_cancers/bamfiles/chr1/original_align/tumour.bam >tumour.fifo \
-gc /data/BCI-EvoCa/marc/refs/hg19.gc5Base.txt.gz | gzip > out.seqz.gz

Any suggestions as how to best tackle this with sequenza would be much appreciated.

Marc


Francesco Favero

unread,
Jan 12, 2015, 7:16:53 AM1/12/15
to sequenza-...@googlegroups.com
Dear Marc,

apologies for the late reply.

Looking at your code, it seems that you run the python script in the wrong way, probably the python script is not terminating immediately as it should, it sound like a bug I haven't though of. I will look into it.

I usually execute sequenza-utils with pypy instead of python, it will give you a considerable boost (4x to 6x time faster in my experience), especially for really big files.
I would run something similar:

pypy /data/home/mpx155/R/packages/3.1.0/sequenza/exec/sequenza-utils.py bam2seqz \
 --fasta /data/BCI-EvoCa/william/referenceHG19/ucsc.hg19.fasta \
 -n /data/BCI-EvoCa/marc/gastric_cancers/bamfiles/chr1/original_align/normal.bam \
 -t  /data/BCI-EvoCa/marc/gastric_cancers/bamfiles/chr1/original_align/tumour.bam \
 -gc /data/BCI-EvoCa/marc/refs/hg19.gc5Base.txt.gz \
 --chromosome chr1 | \gzip > out_chr1.seqz.gz

You can run multiple chromosomes in parallel, using GNU parallel for example as suggested in the wiki:

https://bitbucket.org/ffavero/sequenza/wiki/Sequenza_Utils#markdown-header-parallelize-the-execution

After you have a seqz.gz files you can use the binning function.

Alternatively you could add the binning process in the pipeline, but you will lose the "bigger" un-binned files:


pypy /data/home/mpx155/R/packages/3.1.0/sequenza/exec/sequenza-utils.py bam2seqz \
 --fasta /data/BCI-EvoCa/william/referenceHG19/ucsc.hg19.fasta \
 -n /data/BCI-EvoCa/marc/gastric_cancers/bamfiles/chr1/original_align/normal.bam \
 -t  /data/BCI-EvoCa/marc/gastric_cancers/bamfiles/chr1/original_align/tumour.bam \
 -gc /data/BCI-EvoCa/marc/refs/hg19.gc5Base.txt.gz \
 --chromosome chr1 | \
pypy /data/home/mpx155/R/packages/3.1.0/sequenza/exec/sequenza-utils.py binning -w 50 | \
gzip > out_chr1_bin50.seqz.gz


I usually process all the chromosomes in parallel without binning, and then merge the chromosome files while binning, obtained a single file for all the genome.

I'll try to update the wiki if something is not clear.

Thanks for trying sequenza :)

Best

Marc Williams

unread,
Jan 12, 2015, 8:00:42 AM1/12/15
to sequenza-...@googlegroups.com
Dear Francesco,

Thanks for your reply. Indeed I was running the command in the wrong way, like you say the python script was not terminating, I thought that it was just taking a long time. I've managed to get the process to work within a reasonable timeframe using pileup2seqz but as I don't need the pileup files the bam2seqz would be ideal. I don't have much experience with named pipes so was a bit confused, just to confirm I don't need to make the named pipes myself but rather can call bam2seqz with my bamfiles and it will generate the pipes? I think I got a little confused due to section 3.3 in the manual where bam2seqz is called with normal.fifo and tumour.fifo.

Thanks for the other suggestions for speed up I'll look into it.

Marc

Marc Williams

unread,
Jan 12, 2015, 9:54:39 AM1/12/15
to sequenza-...@googlegroups.com
Got bam2seqz working using your command. Will look into your other suggestions. Cheers Marc
Reply all
Reply to author
Forward
0 new messages