Index loading error when using seuqueza-utils bam2seqz with pileup input

336 views
Skip to first unread message

RYAN MIN-HWAN SOHN

unread,
Oct 5, 2020, 1:45:09 AM10/5/20
to Sequenza User Group
Hi, I'm using sequenza-utils bam2seqz to generate a seqz.gz file from a pileup input

Unfortunately, I'm having trouble operating the process mentioned above due to this error regarding a issue not finding a .tbi or .csi index for the respective pileup file.

A detailed error message is:

Could not load .tbi/.csi index of input.pileup

A detailed command i've been using is:

sequenza-utils \
bam2seqz \
--pileup \
-n ${normal}.pileup \
-t ${tumor}.pileup \
--fasta genome.fa \
-C 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT \
--parallel 39 \
--samtools samtools-exec \
--tabix tabix-exec \
-gc genome_primary.gc50Base.wig.gz \
-o ${sample}.seqz.gz

Of note, due to an excessive time to generate pileup file from bam file using sequenza-utils alone, I first processed my bam files using samtools and GNU parallel to make pileup files. Do I need to index this pileup file? Is it even possible to index pileup files? If it is, then how can I make the indexing possible?

Can you suggest any solution to this issue?

Thank you

Kind regards,
Ryan

Francesco Favero

unread,
Oct 5, 2020, 4:14:01 AM10/5/20
to RYAN MIN-HWAN SOHN, Sequenza User Group
Hi Ryan,

yes, if you query a chromosome in the pileup you need to index it with tabix (so you can fetch the data from a specific region).

To get the tbi it’s quite easy, the problem is that the pipleup need to be compressed with bgzip, and then the tabix command is something like this:

    tabix -s 1 -b 2 -e 2 <file.pileup.gz>

However, looking at your command line, I’m not sure the bam2seqz command will work properly with that massive parallelism, I haven’t tested it that much on that regard.
If you have problem you could try the pipeline in the sequenza docker (https://hub.docker.com/r/sequenza/sequenza)/
The pipeline in the docker processes the bam in parallel chromosome by chromosome. It sould work with singularity and others as well, or you could just install the python pipeline locally if you have problem with containers

Best regards

Francesco

--
You received this message because you are subscribed to the Google Groups "Sequenza User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sequenza-user-g...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/sequenza-user-group/103fb214-2e1b-45bc-8094-f12cfde2bdd8n%40googlegroups.com.

RYAN MIN-HWAN SOHN

unread,
Oct 7, 2020, 1:26:03 AM10/7/20
to Sequenza User Group
Hi, Francesco

First of all, I appreciate your quick and helpful reply. 
The indexing of the pileup files using tabix by the command you kindly suggested, indeed generated *.tbi index and subsequently, I managed to generate *seqz.gz file using the command below. Thankfully, sequenza-utils could handle this "massive parallelism" about which you concerned. 

sequenza-utils \
bam2seqz \
--pileup \
-n ${normal}.pileup \
-t ${tumor}.pileup \
--fasta genome.fa \
-C 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT \
--parallel 25 \
--samtools samtools-exec \
--tabix tabix-exec \
-gc genome_primary.gc50Base.wig.gz \
-o ${sample}.seqz.gz

I noticed that the command above generated seqz.gz files separately by chromosome number, such as ${sample}_1.seqz.gz, ${sample}_2.seqz.gz, ${sample}_3.seqz.gz..., ${sample}_MT.seqz.gz, and tabix-indexed these seqz.gz files separately as well. So I first used seqz_binning command to bin the original seqz files by:

for chromosome in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT
do
${sequenza} \
seqz_binning \
--seqz ${sample}_${chromosome}.seqz.gz \
-w 50 \
-o ${sample}_${chromosome}.small.seqz.gz
done

And then, using seqz_merge iteratively (A far better solution to this might exist though. This was all I could come up with) :

sequenza-utils \
seqz_merge \
-o seq1.small.seqz.gz \
-1 ${sample}_1.small.seqz.gz \
-2 ${sample}_2.small.seqz.gz

sequenza-utils \
seqz_merge \
-o seq2.seqz.gz \
-1 seq1.seqz.gz \
-2 ${sample}_3.small.seqz.gz  

.
.
.

I could finally generate output.small.seqz file for an input to sequenza analysis in R.

Thank you again for your help, Francesco!

Best regards,
Ryan

2020년 10월 5일 월요일 오후 5시 14분 1초 UTC+9에 favero.f...@gmail.com님이 작성:

Francesco Favero

unread,
Oct 7, 2020, 4:04:38 AM10/7/20
to RYAN MIN-HWAN SOHN, Sequenza User Group
Hi Ryan,

That’a a great news! I just wanted to mention that, although the seqz_merge function worked, it’s not meant for the merge of different chromosome, but for the merging of overlapping seqz files (eg a file originated from mutect vcf only has mutation, you can mergit with a seqz files from the same sample containing only heterozygous positions…).
That’s why the interface is impractical (it takes only 2 files at a time)

In your case you could do more simply:

zcat ${sample}_*.seqz.gz | awk ‘{if (NR==1) {print $0} else {if ($1!=“chromosome”){print $0}}}’ |\
sequenza-utils seqz_binning --seqz --window 50 -o ${sample}_bin50.seqz.gz

Or something similar (I’m not sure this will keep a proper chromosome order, eg numeric sorting vs character sorting)

Best

Francesco
> To view this discussion on the web visit https://groups.google.com/d/msgid/sequenza-user-group/2069e771-8d66-4e11-870b-3b2a99498b83n%40googlegroups.com.

RYAN MIN-HWAN SOHN

unread,
Oct 7, 2020, 10:16:11 PM10/7/20
to Sequenza User Group
Hi Francesco,

Again, thank you very much indeed for your considerate answer.

Your recommendation (merging seqz files from different chromosomes with simple Linux command) was extremely helpful. I've tweaked the command slightly to keep a desired chromosome order and to avoid a minor error due to a miss of hyphen(-) behind --seqz option:

zcat ${sample}_1.seqz.gz ${sample}_2.seqz.gz ${sample}_3.seqz.gz ${sample}_4.seqz.gz ${sample}_5.seqz.gz ${sample}_6.seqz.gz ${sample}_7.seqz.gz ${sample}_8.seqz.gz ${sample}_9.seqz.gz ${sample}_10.seqz.gz ${sample}_11.seqz.gz ${sample}_12.seqz.gz ${sample}_13.seqz.gz ${sample}_14.seqz.gz ${sample}_15.seqz.gz ${sample}_16.seqz.gz ${sample}_17.seqz.gz ${sample}_18.seqz.gz ${sample}_19.seqz.gz ${sample}_20.seqz.gz ${sample}_21.seqz.gz ${sample}_22.seqz.gz ${sample}_X.seqz.gz ${sample}_Y.seqz.gz | awk '{if (NR==1) {print $0} else {if ($1!="chromosome"){print $0}}}' | sequenza-utils seqz_binning -s - --window 50 -o ${sample}_bin50.seqz.gz

Not only it turned out great but It took only 10 or so minutes to process this binned seqz file using sequenza library in R.

I really appreciate it. 
Wish you all the best!

Kind regards
Ryan

2020년 10월 7일 수요일 오후 5시 4분 38초 UTC+9에 favero.f...@gmail.com님이 작성:
Reply all
Reply to author
Forward
0 new messages