Fwd: goby-framework - Google Groups: Message Pending [{II6Ih8eGlaCQRyoCcWEwAYljj77Gx6Hu0}]


Fabien Campagne

Feb 11, 2013, 11:18:14 AM
to goby-fr...@googlegroups.com

---------- Forwarded message ----------
From: Dror Kessler <dror.k...@gmail.com>
To: goby-fr...@googlegroups.com
Date: Mon, 11 Feb 2013 17:26:47 +0200
Subject: Goby BAM->Compact->BAM (alignment files) to smallest size + lossless
[moving discussion from http://campagnelab.org/software/goby/reference-documentation/modes/compact-to-sam/comment-page-1/#comment-1270]

Initial Query (Dror):
We are trying to use hybrid-1 compression (as in the benchmark) to compress BAM files into compact files that reference a reads file, so that the restored BAM file includes the information retrieved from the reads file (qualities, attributes, etc.). To the best of my understanding, the benchmark does not show such a use case: it does compress and restore using hybrid-1, but with low fidelity values (as shown in the paper). In theory, and according to the texts and data structures, this should be possible, but I'm not sure how. Reading the source file for the SamToCompact mode, I see no reference to or use of the '-q' argument, which I understand to be the carrier of the reads file. Is this something that should be possible, currently or in the future?

Initial Response (Fabien):
Regarding point 2, I assume you have used the options described in the methods (see the supplementary material file) to capture the most information from the BAM file (e.g., quality scores across the entire read, paired unmapped reads, all tags). You can also find these options described in this online material: http://campagnelab.org/software/goby/tutorials/whats-new-in-goby-2-0/
After using these options, the only bit you might be missing is the completely unmapped reads, which we keep in read files.

Follow up:

My process is to extract the reads from the BAM file and then to compress:

bash $GOBY_HOME/goby 16g ser \
        $1 \
        -o $2

$1 - input bam file
$2 - basename for resulting compact read file

bash $GOBY_HOME/goby 16g stc \
        -i $1 \
        -o $2 \
        -g $3 \
        -q $4 \
        --propagate-query-ids \
        -x AlignmentWriterImpl:permutate-query-indices=false \
        -x SAMToCompactMode:ignore-read-origin=false \
        -x MessageChunksWriter:codec=hybrid-1 \
        -x AlignmentCollectionHandler:enable-domain-optimizations=true \
        -x MessageChunksWriter:compressing-codec=true

$1 - input bam file
$2 - basename for output compact file(s)
$3 - genome (in compact format)
$4 - reads file from previous step

Stats for the compact file and first 100 lines of the input and output bams (in sam format) are attached.

My expectation was that the output BAM would be restored losslessly, with missing (duplicate) information restored from the genome (exactly matching reads) and from the reads file (non-exact matches, qualities, and attributes).

I know I'm doing something wrong, since the stats for the compact file do not contain a reference to the reads file (so it makes sense that it is not consulted on the way back, in compact-to-sam).

As you can see, the qualities and attributes (at least) are not being restored from the reads file. 
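
One way to quantify that difference, rather than eyeballing the attached SAM excerpts, is to count how many records differ in the sequence, quality, and optional-tag columns. The following is only a minimal sketch and is not part of Goby: the function name sam_field_diff is hypothetical, and it assumes that both SAM files list the same records in the same order and that tags appear in the same order within each record (a tag reordering would be counted as a difference).

```shell
# Hypothetical helper (not part of Goby): count records whose SEQ (column 10),
# QUAL (column 11), or optional tags (columns 12 and up) differ between an
# original SAM file ($1) and a restored SAM file ($2).
# Assumption: both files contain the same records in the same order.
sam_field_diff() {
    awk -F '\t' '
        FNR == 1 { fileno++; n = 0 }          # a new input file starts
        /^@/     { next }                     # skip SAM header lines
        {
            n++
            t = ""
            for (i = 12; i <= NF; i++) t = t "\t" $i   # join optional tags
        }
        fileno == 1 { seq[n] = $10; qual[n] = $11; tags[n] = t; next }
        fileno == 2 {
            total++
            if ($10 != seq[n])  dseq++
            if ($11 != qual[n]) dqual++
            if (t   != tags[n]) dtags++
        }
        END {
            printf "records=%d seq_diff=%d qual_diff=%d tag_diff=%d\n",
                   total, dseq, dqual, dtags
        }
    ' "$1" "$2"
}

# Example invocation with the attached files:
# sam_field_diff 1000S.sort.full.bam.1000.sam 1000S.sort.full.mode3.restored.bam.ssa.1000.sam
```

A lossless round trip would report zero differences in all three columns; non-zero qual_diff or tag_diff counts would correspond to the missing qualities and attributes described above.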

Can you please advise?

Thanks in advance.



1000S.sort.full.mode3.restored.bam.ssa.1000.sam
1000S.sort.full.mode3.bam.entries.stats
1000S.sort.full.bam.1000.sam