Fwd: goby-framework - Google Groups: Message Pending [{ILGIo4Wexe7_CyoCcWEwARlUTOcvXWvC0}]

Fabien Campagne

Feb 11, 2013, 11:57:22 AM
to goby-fr...@googlegroups.com

---------- Forwarded message ----------
From: Dror Kessler <dror.k...@gmail.com>
To: goby-fr...@googlegroups.com
Cc: 
Date: Mon, 11 Feb 2013 17:09:46 +0200
Subject: Goby BAM->Compact->BAM Lossless

Initial Query (Dror):  
I am trying to achieve 100% bam->compact->bam fidelity (with any compression setting, with or without preserving everything). I also noticed that in your paper you define fidelity in terms of percentage of size restored. We are looking at Goby as a lossless compression scheme for large volumes of BAMs, so we are trying to achieve exact restoration of BAMs. I have specific examples, but I'm looking for a statement from you, as Goby's designer, on whether Goby is indeed intended for such lossless compression and how to achieve it.

Initial response (Fabien): 
We illustrate fidelity in the Goby manuscript with the percentage of size restored, but we actually determine fidelity differently. To do this, you essentially need to compare data before compression to data after decompressing the compressed data. We therefore constructed round trip tests to verify compression/decompression fidelity. You can find the code of these tests under the class edu.cornell.med.icb.goby.readers.sam.TestGobyPaperTop5000s (test-src folder).

New info (Dror):
I think I understand your methodology and am doing something similar: converting the source BAM to SAM, converting the resulting BAM to SAM, and comparing the two. As I wrote, I'm looking for an exact match, i.e. a lossless round trip. In this test, I gladly accept any compression I get (currently 60%) and am looking mainly at the ability to fully restore the initial BAM, not bit for bit, but containing **exactly** the same content (as observed by the SAM equivalent).

The command I'm using for bam-to-compact is (using stable from github)

bash $GOBY_HOME/goby 16g stc \
    -i $1 \
    -o $2 \
    -g $3 \
    --preserve-all-mapped-qualities \
    --preserve-all-tags \
    --preserve-soft-clips \
    --preserve-read-names \
    -x AlignmentWriterImpl:permutate-query-indices=false \
    -x SAMToCompactMode:ignore-read-origin=false \
    -x MessageChunksWriter:codec=hybrid-1 \
    -x AlignmentCollectionHandler:enable-domain-optimizations=true \
    -x MessageChunksWriter:compressing-codec=true

Where: 
$1 - input bam file
$2 - output goby basename
$3 - genome, after build-sequence-cache

The conversion back to BAM uses the reverse conversion (compact to BAM). Samtools is used for BAM to SAM. The BAM file has about 100K records. The stats for the Goby entries are attached.

The first 100 lines of the input and output bams (in sam format) are also attached. I have sorted the attributes by their name to assist in comparing.
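
A comparison of this kind, where optional tags are sorted before the original and restored SAM records are compared, could be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the records are made up, and header lines (starting with '@') are simply skipped.

```python
def normalize_sam_record(line):
    """Split one SAM alignment line and sort its optional tags by name."""
    fields = line.rstrip("\n").split("\t")
    # The first 11 fields are mandatory; everything after is an optional tag.
    mandatory, tags = fields[:11], fields[11:]
    return tuple(mandatory) + tuple(sorted(tags))


def first_mismatch(original_lines, restored_lines):
    """Return the 1-based index of the first differing record, or None."""
    # Header lines start with '@' and are ignored in this sketch.
    original = [normalize_sam_record(l) for l in original_lines
                if not l.startswith("@")]
    restored = [normalize_sam_record(l) for l in restored_lines
                if not l.startswith("@")]
    for index, (a, b) in enumerate(zip(original, restored), start=1):
        if a != b:
            return index
    if len(original) != len(restored):
        # One file has extra records after the common prefix.
        return min(len(original), len(restored)) + 1
    return None


# Made-up records, for illustration only.
base = ["r1", "73", "chr1", "100", "37", "4M", "=", "100", "0", "ACGT", "IIII"]
original = ["\t".join(base + ["NM:i:0", "AS:i:4"])]
reordered = ["\t".join(base + ["AS:i:4", "NM:i:0"])]
changed = ["\t".join(base[:6] + ["*", "0"] + base[8:] + ["NM:i:0", "AS:i:4"])]

print(first_mismatch(original, reordered))  # None: only the tag order differs
print(first_mismatch(original, changed))    # 1: RNEXT and PNEXT differ
```

With this normalization, a change in tag order alone does not count as a difference, while any change in a mandatory field does.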

As you can observe, they are not the same. If you study the first line, you will note the first difference in the RNEXT and PNEXT columns (7 and 8).
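
To pinpoint which mandatory columns differ between an original and a restored record, a small helper along these lines may help. It is a sketch with a hypothetical function name and made-up records; the column names follow the SAM specification's field order, in which RNEXT and PNEXT are fields 7 and 8.

```python
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def differing_columns(original, restored):
    """List (column number, name, original value, restored value) per mismatch.

    Assumes both lines carry at least the 11 mandatory SAM fields.
    """
    a = original.rstrip("\n").split("\t")
    b = restored.rstrip("\n").split("\t")
    return [(i + 1, name, a[i], b[i])
            for i, name in enumerate(SAM_COLUMNS)
            if a[i] != b[i]]


# Made-up records, for illustration only.
before = "\t".join(["r1", "73", "chr1", "100", "37", "4M",
                    "=", "100", "0", "ACGT", "IIII"])
after = "\t".join(["r1", "73", "chr1", "100", "37", "4M",
                   "*", "0", "0", "ACGT", "IIII"])

print(differing_columns(before, after))
# [(7, 'RNEXT', '=', '*'), (8, 'PNEXT', '100', '0')]
```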

Could you please comment on these differences and suggest a way to advance to a lossless bam->compact->bam process?

Thank you again!

1000S.sort.full.bam.1000.sam
1000S.sort.full.bam.mode1.header.stats
1000S.sort.full.mode1.restored.bam.ssa.1000.sam

Fabien Campagne

Feb 14, 2013, 5:02:49 PM
to goby-fr...@googlegroups.com
Hi Dror,

Thanks for the details. Sorry it took a while to get back to you. At first glance, regarding the RNEXT & PNEXT columns, the first SAM input line looks odd: the flag (value=73) indicates that the read is paired and mapped and that its mate is unmapped, yet the mate information gives a location (same sequence, the '=' character). A file that follows the SAM specification, at least according to my understanding, should have no mate location when the mate is flagged as unmapped ("*", not "=", in RNEXT). Decompression does not restore the '=' because the proper mate reference must be '*' when the SAM flag indicates that the mate is unmapped (see http://ppotato.wordpress.com/2010/08/25/samtool-bitwise-flag-paired-reads/).
Do you see a reason why RNEXT should be '=' with an unmapped mate? Was this file produced by a particular aligner? 
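
For reference, the flag value can be decoded bit by bit. The Python sketch below uses the bit meanings from the SAM specification; the function names are hypothetical, and the conflict check encodes only the reading of the specification given above (mate flagged unmapped, so RNEXT should be '*'), not an official validation rule.

```python
# FLAG bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def mate_location_conflicts(flag, rnext):
    """True when the mate is flagged unmapped but RNEXT still names a reference.

    This follows the interpretation in the message above; it is an assumption,
    not a rule taken verbatim from the specification.
    """
    return bool(flag & 0x8) and rnext != "*"


print(decode_flag(73))                   # ['paired', 'mate_unmapped', 'first_in_pair']
print(mate_location_conflicts(73, "="))  # True
print(mate_location_conflicts(73, "*"))  # False
```

Since 73 = 64 + 8 + 1, the read is paired, first in its pair, and its mate is unmapped; the 'unmapped' bit (4) is not set, so the read itself is mapped.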

Please note that we compress the semantics of the file, not the actual characters. For instance, we will compress tags when the user specifies the --preserve-all-tags option, but the order of the tags may change between the input and the decompressed file, because tag order is syntactic and has no impact on the meaning of a BAM file.

Exact character-for-character compression (especially of invalid files) is not a goal of our project because, in practice, we care more about computing with valid compressed data.

Fabien

Dror Kessler

Feb 15, 2013, 4:23:12 AM
to goby-fr...@googlegroups.com
Fabien Hi. Thanx for your message. I will study and reply. Dror

Dror Kessler

Feb 17, 2013, 6:39:23 AM
to goby-fr...@googlegroups.com
Fabien Hi;

I understand the semantic fault you are indicating in the input file. We are studying it. 

Regarding the second query/issue, can you please comment on how to compress the SAM with references to a reads file as described?

Thank you again for your time and effort

Dror
