Re: Update, Short Question

Fabien Campagne

Mar 5, 2013, 8:13:03 AM
to Dror Kessler, goby-fr...@googlegroups.com
Dear Dror,

Sorry for the delay, I was away from the office for a while. The -q option is certainly something you could implement to recover information stored in a reads file that you do not store in a compressed alignment file. I would expect this to be quite challenging because reads are typically not in the order found in the alignment file and you will therefore need random access to the read data. Given the scale of the read files, this will not be trivial and at first glance, I would expect performance to be relatively low. It is not clear if performance will be sufficient for practical use, although if you use this only for archival it might be OK. Feel free to try this if you like, as we are not pursuing this line of work at this time. 

One piece of information we don't store at all in alignment files is the unmapped reads. You can retrieve these reads from the reads file by scanning it with the index of the unmapped reads. A single pass across the reads file is enough to retrieve all unmapped reads (this is a special case of what you describe; here we don't need to merge the read data with the alignment data). We use this strategy to implement a pathogen detection pipeline in GobyWeb and performance is quite good (see GobyWeb pre-print).
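
As a rough illustration of the single-pass retrieval described above, here is a minimal Python sketch. It does not use Goby's API: reads are modeled as plain `(query_index, name, sequence)` tuples, the "index of the unmapped reads" is assumed to be a set of query indices, and `scan_unmapped` is a hypothetical name chosen for this example.

```python
from typing import Iterable, Iterator, Set, Tuple

# Assumed record layout for illustration only: (query_index, read_name, sequence).
Read = Tuple[int, str, str]


def scan_unmapped(reads: Iterable[Read], unmapped_indices: Set[int]) -> Iterator[Read]:
    """Yield the reads whose query index is in unmapped_indices, in file order.

    The reads are visited once, front to back, so no random access is needed.
    """
    remaining = len(unmapped_indices)
    for read in reads:
        if remaining == 0:
            break  # every unmapped read has been found; stop scanning early
        if read[0] in unmapped_indices:
            remaining -= 1
            yield read


# Toy data standing in for a reads file.
reads = [(0, "r0", "ACGT"), (1, "r1", "GGCA"), (2, "r2", "TTAG"), (3, "r3", "CCGA")]
unmapped = list(scan_unmapped(reads, {1, 3}))
```

Because membership in the set is the only lookup, the cost is one sequential pass over the reads regardless of the order in which the alignments were written.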

Hope this helps, Best, Fabien


On Wed, Feb 27, 2013 at 8:10 AM, Dror Kessler <dror.k...@gmail.com> wrote:
Dear Fabien;

I hope you are doing well.

I wanted to update you that since our last communication, we've worked to "sanitize" our BAM files (removing semantic errors), and they can now be compressed and decompressed using Goby. As I wrote before, it's a fine framework.

The command line we're using to compress (as I wrote before) is:

bash $GOBY_HOME/goby 16g stc \
        -i $1 \
        -o $2 \
        -g $3 \
        --preserve-all-mapped-qualities \
        --preserve-all-tags \
        --preserve-soft-clips \
        --preserve-read-names \
        -x AlignmentWriterImpl:permutate-query-indices=false \
        -x SAMToCompactMode:ignore-read-origin=false \
        -x MessageChunksWriter:codec=hybrid-1 \
        -x AlignmentCollectionHandler:enable-domain-optimizations=true \
        -x MessageChunksWriter:compressing-codec=true

We call this compress mode 1.

We imagine (and the documentation led us to believe) that greater compression can be achieved by dropping information that is also available in a reads file (read names, attributes, etc.) and retrieving it from the reads file when decompressing (the reads file is of course common to multiple alignment files). We expect the command line for this mode (locally, we call it mode 3) to be something like this:

bash $GOBY_HOME/goby 16g stc \
        -i $1 \
        -o $2 \
        -g $3 \
        -q $4 \
        --propagate-query-ids \
        -x AlignmentWriterImpl:permutate-query-indices=false \
        -x SAMToCompactMode:ignore-read-origin=false \
        -x MessageChunksWriter:codec=hybrid-1 \
        -x AlignmentCollectionHandler:enable-domain-optimizations=true \
        -x MessageChunksWriter:compressing-codec=true

Note the reference to the reads file (-q).

I wrote to you before on this. 

Can you please comment in general on whether this is indeed possible (creating references to a reads file and dropping the read names, attributes, etc.) while still achieving lossless compression/decompression?

I can imagine writing a utility to post-process the decompression output and merge it with the reads file, but I'd rather not duplicate the work.

Thank you in advance

Dror






--
Fabien Campagne, PhD -- http://campagnelab.org

Assistant Professor,    Dept. of Physiology and Biophysics
                         Institute for Computational Biomedicine
Associate Director,      Biomedical Informatics Core, 
                      Clinical Translational Science Center 

Weill Medical College of Cornell University
phone:  (646)-962-5613  1305 York Avenue
fax:    (646)-962-0383  Box 140
New York, NY 10021

Do you speak next-gen?

See how GobyWeb can help simplify your NGS projects at http://gobyweb.campagnelab.org

Dror Kessler

Mar 6, 2013, 6:22:31 AM
to goby-fr...@googlegroups.com
Dear Fabien;

Thank you for your response. I hope you had a good time off/vacation.

Yes, I have already implemented something similar to the -q option. To overcome the issue you indicated, I used an in-process Java database library (https://code.google.com/p/jdbm2/ - I'm looking into more advanced versions of this project since they are faster) to index the reads file on the read name. It slows things down for sure, but it works. I have also added some info to the ReadEntry record. This is all experimental and I will write more when I have additional results.
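
To give an idea of the approach, here is a minimal sketch of indexing a reads file on the read name. It is not my actual code: it is written in Python rather than Java, uses the standard `sqlite3` module as a stand-in for JDBM2, and assumes a well-formed 4-line-per-record FASTQ file instead of Goby's own reads format. The names `build_name_index` and `fetch_read` are made up for this example.

```python
import os
import sqlite3
import tempfile


def build_name_index(fastq_path: str, db: sqlite3.Connection) -> None:
    """Store the byte offset of every 4-line FASTQ record, keyed by read name."""
    db.execute("CREATE TABLE IF NOT EXISTS reads (name TEXT PRIMARY KEY, offset INTEGER)")
    with open(fastq_path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header:
                break  # end of file
            # Read name = text after '@' up to the first whitespace.
            name = header[1:].split()[0].decode("ascii")
            db.execute("INSERT INTO reads VALUES (?, ?)", (name, offset))
            for _ in range(3):  # skip sequence, '+' separator, and qualities
                handle.readline()
    db.commit()


def fetch_read(fastq_path: str, db: sqlite3.Connection, name: str):
    """Random access: look up the offset by name, seek, and read one record."""
    row = db.execute("SELECT offset FROM reads WHERE name = ?", (name,)).fetchone()
    if row is None:
        raise KeyError(name)
    with open(fastq_path, "rb") as handle:
        handle.seek(row[0])
        lines = [handle.readline().decode("ascii").strip() for _ in range(4)]
    return lines[0][1:].split()[0], lines[1], lines[3]  # (name, sequence, qualities)


# Toy demonstration on a two-record file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reads.fastq")
    with open(path, "wb") as out:
        out.write(b"@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nHHHH\n")
    db = sqlite3.connect(":memory:")
    build_name_index(path, db)
    record = fetch_read(path, db, "r2")
```

For real read files the index would live on disk (a file path instead of `":memory:"`), since it will not fit in memory at that scale; every lookup then costs one database query plus one seek, which is where the slowdown comes from.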

Thanx again

Dror
