On Jun 29, 2012, at 8:47 AM, Fabien Campagne wrote:

Hello James,

Thanks for your comments. I'll try to answer the various points you noted:

1. Compression/decompression speed. You note that Goby is in the ballpark, but I would like to point out that what you measure includes
a. conversion to BAM model
b. compression
and similarly, when you write back BAM from Goby:
c. decompression,
d. conversion from Goby encoding to BAM.

If you wrote an alignment directly from an aligner with the Goby representation, you would not incur a. When you work directly with Goby alignments, you do not incur d. The cost of a. or d. turns out to dominate the cost of b. and c. in the conversions from/to BAM because Goby represents alignments slightly differently from BAM (we think the method we use is simpler and more extensible).

If you were to measure compression only (you can do this with the concatenate alignment mode, which will happily recompress an existing alignment with a different codec), or measure decompression only (e.g., timing the compact-file-stats mode, which decompresses every entry of an alignment), you would probably find that Goby compression/decompression is much closer to samtools.
We are not interested in decompression alone. What we care about more in practice is decoding, i.e. decompressing the data and then representing the alignment in a data structure ready for use by other APIs. I am assuming that once goby decodes an alignment, it will take a similar amount of time, in comparison to Picard, to write the alignment in the SAM format.
On Fri, Jun 29, 2012 at 7:40 AM, James Bonfield <j...@sanger.ac.uk> wrote:
On Fri, Jun 29, 2012 at 09:56:45AM +0100, James Bonfield wrote:
> However it is indeed very slow compared to other alternatives. I've

I take that back now - it was partly my input data. On a more sensible
set it operates reasonably.
A quick test on the very shallow small test set from SeqSqueeze; about
300,000 reads aligned against the human genome:
Prog                Size     C.Time
--------------------------------------
samtools        28535830       6.2s
fqzcomp (low)   15682012       1.6s
fqzcomp (high)  15282395       2.8s
samcomp1        16222671       5.6s
samcomp1 -r      9743923       6.8s
goby            12742632[1]   22.8s
goby -g         12742632[1]   18.8s
CRAM            11152360[2]   41.2s

[1] Lost 4.2% of the data, unmapped reads?
[2] No read names
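
As a reading aid for the table above, the short sketch below expresses each row relative to samtools. The sizes and times are copied from the table; the function and variable names are made up for this illustration. The caveats in the footnotes and in the discussion that follows still apply, since the programs do not all store the same fields, so the ratios are only a rough guide.

```python
# Sizes (bytes) and compression times (seconds) copied from the table above.
results = {
    "samtools": (28535830, 6.2),
    "fqzcomp (low)": (15682012, 1.6),
    "fqzcomp (high)": (15282395, 2.8),
    "samcomp1": (16222671, 5.6),
    "samcomp1 -r": (9743923, 6.8),
    "goby": (12742632, 22.8),
    "goby -g": (12742632, 18.8),
    "CRAM": (11152360, 41.2),
}


def relative_to(baseline, table):
    """Return {program: (size ratio, time ratio)} relative to `baseline`."""
    base_size, base_time = table[baseline]
    return {
        name: (size / base_size, secs / base_time)
        for name, (size, secs) in table.items()
    }


ratios = relative_to("samtools", results)
for name, (size_ratio, time_ratio) in ratios.items():
    print(f"{name:<16} size x{size_ratio:.2f}  time x{time_ratio:.2f}")
```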
So there are a few oddities. Samtools is artificially high here as it
includes the auxiliary fields which other programs are not storing
either because they can't (fqzcomp, samcomp) or have been told not
to.
fqzcomp is just a fastq compressor, so it stores even less. It shows
though the raw name, seq, qual size we can get.
samcomp1 with and without a reference shows a substantial variation in
size, as expected. CRAM is somewhere between the two in ratio (and
excludes names, which took up about 810k in samcomp1). Goby is doing
great without a reference and bizarrely making no difference with one.
I must be doing something wrong. It's the same fasta file I supplied
CRAM and samcomp1 with though so I'm sure it's correct. However it
just seems to have no impact on the result.
Speed wise Goby is faster than CRAM here. Maybe the extreme low
coverage is being unfair as it perhaps is testing the time to load the
reference more than to load the data.
Anyway, it's in the right ballpark.
James
--
James Bonfield (j...@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
_______________________________________________
Samtools-devel mailing list
Samtool...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-devel
--
Fabien Campagne, PhD -- http://campagnelab.org
Assistant Professor, Dept. of Physiology and Biophysics
Institute for Computational Biomedicine
Associate Director, Biomedical Informatics Core,
Clinical Translational Science Center
Weill Medical College of Cornell University
phone: (646)-962-5613 1305 York Avenue
fax: (646)-962-0383 Box 140
New York, NY 10021
Do you speak next-gen?
See how GobyWeb can help simplify your NGS projects at http://gobyweb.campagnelab.org
On Fri, Jun 29, 2012 at 9:55 AM, Heng Li <l...@sanger.ac.uk> wrote:

Heng,

Your assumption is incorrect. We made different data representation choices, and there is a cost for the conversions Goby <> SAM, which does not exist with Picard since its representation is aligned with SAM.

To be more specific, we store spliced alignments as two aligned entries, not one as is done in SAM/BAM. This makes it possible to represent fusions natively without tricks (e.g., see what the TopHat group had to do to store fusion info in SAM format). We also store sequence variations differently: we don't store CIGAR strings but instead have a list of sequence variation data structures, which stores all the info. This list of differences is not exhaustive. Goby diverged from BAM when we believed there was an opportunity to improve program readability, improve program performance, or simplify common tasks. Note that while the representations are different, they provide similar functionality.

We designed Goby to store sequencing data as effectively as we can, so that we can compute with it. It comes with its own APIs (which we think are simpler to learn and use than Picard's). The APIs encode/decode data between disk and data structures in memory. The compression/decompression steps I refer to obviously include encoding/decoding to data structures (the Goby ones). Since the Goby data structures are different from the ones used in BAM/SAM, programmers used to SAM may find that their intuition about SAM is not very useful for predicting performance with Goby.

Decoding performance can be measured for a Goby alignment with the command:

$ goby 3g compact-file-stats alignment.entries

[this will traverse the entire file to collect simple statistics; data are completely decompressed and decoded to memory with the Goby API]

You will find that the performance of this process is in the ballpark of decoding an equivalent BAM alignment with samtools (when using the hybrid-1 codec for best compression), or is faster than samtools (when using the GZIP codec for best speed). The codec is an option that lets users/developers control the tradeoffs for a particular application.
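
To make the spliced-alignment difference described above concrete, here is a minimal sketch of splitting one SAM-style record into separate entries at each N (skipped region) operation. It is not Goby's code or file format: the function name and the (ref_start, query_start, cigar) tuple are hypothetical, chosen only to illustrate the idea of one entry per aligned segment.

```python
import re

# SAM CIGAR operations, as (length, op) pairs.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")    # operations that advance along the reference
QUERY_CONSUMING = set("MIS=X")  # operations that advance along the read


def split_spliced(ref_start, cigar):
    """Split a SAM-style alignment at each N operation.

    Returns a list of hypothetical (ref_start, query_start, cigar) tuples,
    one per contiguous aligned segment.
    """
    segments = []
    ref_pos, query_pos = ref_start, 0
    seg_ref, seg_query, seg_ops = ref_pos, query_pos, []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # Close the current segment and start a new one after the gap.
            if seg_ops:
                segments.append((seg_ref, seg_query, "".join(seg_ops)))
            ref_pos += length
            seg_ref, seg_query, seg_ops = ref_pos, query_pos, []
            continue
        seg_ops.append(f"{length}{op}")
        if op in REF_CONSUMING:
            ref_pos += length
        if op in QUERY_CONSUMING:
            query_pos += length
    if seg_ops:
        segments.append((seg_ref, seg_query, "".join(seg_ops)))
    return segments


# A 76-base read spanning a 1000-base intron becomes two entries.
print(split_spliced(100, "50M1000N26M"))
```

Going the other way (merging two such entries back into one SAM record) needs the entries to be linked to each other, which is one reason a conversion step is more than a field-by-field copy.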
On Jun 30, 2012, at 2:07 PM, Fabien Campagne wrote:
> Heng, Your assumption is incorrect. We made different data representation choices, and there is a cost for the conversions Goby <> SAM, which does not exist with Picard since its representation is aligned with SAM. [...]

So far as I am aware, the major difference between goby and BAM is that in goby reads and alignments are separated, while in BAM they always come together. I am surprised that this difference adds such a great overhead to the conversion between the goby and BAM data representations. Reconstructing CIGAR from differences should be a fast operation.

> Decoding performance can be measured for a Goby alignment with the command:
> goby 3g compact-file-stats alignment.entries [...]

With the default codec (what is the default, zlib or hybrid? I actually do not know how to change it), samtools index/flagstat is 5 times as fast as compact-file-stats. Compact-file-stats is indeed much faster than compact-to-sam, but I really do not understand what overhead goby adds.
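
For readers following the CIGAR point, the sketch below shows the kind of single linear pass being discussed: rebuilding a CIGAR string from a sorted list of differences against the reference. The (ref_offset, ref_bases, read_bases) tuple is a hypothetical stand-in, not Goby's sequence variation structure, and the sketch says nothing about where the time in Goby's actual conversion goes.

```python
def cigar_from_variations(ref_span, variations):
    """Build a CIGAR string for an alignment covering `ref_span` reference bases.

    `variations` is a list of hypothetical (ref_offset, ref_bases, read_bases)
    tuples, sorted by ref_offset and non-overlapping. An empty `ref_bases`
    means an insertion in the read; an empty `read_bases` means a deletion.
    """
    ops = []  # list of [length, op], merged as we go

    def emit(length, op):
        if length <= 0:
            return
        if ops and ops[-1][1] == op:
            ops[-1][0] += length
        else:
            ops.append([length, op])

    cursor = 0  # next reference offset not yet covered by an operation
    for ref_offset, ref_bases, read_bases in variations:
        emit(ref_offset - cursor, "M")       # unchanged bases before the variation
        shared = min(len(ref_bases), len(read_bases))
        emit(shared, "M")                    # substitutions still align as M
        emit(len(read_bases) - shared, "I")  # extra read bases
        emit(len(ref_bases) - shared, "D")   # missing reference bases
        cursor = ref_offset + len(ref_bases)
    emit(ref_span - cursor, "M")
    return "".join(f"{n}{op}" for n, op in ops)


# One substitution, one 2-base insertion, one 2-base deletion.
print(cigar_from_variations(50, [(10, "A", "G"), (20, "", "AC"), (30, "TT", "")]))
```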
Heng Li wrote:
> So far as I am aware, the major difference between goby and BAM is that in goby reads and alignments are separated, while in BAM, they always come together. I am surprised that this difference will add such a great overhead to the conversion between the goby and BAM data representations. Reconstructing CIGAR from differences should be a fast operation.

Again, your intuition may not be a good guide here. There are more differences than you seem to realize.
$ goby 3g build-sequence-cache genome.fa -b genome-basename

will yield genome-basename.*, which you can use as follows:

$ goby 3g compact-to-sam goby-basename -o output.bam --genome genome-basename
On Thu, Jul 5, 2012 at 1:35 PM, Heng Li <l...@sanger.ac.uk> wrote:
> In addition, why is the wall-clock time less than the CPU time, and sometimes by a lot? Is goby multi-threaded by itself, or is the difference caused by multithreaded garbage collection or other Java VM operations?
Heng,

I was also a bit puzzled, so I looked at this more closely and found that the difference between user/CPU time and real/wall-clock time was due to the JVM HotSpot compiler optimizing the code in the background (optimization gets counted towards user/CPU time, but not towards wall-clock time, because on this machine it could run in parallel on other cores). I could verify this by running the code via Nailgun. The first run was much longer, but once the code was optimized, subsequent runs were a bit better than the wall-clock speeds I listed earlier.

Five minutes is about the time HotSpot needs to optimize this code. On long-running jobs we sometimes see a doubling of performance in the first few minutes as the code is rewritten on the fly by the optimizer, so I guess this is worth it. There is less of an impact for short-running jobs if you use the -client JVM option, which should be the default on desktop/laptop-class machines, but I ran this mini-benchmark on a server-class machine.
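
The wall-clock versus CPU time comparison discussed above can be reproduced for any command with a small wrapper like the sketch below. The function name is made up for this example, and it relies on the child-process fields of os.times(), which are only populated on Unix-like systems. A CPU/wall ratio above 1 shows that more than one core was busy; it does not by itself say whether the extra work was JIT compilation, garbage collection or application threads.

```python
import os
import subprocess
import sys
import time


def time_command(argv):
    """Run `argv` and return (wall_seconds, cpu_seconds) for the child process.

    cpu_seconds is user + system time summed over all threads of the child,
    so it can exceed wall_seconds when the child uses several cores.
    """
    before = os.times()
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return wall, cpu


# Example with a trivial child process; substitute the command to be measured.
wall, cpu = time_command([sys.executable, "-c", "pass"])
print(f"wall {wall:.2f}s  cpu {cpu:.2f}s")
```

Timing the same command several times in a row, or running it inside an already warm JVM as was done with Nailgun above, separates start-up and optimization cost from steady-state speed.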