Re: [Samtools-devel] Goby 2.0 released

48 views
Skip to first unread message

Fabien Campagne

unread,
Jun 30, 2012, 2:07:48 PM6/30/12
to Heng Li, James Bonfield, samtools-devel, goby-fr...@googlegroups.com


On Fri, Jun 29, 2012 at 9:55 AM, Heng Li <l...@sanger.ac.uk> wrote:

On Jun 29, 2012, at 8:47 AM, Fabien Campagne wrote:

Hello James,

Thanks for your comments. I'll try to answer the various points you noted:

1. Compression/decompression speed. You note that Goby is in the ballpark, but I would like to note that what yourmeasure includes
 a. conversion to BAM model
 b. compression
and similarly, when you write back BAM from Goby: 
 c. decompression, 
 d conversion from Goby encoding to BAM.

If you wrote an alignment directly from an aligner with the Goby representation, you would not incur a. When you work directly with Goby alignments, you do not incur d. The cost of a or d. turns out to dominate the cost of b and c in the conversions from/to BAM because Goby represents alignments slightly differently from BAM (we think the method we use is simpler and more extensible). 

If you were to measure compression only (you can do this with the concatenate alignment that will happily recompress an existing alignment with a different codec), or measure decompression only (e.g., timing the compact-file-stats mode that decompresses every entry of an alignement), you would probably find that Goby compression/decompression is much closer to samtools.

We are not interested in decompression alone. What we care more in practice is decoding, i.e. decompressing data and then representing the alignment in a data structure ready for use by other APIs. I am assuming that once goby decodes an alignment, it will take similar amount of time, in comparison to Picard, to write the alignment in the SAM format.

Heng, Your assumption is incorrect. We made different data representation choices, and there is a cost for the conversions  Goby <> SAM, which does not exist with Picard since its representation is aligned with SAM. To be more specific, we store spliced alignments as two aligned entries, not one as is done in SAM/BAM. This makes it possible to represent fusions natively without tricks (e.g., see what the TopHat group had to do to store fusion info in SAM format). We also store sequence variations differently: we don't store CIGAR strings but instead have a list of sequence variation data structures, which stores all the info. This list of differences is not exhaustive. Goby diverged from BAM when we believed there was an opportunity to improve program readability, improve program performance, or simplify common tasks. Note that while the representations are different, they provide similar functionality. 
  
We designed Goby to store sequencing data the most effectively we can, so that we can compute with it. It comes with its own APIs (which we think are simpler to learn and use than picard). The APIs encode/decode data between disk and data structures in memory. The compression/decompression steps I refer to obviously include encoding/decoding to data structures (the Goby ones). Since the Goby data structures are different from the ones used in BAM/SAM, programmers used to SAM may find that their intuition about SAM is not very useful to predict performance with Goby. 

Decoding performance can be measured for a Goby alignment with the command:

goby 3g compact-file-stats alignment.entries [this will traverse the entire file to collect simple statistics, data are completely decompressed and decoded to memory with the Goby API]

You will find that performance of this process is in the ballpark of decoding an equivalent BAM alignment with samtools (when using the hybrid-1 codec for best compression), or is faster than samtools (when using the GZIP codec for best speed). The codec is an option that lets users/developers control the tradeoffs for a particular application. 

As with any new tool, it may be wise to the read documentation and when in doubt ask questions. We'll be happy to offer additional clarifications at the Goby forum user group. You can look it up here or email directly:  goby-fr...@googlegroups.com 

We look forward to additional discussions with members of the SAM/BAM developer community. Best, Fabien
 
On Fri, Jun 29, 2012 at 7:40 AM, James Bonfield <j...@sanger.ac.uk> wrote:
On Fri, Jun 29, 2012 at 09:56:45AM +0100, James Bonfield wrote:
> However it is indeed very slow compared to other alternatives. I've

I take that back now - it was partly my input data. On a more sensible
set it operates reasonably.

A quick test on the very shallow small test set from SeqSqueeze; about
300,000 reads aligned against the human genome:

Prog            Size            C.Time
--------------------------------------
samtools        28535830         6.2s
fqzcomp (low)   15682012         1.6s
fqzcomp (high)  15282395         2.8s
samcomp1        16222671         5.6s
samcomp1 -r      9743923         6.8s
goby            12742632[1]     22.8s
goby -g         12742632[1]     18.8s
CRAM            11152360[2]     41.2s

[1] Lost 4.2% of the data, unmapped reads?
[2] No read names

So there are a few oddities. Samtools is artificially high here as it
includes the auxillary fields which other programs are not storing
either because they can't (fqzcomp, samcomp) or have been told not
to.

fqzcomp is just a fastq compressor, so it stores even less. It shows
though the raw name, seq, qual size we can get.

samcomp1 with and without a reference shows a substantial variation in
size, as expected. CRAM is somewhere between the two in ratio (and
excludes names, which took up about 810k in samcomp1). Goby is doing
great without a reference and bizarrely making no difference with one.

I must be doing something wrong. It's the same fasta file I supplied
CRAM and samcomp1 with though so I'm sure it's correct. However it
just seems to have no impact on the result.

Speed wise Goby is faster than CRAM here. Maybe the extreme low
coverage is being unfair as it perhaps is testing the time to load the
reference more than to load the data.

Anyway, it's in the right ballpark.

James

--
James Bonfield (j...@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova
                                 | Plurima gyrabant gymbolitare vabo;
 A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi.


--
 The Wellcome Trust Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
Samtools-devel mailing list
Samtool...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-devel



--
Fabien Campagne, PhD -- http://campagnelab.org

Assistant Professor,    Dept. of Physiology and Biophysics
                         Institute for Computational Biomedicine
Associate Director,      Biomedical Informatics Core, 
                      Clinical Translational Science Center 

Weill Medical College of Cornell University
phone:  (646)-962-5613  1305 York Avenue
fax:    (646)-962-0383  Box 140
New York, NY 10021

Do you speak next-gen?

See how GobyWeb can help simplify your NGS projects at http://gobyweb.campagnelab.org


-- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.



--
Fabien Campagne, PhD -- http://campagnelab.org

Assistant Professor,    Dept. of Physiology and Biophysics
                         Institute for Computational Biomedicine
Associate Director,      Biomedical Informatics Core, 
                      Clinical Translational Science Center 

Weill Medical College of Cornell University
phone:  (646)-962-5613  1305 York Avenue
fax:    (646)-962-0383  Box 140
New York, NY 10021

Do you speak next-gen?

See how GobyWeb can help simplify your NGS projects at http://gobyweb.campagnelab.org

Vadim Zalunin

unread,
Jul 2, 2012, 6:56:54 AM7/2/12
to Fabien Campagne, goby-fr...@googlegroups.com, samtools-devel
Hi Fabien,

It seems that goby 2.0.1 does not restore (or store?) some tags completely. Here are the symptoms:
in the source BAM:
OQ:Z:HHHHGGGGFGEGFGGGGGGGGHHHDHHHHHHHGHHHHHHHFFGEGEFFG<?:?0B=ABA,>.==;E

In the restored BAM from GOBY:
OQ:Z:HHHHGGGGFGEGFGGGGGGGGHHHDHHHHHHHGHHHHHHHFFGEGEFFG<?

It looks like ':' is treated as a stop symbol because none of the restored tags have this symbol and symbols beyond it.

Cheers,
Vadim
vadim.vcf

Fabien Campagne

unread,
Jul 2, 2012, 9:51:37 AM7/2/12
to Vadim Zalunin, goby-fr...@googlegroups.com, samtools-devel
Hi Vadim,

Thanks for the bug report. These tags are stored correctly (you can do goby 1g alignment-to-text --format prototext test.sam on a small file to confirm this), but the code that converts back to BAM had an issue. 

When setting back tags on a samRecord (with picard) we would like to just do setAttribute("MD:Z:SKSJS:S::AJ"), but picard only offers setAttribute(tag,value), where value must have the specific type encoded in the letter after the first :. While we store tags without worrying about their type up to this point, in order to write back sam/BAM we need to look at the tag and figure out which type it is. The code that extracted the tag and its type did not consider you could also have a colon in the value and incorrectly truncated the value. A generic method that works on any attribute string would have been nice on the picard API. 

We have committed the fix for this bug in the 'stable' branch on github. Thanks again.

Fabien

Vadim Zalunin

unread,
Jul 4, 2012, 7:32:16 AM7/4/12
to Fabien Campagne, goby-fr...@googlegroups.com
Hello again,

I noticed that adding '--preserve-soft-clips' option slows sam-to-compat mode too much in cases where there are long soft clips. For example the first 3 reads in one of the BAM files have the following cigars:
95M5S
22S78M
80M20S

Without '--preserve-soft-clips' option compression works at a rate of tens of thousand items (records?)  per second. But with this option the rate is only 10-20 items per second. Is it a bug?

Thanks,
Vadim
vadim.vcf

Fabien Campagne

unread,
Jul 4, 2012, 7:56:11 AM7/4/12
to goby-fr...@googlegroups.com, Fabien Campagne
Hello Vadim,

Thanks for the feedback. It is not a bug when the data are imported correctly, but it is certainly strange that performance would drop so much.  Could you share the bam file (or a subset of a few million reads) where you observed this difference in performance?  We would like to profile the tool and try to optimize the step that is a bottleneck in such cases. Thanks, Fabien

Fabien Campagne

unread,
Jul 4, 2012, 8:52:55 AM7/4/12
to goby-fr...@googlegroups.com, Fabien Campagne
I tried filtering one of the datasets we work with to keep only SAM records with 20S in them. This yielded a small BAM file with relatively long soft-clips, but I am so far unable to replicate the problem you report: see below, timing looks very similar on this small dataset, with or without --preserve-soft-clips. If you can point us to your test dataset, could you also please copy/paste the exact command lines you ran that show the problem? Thanks.

The clips in our test file look like this:

20S16M2D44M1D20M
20S80M
20S8M1D17M1I4M1D50M

goby 1g  stc -i large-clips.bam -o large-clips --preserve-soft-clips -x MessageChunksWriter:codec=hybrid-1
...
Total logical entries written: 7234
...
real 0m3.066s
user 0m4.818s
sys 0m0.159s

goby 1g  stc -i large-clips.bam -o large-clips  -x MessageChunksWriter:codec=hybrid-1
...
Total logical entries written: 7234
...

real 0m2.973s
user 0m4.794s
sys 0m0.186s

Heng Li

unread,
Jul 5, 2012, 11:36:59 AM7/5/12
to Fabien Campagne, James Bonfield, samtools-devel, goby-fr...@googlegroups.com
On Jun 30, 2012, at 2:07 PM, Fabien Campagne wrote:



On Fri, Jun 29, 2012 at 9:55 AM, Heng Li <l...@sanger.ac.uk> wrote:

Heng, Your assumption is incorrect. We made different data representation choices, and there is a cost for the conversions  Goby <> SAM, which does not exist with Picard since its representation is aligned with SAM. To be more specific, we store spliced alignments as two aligned entries, not one as is done in SAM/BAM. This makes it possible to represent fusions natively without tricks (e.g., see what the TopHat group had to do to store fusion info in SAM format). We also store sequence variations differently: we don't store CIGAR strings but instead have a list of sequence variation data structures, which stores all the info. This list of differences is not exhaustive. Goby diverged from BAM when we believed there was an opportunity to improve program readability, improve program performance, or simplify common tasks. Note that while the representations are different, they provide similar functionality. 

So far as I am aware, the major difference between goby and BAM is that in goby reads and alignments are separated, while in BAM, they always come together. I am surprised that this difference will add such a great overhead to the conversion between the goby and BAM data representations. Reconstructing CIGAR from differences should be a fast operation. 

  
We designed Goby to store sequencing data the most effectively we can, so that we can compute with it. It comes with its own APIs (which we think are simpler to learn and use than picard). The APIs encode/decode data between disk and data structures in memory. The compression/decompression steps I refer to obviously include encoding/decoding to data structures (the Goby ones). Since the Goby data structures are different from the ones used in BAM/SAM, programmers used to SAM may find that their intuition about SAM is not very useful to predict performance with Goby. 

Decoding performance can be measured for a Goby alignment with the command:

goby 3g compact-file-stats alignment.entries [this will traverse the entire file to collect simple statistics, data are completely decompressed and decoded to memory with the Goby API]

You will find that performance of this process is in the ballpark of decoding an equivalent BAM alignment with samtools (when using the hybrid-1 codec for best compression), or is faster than samtools (when using the GZIP codec for best speed). The codec is an option that lets users/developers control the tradeoffs for a particular application.

With the default codec (what is the default, zlib or hybrid? I actually do not know how to change), samtools index/flatstat is 5 times as fast as compact-file-stats. Compact-file-stats is indeed much faster than compact-to-sam, but I really do not understand what overhead goby adds.

Heng

Fabien Campagne

unread,
Jul 5, 2012, 1:01:33 PM7/5/12
to Heng Li, samtools-devel, goby-fr...@googlegroups.com
On Thu, Jul 5, 2012 at 11:36 AM, Heng Li <l...@sanger.ac.uk> wrote:

On Jun 30, 2012, at 2:07 PM, Fabien Campagne wrote:



On Fri, Jun 29, 2012 at 9:55 AM, Heng Li <l...@sanger.ac.uk> wrote:

Heng, Your assumption is incorrect. We made different data representation choices, and there is a cost for the conversions  Goby <> SAM, which does not exist with Picard since its representation is aligned with SAM. To be more specific, we store spliced alignments as two aligned entries, not one as is done in SAM/BAM. This makes it possible to represent fusions natively without tricks (e.g., see what the TopHat group had to do to store fusion info in SAM format). We also store sequence variations differently: we don't store CIGAR strings but instead have a list of sequence variation data structures, which stores all the info. This list of differences is not exhaustive. Goby diverged from BAM when we believed there was an opportunity to improve program readability, improve program performance, or simplify common tasks. Note that while the representations are different, they provide similar functionality. 

So far as I am aware, the major difference between goby and BAM is that in goby reads and alignments are separated, while in BAM, they always come together. I am surprised that this difference will add such a great overhead to the conversion between the goby and BAM data representations. Reconstructing CIGAR from differences should be a fast operation. 

Again, your intuition may not be a good guide here. There are more differences than you seem to realize. 
  
We designed Goby to store sequencing data the most effectively we can, so that we can compute with it. It comes with its own APIs (which we think are simpler to learn and use than picard). The APIs encode/decode data between disk and data structures in memory. The compression/decompression steps I refer to obviously include encoding/decoding to data structures (the Goby ones). Since the Goby data structures are different from the ones used in BAM/SAM, programmers used to SAM may find that their intuition about SAM is not very useful to predict performance with Goby. 

Decoding performance can be measured for a Goby alignment with the command:

goby 3g compact-file-stats alignment.entries [this will traverse the entire file to collect simple statistics, data are completely decompressed and decoded to memory with the Goby API]

You will find that performance of this process is in the ballpark of decoding an equivalent BAM alignment with samtools (when using the hybrid-1 codec for best compression), or is faster than samtools (when using the GZIP codec for best speed). The codec is an option that lets users/developers control the tradeoffs for a particular application.

With the default codec (what is the default, zlib or hybrid? I actually do not know how to change

The default is GZip, see the tutorial: http://campagnelab.org/software/goby/tutorials/whats-new-in-goby-2-0/, section "Compressing alignments"

), samtools index/flatstat is 5 times as fast as compact-file-stats. Compact-file-stats is indeed much faster than compact-to-sam, but I really do not understand what overhead goby adds.

 
You are right that flagstat or index are faster. For instance (measuring from local disk, on a 20M reads file, see sizes below),

time samtools flagstat /scratchLocal/gobyweb/XAAOBVT-no-unmapped.bam >/dev/null

real 0m20.759s
user 0m20.190s
sys 0m0.564s

Since I am not sure how much flagstat/index decode after decompression (perhaps just flags or positions), I usually estimate BAM decompression speed like this:

time samtools view /scracthLocal/gobyweb/XAAOBVT-no-unmapped.bam >/dev/null [reads BAM, writes SAM to /dev/null]

real 0m53.983s
user 0m47.420s
sys 0m6.553s

With the same data in Goby format: time goby 1g cfs  /scratchLocal/gobyweb/XAAOBVT-gzip.entries

real 0m46.411s
user 1m1.232s
sys 0m0.836s

That's in the ballpark, yet about twice slower than samtools flagstat indeed. 

The hybrid codec is about three times slower than gzip on this file (in wall clock time):

real 2m39.674s
user 7m40.592s
sys 0m7.701s

File sizes are below for reference (19,929,026 100bp reads):
1.4G Jul  5 12:22 /scratchLocal/gobyweb/XAAOBVT-no-unmapped.bam
350M Jul  5 12:18 /scratchLocal/gobyweb/XAAOBVT-gzip.entries
107M Jul  5 12:42 /scratchLocal/gobyweb/XAAOBVT-hybrid.entries

Aaron Quinlan

unread,
Jul 5, 2012, 1:11:39 PM7/5/12
to Fabien Campagne, Heng Li, goby-fr...@googlegroups.com, samtools-devel



On Fri, Jun 29, 2012 at 9:55 AM, Heng Li <l...@sanger.ac.uk> wrote:

Heng, Your assumption is incorrect. We made different data representation choices, and there is a cost for the conversions  Goby <> SAM, which does not exist with Picard since its representation is aligned with SAM. To be more specific, we store spliced alignments as two aligned entries, not one as is done in SAM/BAM. This makes it possible to represent fusions natively without tricks (e.g., see what the TopHat group had to do to store fusion info in SAM format). We also store sequence variations differently: we don't store CIGAR strings but instead have a list of sequence variation data structures, which stores all the info. This list of differences is not exhaustive. Goby diverged from BAM when we believed there was an opportunity to improve program readability, improve program performance, or simplify common tasks. Note that while the representations are different, they provide similar functionality. 

So far as I am aware, the major difference between goby and BAM is that in goby reads and alignments are separated, while in BAM, they always come together. I am surprised that this difference will add such a great overhead to the conversion between the goby and BAM data representations. Reconstructing CIGAR from differences should be a fast operation. 

Again, your intuition may not be a good guide here. There are more differences than you seem to realize. 

Could you elaborate?  I am interested.

Best,
Aaron

Heng Li

unread,
Jul 5, 2012, 1:35:34 PM7/5/12
to Fabien Campagne, samtools-devel, goby-fr...@googlegroups.com
Samtools does not fully decodes auxiliary tags because in practice, we almost never decode all the tags. Nonetheless, the "view" command will spend a lot of time on formatting SAM, more than decompression+decoding. You should not take that as the decoding time, either. Flagstat/index represents the decoding speed for most typical applications (SNP calling, pileup, markduplicate, subsampling, extraction, indexing, etc).

I am also interested in the question Aaron was asking: what is bottleneck of Goby=>SAM conversion? In addition, why the wall-clock time is less than CPU time and sometimes a lot? Is goby multi-threaded by itself or the difference is caused by multithreaded garbage collection or other Java VM operations?

Heng

Fabien Campagne

unread,
Jul 5, 2012, 3:38:16 PM7/5/12
to Aaron Quinlan, Heng Li, goby-fr...@googlegroups.com, samtools-devel
Aaron, Heng,

I have posted a comparison of Goby and SAM representations at http://campagnelab.org/goby-sam-differences/ with examples. Hope this clarifies the differences.

You can easily understand that splicing differences can impact the performance of Goby>SAM since we need to collect all the possible splice parts before we can write a complete SAM entry.  Of course it is possible the conversion to SAM/BAM could be made much faster. Version 2 was our first release of compact-to-sam, we aimed for correctness and did not spend much time optimizing speed. 

Heng, the difference between wall-clock time and CPU time is indeed probably multi-threading. This code is written sequentially at this time, but it possible some of the components we use run multi-threaded in the JVM.
  
Fabien

Fabien Campagne

unread,
Jul 5, 2012, 4:01:15 PM7/5/12
to Heng Li, samtools-devel, goby-fr...@googlegroups.com
Heng,

Also, you will see better performance Goby>SAM if you use the Goby random access genome rather than an indexed fasta file (our cache is loaded entirely in memory for speed). You can build a cache like this:

$ goby 3g build-sequence-cache genome.fa -b genome-basename

will yield genome-basename.*, which you can use as follows:

$ goby 3g compact-to-sam  goby-basename  -o output.bam –genome genome-basename

The difference in speed can be dramatic when alignments are not sorted, because in this case the filesystem cache is not effective. 

Fabien

Fabien Campagne

unread,
Jul 6, 2012, 9:02:28 AM7/6/12
to Heng Li, samtools-devel, goby-fr...@googlegroups.com
On Thu, Jul 5, 2012 at 1:35 PM, Heng Li <l...@sanger.ac.uk> wrote:

In addition, why the wall-clock time is less than CPU time and sometimes a lot? Is goby multi-threaded by itself or the difference is caused by multithreaded garbage collection or other Java VM operations?

 
Heng, I was also a bit puzzled so I looked at this closer and found that the difference between user/CPU time and real/clock wall time was due to the JVM hotspot compiler optimizing the code in the background (optimization gets counted towards user/CPU time, but not in wall-clock because on this machine it could run in parallell on other cores). I could verify this by running the code via Nailgun. The first run was much longer, but once the code was optimized, subsequent runs were a bit better than the wall clock speeds I listed earlier. Five minutes is about the time HotSpot needs to optimize this code. On long running jobs we sometimes see a doubling of performance in the first few minutes as the code is rewritten on the fly by the optimizer, so I guess this is worth it. There is less of an impact for short running jobs if you use the --client JVM option, which should be the default on desktop/laptop class machines, but I ran this mini-benchmark on a server-class machine. 

Vadim Zalunin

unread,
Jul 6, 2012, 9:13:50 AM7/6/12
to Fabien Campagne, Heng Li, goby-fr...@googlegroups.com, samtools-devel
On 06/07/2012 14:02, Fabien Campagne wrote:


On Thu, Jul 5, 2012 at 1:35 PM, Heng Li <l...@sanger.ac.uk> wrote:

In addition, why the wall-clock time is less than CPU time and sometimes a lot? Is goby multi-threaded by itself or the difference is caused by multithreaded garbage collection or other Java VM operations?

 
Heng, I was also a bit puzzled so I looked at this closer and found that the difference between user/CPU time and real/clock wall time was due to the JVM hotspot compiler optimizing the code in the background (optimization gets counted towards user/CPU time, but not in wall-clock because on this machine it could run in parallell on other cores). I could verify this by running the code via Nailgun. The first run was much longer, but once the code was optimized, subsequent runs were a bit better than the wall clock speeds I listed earlier. Five minutes is about the time HotSpot needs to optimize this code. On long running jobs we sometimes see a doubling of performance in the first few minutes as the code is rewritten on the fly by the optimizer, so I guess this is worth it. There is less of an impact for short running jobs if you use the --client JVM option, which should be the default on desktop/laptop class machines, but I ran this mini-benchmark on a server-class machine.
-XX:CompileThreshold=10000    Number of method invocations/branches before compiling [-client: 1,500]
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

Vadim
vadim.vcf
Reply all
Reply to author
Forward
0 new messages