The future of sambamba - what to work on?


Pjotr Prins

Nov 12, 2017, 3:45:28 AM
to sambamba-discussion
Hi all,

Sambamba has turned out to be a successful tool, with over 70 citations and counting. Artem Tarasov wrote most of the software and did an amazing job, initially as our Open Bioinformatics Foundation (OBF) Google Summer of Code student and later as an EMBL scientific programmer! Now that he can no longer spend much time on it, I am freeing up some of my own time and hope we can get more community traction. Unlike some other C/C++ software, the sambamba code is really clear to read, as witnessed by the speed at which Artem has fixed bugs over time.

All software has bugs, and maintenance should always be a priority (where it matters).

The original sambamba was positioned as a samtools improvement. In many ways it still is. Even so, samtools may have caught up by now in some areas - we need to do the metrics soon - and I don't feel we need to provide a competing solution. Sambamba should do well what it is good at and leave the rest to other tools. I will also involve the samtools authors in deciding what to work on. My personal interest is genetics and efficiently applying GWA to data coming out of the sequencing center. I am also working on GEMMA for that reason and think I can combine the two efforts. I am less interested in helping people run sambamba on their desktop, though we will keep providing binaries for Linux.

To anyone reading this: I would like to ask you to reply and state which part of sambamba is most important for your work. Sambamba can be speedy and efficient, and we can work to make it even faster and reduce computational requirements (reducing carbon footprint is always a good idea). We think sambamba is running every second somewhere on the planet. Therefore, increasing speed and reducing power consumption is a priority for this tool. So, outside my own purposes in genetics, I am not interested in adding new functionality to sambamba, though anyone is welcome to contribute new code (of course).

What we need to do first is improve what sambamba already does well. You can help by answering the following questions:

Q: how are you using sambamba and at what scale?

Q: which features do you think are most important?

I think we'll add a ping-ware option to sambamba which will tell us how sambamba is used (we'll be strong on privacy, and you can opt out). That would be very useful information and could potentially help raise funding at some point in the future. I think most bioinformaticians won't mind.

Pj.

skant...@gmail.com

Nov 13, 2017, 10:08:00 AM
to sambamba-discussion
Hi Pjotr,
First off, a massive thank you to you and Artem. You have definitely delivered an exceptional piece of software. Artem has always been responsive to issues and requests on GitHub.
I work at Illumina, and computational efficiency is one of our core drivers. I've been using sambamba mainly for downsampling and marking duplicates, where it performs massively better than other common tools (samtools, Picard). It is now part of our whole-genome dev pipeline, so we're using it at quite a large scale. Downsampling is one of the cornerstones of what we do: all performance specs are set against a particular coverage, so we need to downsample to that coverage to test them.
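For concreteness, the fraction to downsample by is just target coverage over observed mean coverage (a rough sketch with made-up numbers; the observed mean would come from a coverage tool such as sambamba depth):

```shell
# Keep target/observed of the reads; never "upsample" past 1.0.
# Example numbers: 120x observed, 30x target -> keep a quarter of the reads.
frac=$(awk -v cov=120 -v target=30 \
    'BEGIN { f = (cov <= target) ? 1 : target / cov; print f }')
echo "$frac"
```

That fraction is then what we feed to the subsampling step of the pipeline.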

I've also found the filtering intuitive and powerful.

Going forward, at least for us, I think more efficient (memory- and time-wise) duplicate marking would be a top priority, since the tool struggles with high-coverage whole-genome BAMs (e.g. 75x+).

Let me know if there's anything I can help with to keep the tool well maintained.

Regards,
Stathis

Pjotr Prins

Nov 14, 2017, 3:32:53 AM
to sambamba-discussion
Thanks Stathis for the kind words! Can you send me an example of the sambamba commands you are actually using so I can add them to the performance tests? One of the first steps will be to visualize performance (on a website).

Pj.

skant...@gmail.com

Nov 14, 2017, 6:46:37 AM
to sambamba-discussion
Hi Pjotr,
Here's the command we use:
sambamba markdup --hash-table-size=2097152 --overflow-list-size=160000 --io-buffer-size=1024 in.bam out.bam
Regarding a 75x BAM, you could combine several publicly available BAMs to reach that depth. For example, you could make a free account on BaseSpace and use some of the public datasets there.

Thanks,
Stathis

Brad Chapman

Nov 14, 2017, 10:05:49 AM
to sambamba-discussion
Pjotr and Artem;
Great news that you have time and support for working on sambamba. For us the
number one help would be improving stability of the multicore steps. We're
still stuck with sambamba segfaulting on multiple platforms and ended up
working around it by transferring functionality to either samtools (which has
a lot more multicore now) or mosdepth (which is very fast and flexible for
depth calculations).

sambamba is the only tool I know of that correctly subsets a BAM file using a BED
file. samtools doesn't use the index, so it is very slow, and we work around
this by sectioning the file into parts using command-line regions.
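For reference, this is the shape of the invocation I mean (file names are placeholders, and it assumes a coordinate-sorted in.bam with its .bai index alongside - double-check the flags against your sambamba version's --help):

```shell
# -L hands sambamba a BED file; it uses the index to seek straight to
# those regions instead of streaming the whole file. -f bam keeps BAM
# output and -o names the result.
cmd="sambamba view -f bam -L regions.bed -o subset.bam in.bam"
echo "$cmd"   # echoed rather than run, since the inputs are placeholders
```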

In terms of new features, some useful things would be:

- First class CRAM support. I'm not sure where this is currently at in
  sambamba but I suspect we'll be moving more and more to CRAM soon.

- Support downsampling BAMs/CRAMs to a maximum coverage (only downsample if
  above a certain coverage). We added this in VariantBam and it's really
  useful for reducing out-of-control runtimes on WGS runs in collapsed repeats,
  but we're stuck trying to make it fast enough to be useful:

  https://github.com/walaj/VariantBam/issues/13

Hope this helps with ideas, and thanks again for all your great work on
sambamba,
Brad

Pjotr Prins

Nov 17, 2017, 12:37:49 AM
to Brad Chapman, sambamba-discussion
On Tue, Nov 14, 2017 at 07:05:49AM -0800, Brad Chapman wrote:
> Great news that you have time and support for working on
> sambamba. For us the number one help would be improving stability
> of the multicore steps. We're still stuck with sambamba
> segfaulting on multiple platforms and ended up working around it
> by transferring functionality to either samtools (which has a lot
> more multicore now) or mosdepth (which is very fast and flexible
> for depth calculations).

The segfaulting should be easy to fix once I can reproduce it. Are you
still using the 0.6.6 binary? I think it still has the threading
segfault that was fixed with LDC 1.0. Using the pre-release binary on
GitHub should be OK.

If you help me reproduce it, I can help you. Let's start with a
command line that is known to crash (now and then).

> sambamba is the only tool I know that
> correctly subsets a BAM file using a BED file. samtools doesn't
> use the indices so is very slow, and we work around this by
> sectioning in parts using the command line regions.

> In terms of new features, some useful things would be:
> - First class CRAM support. I'm not sure where this is currently at in
> sambamba but I suspect we'll be moving more and more to CRAM soon.
> - Support downsampling BAMs/CRAMs to a maximum coverage (only
> downsample if above a certain coverage). We added this in VariantBam and it's
> really useful for reducing out-of-control runtimes on WGS runs in collapsed
> repeats but we're stuck trying to make it fast enough to be useful:
> https://github.com/walaj/VariantBam/issues/13

Multithreaded programming in C++...

> Hope this helps with ideas, and thanks again for all your great work on
> sambamba,

Sounds to me like we should work on the segfaults for sure. mpileup
and downsampling could be next. At least there is clear demand :)

Pj.
