Biopieces2.0

144 views
Skip to first unread message

Martin Asser Hansen

unread,
Jun 27, 2014, 7:29:44 AM6/27/14
to biop...@googlegroups.com
Hello all,

For a long time I have been thinking about the next generation of Biopieces. And for a long time I have been working on a prototype. This prototype is now alive at GitHub:


and available as a Ruby gem that can be installed with the command "gem install biopieces".

Have a look at the white paper below (also attached as PDF):


Please share any thoughts!


Cheers,


Martin

Biopieces 2.0 white paper

Table of content

Table of content

Summary

History

Compatibility with Biopieces1.0

Installation

Requirements

Principle

Code examples

A simple Ruby script

Combining multiple pipelines

Running pipelines in parallel

Ruby one-liner

Interactive Ruby shell

New features

History file

Log file

Email notification

Progress reporting

Improved testing

Documentation

Development

Future

Contributing

Contact


Summary

Biopieces2.0 is going to be the next-generation of Biopieces. Biopieces2.0 is not a drop-in replacement of Biopieces1.0, but a Software Development Kit (SDK) written in Ruby. In fact, Biopieces2.0 is not backwards compatible with Biopieces1.0. The reasons for creating Biopieces2.0 are improved installation, development, maintainability, scalability and performance. While Biopieces2.0 is optimized for writing simple scripts with pipelines to be run it is also possible to do quick and dirty analysis using a Ruby shell. Biopieces2.0 is twice as fast at Biopieces1.0.


History

The Biopieces1.0 project was started in August 1997. The idea was to write a set of tools that could be piped together using the command line in a Unix environment in such a way that the output of one tool was used as input to the next tool. This was by no means a new idea; in fact it is native to the Unix environment where tools such as cat, cut, grep and sort work in this way. However, these tools are generic and not optimal for working with biological data and sequence data in particular. One could use sed, awk, Perl or some such tool, but those requires quite an often overwhelming effort for the uninitiated user. Also, other systems with the same idea as the Biopieces exists, e.g. the Boulder system. However, Biopieces focuses on usability; getting something working fast and simple.

Compatibility with Biopieces1.0

As of now Biopeices2.0 is not backwards compatible with Biopieces2.0. Legacy functionality was tested, but performance was suboptimal. So for now Biopieces2.0 can be used in Ruby scripts and interactive Ruby shells like irb and pry, but not on the command line unless as Ruby one-liners.

Installation

Installation will be as simple as installing a Ruby gem:


$ gem install biopieces

Requirements

The requirements will be Ruby 2.0 or later. Dependencies like BLAST, BWA, Muscle, etc must still be handled by the user (or perhaps a docker image in the future).

Principle

The idea is, as in Biopieces1.0, to pass data records between different command each doing one specific task to records with the correct attributes. However, instead of doing this on the command line you now do this in Ruby scripts in a two step way: first a pipeline is created, and then that pipeline is run. Each command is running in a process of it own as in Biopieces1.0. This is achieved by forking the process for each command and sending data between processes using msgpack. Obviously, it would be more efficient to use multithreading where you can send objects between processes and save the overhead of marshalling objects. However, since MRI Ruby is limited to one compute core when using threads, this prevents parallel computing on multiple cores.

Code examples

A simple Ruby script

The below script sets up a pipeline that reads all FASTA entries from the file “test.fna” and then uses grab to filter all records where the sequence matches “ATCG” and then a histogram of sequence lengths is save to the file “plot.png” and then the filtered FASTA entries is written to the file “out.fna”.


#!/usr/bin/env ruby


require 'biopieces'


p = BP.new.

read_fasta(input: "test.fna").

grab(keys: :SEQ, select: "ATCG").

plot_histogram(key: :SEQ_LEN, terminal: :png, output: "plot.png").

write_fasta(output: "out.fna").

run(progress: true)


Combining multiple pipelines

This script demonstrates how multiple pipelines can be created and combined. In the end two pipelines are run, one consisting of p1 + p2 and one consisting of p1 + p3.


#!/usr/bin/env ruby


require 'biopieces'


p1 = BP.new.read_fasta(input: "test.fna")

p2 = BP.new.grab(keys: :SEQ, select: "ATCG").

    plot_histogram(key: :SEQ_LEN, terminal: :png, output: "select.png")

p3 = BP.new.grab(key: :SEQ, reject: "ATCG").

    plot_histogram(key: :SEQ_LEN, terminal: :png, output: "reject.png")

p4 = p1 + p3

(p1 + p2).write_fasta(output: "select.fna").run(progress: true)

p4.write_fasta(output: "reject.fna").run(progress: true)


Running pipelines in parallel

This script demonstrates how to run multiple pipelines in parallel. Here we filter pair-end FASTQ entries from a list of samples, each consisting of a sample name, a forward read file and a reverse read file is processed in parallel.


#!/usr/bin/env ruby


require 'biopieces'


samples = read_samples("samples.txt)


Parallel.each(samples, in_threads: 20) do |sample|

 BP.new.

 read_fastq(input: sample[1], input2: sample[2], encoding: :base_33).

 grab(keys: :SEQ, select: "ATCG").

 write_fastq(output: "#{sample[0]}_filted.fastq.bz2", bzip2: true).

 run

end


Ruby one-liner

Here we use the alias bp which is set in .bashrc like this: alias bp='ruby -r biopieces':


$ bp -e 'BP.new.read_fasta(input: "test_big.fna", first: 1000).plot_histogram(key: :SEQ_LEN).run'


Interactive Ruby shell

While Ruby scripts are great for setting up pipelines the command line is a favored choice for quick and dirty everyday work. However, Biopieces2.0 doesn’t work from the UNIX command line, but works great from the interactive ruby shell irb:


irb(main):001:0> require ‘biopieces’

=> true

irb(main):002:0> Pipe.new.add(:read_fasta, input: “in.fna”).

irb(main):003:0* add(swapcase_seq).

irb(main):004:0* add(write_fasta, output: “out.fna”).run

=> true


Or the more advanced alternative pry that supports tab completion, indentation and syntax highlighting:

Screen Shot 2014-06-26 at 10.24.15 AM.png


A wonderful advance of using these shells is that errors will be raised in case of typos and bad options while setting up the pipeline, so that these can be corrected before actual running of the pipeline.

New features

History file

A history file is kept in $USER/.biopieces_history and each time run is called a history entry is added to this file:


BP.new.read_fasta(input: "test_big.fna", first: 100).plot_histogram(key: :SEQ_LEN).run

BP.new.read_fasta(input: "test_big.fna", first: 100).plot_histogram(key: :SEQ_LEN).run

BP.new.read_fasta(input: "test_big.fna", first: 10).plot_histogram(key: :SEQ_LEN).run

BP.new.read_fasta(input: "test_big.fna").plot_histogram(key: :SEQ_LEN).run

BP.new.read_fasta(input: "test_big.fna", first: 1000).plot_histogram(key: :SEQ_LEN).run


Thus it is possible to redo the last pipeline by pasting the line in irb or pry or a Ruby one-liner.

Log file

A log file is kept in $USER/.biopieces_log and each time run is called an entry consisting of the pipeline, run status and exit status is added to this file:

BP.new.read_fasta(input: "test_big.fna", first: 1000).plot_histogram(key: :SEQ_LEN).run

{:status=>

 [{:command=>:read_fasta,

   :options=>{:input=>["test_big.fna"], :first=>1000},

   :records_in=>0,

   :records_out=>1000,

   :time_elapsed=>"2.002744"},

  {:command=>:plot_histogram,

   :options=>

    {:key=>:SEQ_LEN,

     :terminal=>:dumb,

     :title=>"Histogram",

     :xlabel=>:SEQ_LEN,

     :ylabel=>"n"},

   :records_in=>1000,

   :records_out=>0,

   :time_elapsed=>"2.007338"}],

:time_start=>2014-06-26 10:24:01 +0200,

:time_stop=>2014-06-26 10:24:03 +0200,

:time_elapsed=>2.106092}

OK

Email notification

Using the email option in run will cause an email with the run status to be sent to that address upon completion of the pipeline:


run(email: "ma...@maasha.dk")


Progress reporting

Using the progress option in run will display a continuously updated progress report where you for each command in the pipeline can see how many records were parsed and emitted along with elapsed time etc:


run(progress: true)


Improved testing

Testing coverage will be monitored with the Ruby Gem simplecov the goal being 100% tested code:

Screen Shot 2014-06-26 at 9.51.04 AM.png

Documentation

Documentation moved to Rdoc instead of Google Code’s wiki format. Documentation coverage goal is 100%. As a result documentation is available instantly using the ri tool:


$ ri dump



Screen Shot 2014-06-26 at 9.56.07 AM.png

or through a web interface:

Screen Shot 2014-06-26 at 9.53.08 AM.png

Development

Source code is located on GitHub and git is used for version control.


The Biopieces Ruby gem is located on Rubyforge and  semantic versioning is used.

Future

If the community believe that Biopeices2.0 is a good idea and no design flaws surfaces, I will proceed development. This implies that I will stop development of Biopieces1.0. Others are welcome to take over.

Contributing

Feedback from users in the form of suggestions and bug reports are highly valued.


Ruby savvy contributors are most welcome.

Contact


Martin Asser Hansen

Cand. Scient., PhD


Bioinformatician

Molecular Microbial Ecology Group

Host-Microbiome Interaction Group

University of Copenhagen

Universitetsparken 15

DK-2100

DENMARK


E-mail:  ma...@maasha.dk

Skype:   martinasserhansen

Website: www.maasha.dk

Mobile:  +45 60722228


Biopieces2.0whitepaper.pdf
Reply all
Reply to author
Forward
0 new messages