Biopieces 2.0 white paper
Compatibility with Biopieces1.0
Biopieces2.0 will be the next generation of Biopieces. It is not a drop-in replacement for Biopieces1.0, but a Software Development Kit (SDK) written in Ruby, and it is not backwards compatible with Biopieces1.0. The reasons for creating Biopieces2.0 are improved installation, development, maintainability, scalability and performance. While Biopieces2.0 is optimized for writing simple pipeline scripts, it is also possible to do quick and dirty analysis using a Ruby shell. Biopieces2.0 is twice as fast as Biopieces1.0.
The Biopieces1.0 project was started in August 2007. The idea was to write a set of tools that could be piped together on the command line in a Unix environment, in such a way that the output of one tool was used as input to the next. This was by no means a new idea; in fact it is native to the Unix environment, where tools such as cat, cut, grep and sort work in this way. However, these tools are generic and not optimal for working with biological data, and sequence data in particular. One could use sed, awk, Perl or some such tool, but those require an often overwhelming effort from the uninitiated user. Also, other systems with the same idea as Biopieces exist, e.g. the Boulder system. However, Biopieces focuses on usability: getting something working fast and simply.
As of now, Biopieces2.0 is not backwards compatible with Biopieces1.0. Legacy functionality was tested, but performance was suboptimal. So for now Biopieces2.0 can be used in Ruby scripts and interactive Ruby shells like irb and pry, but not on the command line except as Ruby one-liners.
Installation will be as simple as installing a Ruby gem:
$ gem install biopieces
The requirements will be Ruby 2.0 or later. Dependencies like BLAST, BWA, Muscle, etc. must still be handled by the user (or perhaps by a Docker image in the future).
The idea is, as in Biopieces1.0, to pass data records between different commands, each performing one specific task on records with the relevant attributes. However, instead of doing this on the command line, you now do it in Ruby scripts in a two-step way: first a pipeline is created, and then that pipeline is run. Each command runs in a process of its own, as in Biopieces1.0. This is achieved by forking a process for each command and sending data between processes using msgpack. It would be more efficient to use multithreading, where objects can be passed between threads without the overhead of marshalling. However, MRI Ruby is limited to one compute core when using threads, which prevents parallel computing on multiple cores.
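The forking-and-serialization pattern described above can be sketched with stdlib tools. The real implementation serializes with msgpack; here Ruby's built-in Marshal stands in so the example runs without extra gems, and the record contents are made up for illustration:

```ruby
# Sketch of inter-process record passing: a forked child emits records
# over a pipe as length-prefixed serialized payloads, and the parent
# deserializes them. Biopieces2.0 uses msgpack; Marshal substitutes here.
reader, writer = IO.pipe

pid = fork do
  reader.close
  # Child process: serialize each record and prefix it with its byte size.
  [{ SEQ_NAME: "test1", SEQ: "ATCG" },
   { SEQ_NAME: "test2", SEQ: "GGCC" }].each do |record|
    payload = Marshal.dump(record)
    writer.write([payload.bytesize].pack("N"))
    writer.write(payload)
  end
  writer.close
end

writer.close
records = []
# Parent process: read length-prefixed payloads until EOF.
while (header = reader.read(4))
  size = header.unpack("N").first
  records << Marshal.load(reader.read(size))
end
reader.close
Process.wait(pid)

records.each { |record| puts record.inspect }
```

The length prefix is one simple framing choice; msgpack streams its own self-delimiting objects, so the real pipeline avoids explicit framing.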
The below script sets up a pipeline that reads all FASTA entries from the file "test.fna", uses grab to filter all records where the sequence matches "ATCG", saves a histogram of sequence lengths to the file "plot.png", and finally writes the filtered FASTA entries to the file "out.fna".
#!/usr/bin/env ruby

require 'biopieces'

p = BP.new.
    read_fasta(input: "test.fna").
    grab(keys: :SEQ, select: "ATCG").
    plot_histogram(key: :SEQ_LEN, terminal: :png, output: "plot.png").
    write_fasta(output: "out.fna").
    run(progress: true)
This script demonstrates how multiple pipelines can be created and combined. In the end two pipelines are run, one consisting of p1 + p2 and one consisting of p1 + p3.
#!/usr/bin/env ruby

require 'biopieces'

p1 = BP.new.read_fasta(input: "test.fna")

p2 = BP.new.grab(keys: :SEQ, select: "ATCG").
     plot_histogram(key: :SEQ_LEN, terminal: :png, output: "select.png")

p3 = BP.new.grab(keys: :SEQ, reject: "ATCG").
     plot_histogram(key: :SEQ_LEN, terminal: :png, output: "reject.png")

p4 = p1 + p3

(p1 + p2).write_fasta(output: "select.fna").run(progress: true)

p4.write_fasta(output: "reject.fna").run(progress: true)
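One plausible way to support the + composition used above is for each pipeline to hold a list of commands, with + concatenating those lists into a new pipeline so the operands stay reusable. The MiniPipe class below is an illustrative stand-in, not the actual Biopieces2.0 internals:

```ruby
# Illustrative sketch of pipeline composition via +.
# MiniPipe is a made-up stand-in for the real BP class.
class MiniPipe
  attr_reader :commands

  def initialize(commands = [])
    @commands = commands
  end

  # Record a command; return self so calls chain like BP.new.read_fasta...
  def add(name, options = {})
    @commands << [name, options]
    self
  end

  # Composition returns a NEW pipeline, leaving both operands untouched,
  # which is what lets p1 be reused in both p1 + p2 and p1 + p3 above.
  def +(other)
    MiniPipe.new(@commands + other.commands)
  end
end

p1 = MiniPipe.new.add(:read_fasta, input: "test.fna")
p2 = MiniPipe.new.add(:grab, keys: :SEQ, select: "ATCG")
combined = p1 + p2
puts combined.commands.inspect
```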
This script demonstrates how to run multiple pipelines in parallel. Here we filter paired-end FASTQ entries from a list of samples; each sample, consisting of a sample name, a forward read file and a reverse read file, is processed in parallel.
#!/usr/bin/env ruby

require 'biopieces'
require 'parallel'

samples = read_samples("samples.txt")

Parallel.each(samples, in_threads: 20) do |sample|
  BP.new.
  read_fastq(input: sample[1], input2: sample[2], encoding: :base_33).
  grab(keys: :SEQ, select: "ATCG").
  write_fastq(output: "#{sample[0]}_filtered.fastq.bz2", bzip2: true).
  run
end
Here we use the alias bp, which is set in .bashrc like this: alias bp='ruby -r biopieces':
$ bp -e 'BP.new.read_fasta(input: "test_big.fna", first: 1000).plot_histogram(key: :SEQ_LEN).run'
While Ruby scripts are great for setting up pipelines, the command line is a favored choice for quick and dirty everyday work. However, although Biopieces2.0 doesn't work from the UNIX command line, it works great from the interactive Ruby shell irb:
irb(main):001:0> require 'biopieces'
=> true
irb(main):002:0> Pipe.new.add(:read_fasta, input: "in.fna").
irb(main):003:0* add(:swapcase_seq).
irb(main):004:0* add(:write_fasta, output: "out.fna").run
=> true
Or the more advanced alternative, pry, which supports tab completion, indentation and syntax highlighting.
A wonderful advantage of using these shells is that errors are raised in case of typos and bad options while setting up the pipeline, so these can be corrected before actually running the pipeline.
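This fail-fast behavior can be sketched as option validation at pipeline-construction time. The CheckedPipe class and its allowed-option table below are hypothetical, not the real Biopieces2.0 validation code:

```ruby
# Sketch of fail-fast option checking when a command is added to a
# pipeline, before run is ever called. The command names and allowed
# option lists are made up for the example.
class CheckedPipe
  ALLOWED = {
    read_fasta: [:input, :first],
    grab:       [:keys, :select, :reject]
  }.freeze

  def initialize
    @commands = []
  end

  # Unknown commands and unknown options raise immediately at setup time.
  def method_missing(name, *args)
    allowed = ALLOWED[name]
    raise ArgumentError, "unknown command: #{name}" unless allowed

    options = args.first || {}
    bad = options.keys - allowed
    raise ArgumentError, "bad options for #{name}: #{bad.inspect}" unless bad.empty?

    @commands << [name, options]
    self
  end

  def respond_to_missing?(name, include_private = false)
    ALLOWED.key?(name) || super
  end
end

begin
  CheckedPipe.new.read_fasta(inptu: "test.fna") # note the typo in the option key
rescue ArgumentError => e
  puts "caught at setup time: #{e.message}"
end
```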
A history file is kept in $HOME/.biopieces_history, and each time run is called a history entry is appended to this file:
BP.new.read_fasta(input: "test_big.fna", first: 100).plot_histogram(key: :SEQ_LEN).run
BP.new.read_fasta(input: "test_big.fna", first: 100).plot_histogram(key: :SEQ_LEN).run
BP.new.read_fasta(input: "test_big.fna", first: 10).plot_histogram(key: :SEQ_LEN).run
BP.new.read_fasta(input: "test_big.fna").plot_histogram(key: :SEQ_LEN).run
BP.new.read_fasta(input: "test_big.fna", first: 1000).plot_histogram(key: :SEQ_LEN).run
Thus it is possible to redo the last pipeline by pasting the line into irb or pry, or by using it as a Ruby one-liner.
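The history mechanism itself amounts to appending each pipeline's source line to the history file on every run. A minimal sketch, where the log_history helper and the demo file path are made up for illustration:

```ruby
# Sketch: append a pipeline's source string to a history file on each run.
require "tmpdir"

def log_history(line, path)
  File.open(path, "a") { |file| file.puts(line) }
end

# Demo path; the real file would be ~/.biopieces_history.
history = File.join(Dir.tmpdir, "biopieces_history_demo")
File.delete(history) if File.exist?(history)

log_history('BP.new.read_fasta(input: "test.fna").run', history)
log_history('BP.new.read_fasta(input: "test.fna", first: 10).run', history)

puts File.read(history)
```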
A log file is kept in $HOME/.biopieces_log, and each time run is called an entry consisting of the pipeline, run status and exit status is appended to this file:
BP.new.read_fasta(input: "test_big.fna", first: 1000).plot_histogram(key: :SEQ_LEN).run
{:status=>
  [{:command=>:read_fasta,
    :options=>{:input=>["test_big.fna"], :first=>1000},
    :records_in=>0,
    :records_out=>1000,
    :time_elapsed=>"2.002744"},
   {:command=>:plot_histogram,
    :options=>
     {:key=>:SEQ_LEN,
      :terminal=>:dumb,
      :title=>"Histogram",
      :xlabel=>:SEQ_LEN,
      :ylabel=>"n"},
    :records_in=>1000,
    :records_out=>0,
    :time_elapsed=>"2.007338"}],
 :time_start=>2014-06-26 10:24:01 +0200,
 :time_stop=>2014-06-26 10:24:03 +0200,
 :time_elapsed=>2.106092}
OK
Using the email option in run will cause an email with the run status to be sent to that address upon completion of the pipeline:
run(email: "ma...@maasha.dk")
Using the progress option in run will display a continuously updated progress report showing, for each command in the pipeline, how many records were parsed and emitted, along with elapsed time, etc.:
run(progress: true)
Test coverage will be monitored with the Ruby gem simplecov, the goal being 100% tested code.
Documentation has moved to RDoc from Google Code's wiki format. The documentation coverage goal is 100%. As a result, documentation is available instantly using the ri tool:
$ ri dump
or through a web interface.
Source code is located on GitHub and git is used for version control.
The Biopieces Ruby gem is located on RubyForge, and semantic versioning is used.
If the community believes that Biopieces2.0 is a good idea and no design flaws surface, I will proceed with development. This implies that I will stop development of Biopieces1.0; others are welcome to take it over.
Feedback from users in the form of suggestions and bug reports is highly valued.
Ruby savvy contributors are most welcome.
Martin Asser Hansen
Cand. Scient., PhD
Bioinformatician
Molecular Microbial Ecology Group
Host-Microbiome Interaction Group
University of Copenhagen
Universitetsparken 15
DK-2100
DENMARK
E-mail: ma...@maasha.dk
Skype: martinasserhansen
Website: www.maasha.dk
Mobile: +45 60722228