Using EC2 with Biopieces

Evan Adams

Dec 27, 2013, 3:14:56 PM

to biop...@googlegroups.com

Hello,

The pipeline I am trying to set up includes several restriction digests of the human genome, which require more computing power than my laptop provides. To solve this, I was hoping to use cloud computing. I am familiar with launching instances but have never configured an image before. Does anyone have an EC2 image with a working version of Biopieces, or could anyone provide guidance on how to set one up? Thank you in advance for your assistance, and happy holidays.

Best,
Evan

Martin Asser Hansen

Dec 27, 2013, 4:05:52 PM

to biop...@googlegroups.com

I have no idea about EC2, but most Biopieces stuff should work fine on a laptop. What, more precisely, are you trying to do?

Cheers,

Martin


Evan Adams

Dec 27, 2013, 4:17:38 PM

to biop...@googlegroups.com, ma...@maasha.dk

Hi Martin,

You are correct. I have Biopieces installed on my laptop and it works great.

The first command I run is:
read_fasta -i human_genome.fasta | digest_seq -p AGCT -c 2

This works fine, except that the human genome file is very large. I tried digesting only a single chromosome, which is around 250 MB unzipped, and the job did not finish overnight. Since I need to run about 10 of these digests, forwards and backwards, over the entire genome, I figure I need more RAM and CPU to complete the task. Any ideas?

Martin Asser Hansen

Dec 27, 2013, 4:50:55 PM

to biop...@googlegroups.com

digest_seq may underperform a bit because it is written in pure Ruby. You could look at the alternative, patscan_seq, which is memory efficient and much faster. However, patscan_seq is a generic pattern scanner, so it will take some extra work on your side to report correct restriction digests.
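
As a rough illustration of that extra work, the sketch below turns pattern match positions into restriction fragments in plain Ruby. It is not Biopieces code: `cut_sites` and `digest` are hypothetical helper names, and it assumes the cut offset is counted from the start of each match (so AGCT with offset 2 cuts as AG^CT), which may not be exactly how `digest_seq -c` is defined. Matching is case-sensitive, so soft-masked (lowercase) sequence would need upcasing first.

```ruby
# Hypothetical helper (not part of Biopieces): return the cut positions for
# every occurrence of `pattern` in `seq`, including overlapping occurrences.
# ASSUMPTION: `cut_offset` is counted from the start of each match.
def cut_sites(seq, pattern, cut_offset)
  sites = []
  pos = 0
  while (hit = seq.index(pattern, pos))
    sites << hit + cut_offset
    pos = hit + 1 # step by one so overlapping matches are not skipped
  end
  sites
end

# Hypothetical helper: split `seq` into fragments at the cut positions.
def digest(seq, pattern, cut_offset)
  bounds = ([0] + cut_sites(seq, pattern, cut_offset) + [seq.length]).uniq
  bounds.each_cons(2).map { |s_beg, s_end| seq[s_beg...s_end] }
end

p digest("TTAGCTAAAGCTGG", "AGCT", 2)
# => ["TTAG", "CTAAAG", "CTGG"]
```

Collecting the cut positions first and slicing afterwards keeps the scanning step separate, so a faster pattern scanner could supply the positions instead.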

Cheers,

Martin

Evan Adams

Jan 11, 2014, 6:45:28 PM

to biop...@googlegroups.com, ma...@maasha.dk

Hi Martin,

I am still working on the same problem and am getting an error that I can't figure out. My input command is:

read_fasta -i hg19.fa | digest_seq -p AGCT -c 2 | count_records --data_out=count.txt

and I get the following error message:

biopieces.rb:118:in `write': Broken pipe - <STDOUT> (Errno::EPIPE)
    from /home/jj/biopieces/code_ruby/lib/maasha/biopieces.rb:118:in `<<'
    from /home/jj/biopieces/code_ruby/lib/maasha/biopieces.rb:118:in `puts'
    from /home/jj/biopieces/bp_bin/read_fasta:54:in `block (4 levels) in <main>'
    from /home/jj/biopieces/code_ruby/lib/maasha/filesys.rb:102:in `each'
    from /home/jj/biopieces/bp_bin/read_fasta:53:in `block (3 levels) in <main>'
    from /home/jj/biopieces/code_ruby/lib/maasha/filesys.rb:74:in `open'
    from /home/jj/biopieces/bp_bin/read_fasta:52:in `block (2 levels) in <main>'
    from /home/jj/biopieces/bp_bin/read_fasta:51:in `each'
    from /home/jj/biopieces/bp_bin/read_fasta:51:in `block in <main>'
    from /home/jj/biopieces/code_ruby/lib/maasha/biopieces.rb:89:in `open'
    from /home/jj/biopieces/bp_bin/read_fasta:40:in `<main>'

If I read the error right, line 118 of biopieces.rb is in this method:

# Method to write a Biopiece record to _ios_.
def puts(foo)
  @ios << foo.to_s
end

Please advise.
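
A note on reading this kind of trace: `Errno::EPIPE` is raised in the process that writes to a pipe once the reading end has been closed. The trace above therefore shows read_fasta failing to write because the next command in the pipeline had already stopped reading, so the underlying failure is likely downstream rather than at line 118 itself. The sketch below reproduces the mechanism in plain Ruby with `IO.pipe`; it is a standalone illustration, does not involve Biopieces, and `write_after_reader_closed` is a hypothetical name.

```ruby
# Returns the exception class raised when writing to a pipe whose reading
# end is already closed, or nil if the write unexpectedly succeeds.
def write_after_reader_closed
  reader, writer = IO.pipe
  reader.close          # simulate the downstream command exiting early
  writer << "record\n"  # the upstream command keeps writing
  writer.flush
  nil
rescue Errno::EPIPE => e
  e.class
ensure
  writer.close if writer && !writer.closed?
end

p write_after_reader_closed # on Linux/macOS this is expected to print Errno::EPIPE
```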

Evan Adams

Jan 11, 2014, 7:47:10 PM

to biop...@googlegroups.com, ma...@maasha.dk

I have gotten this to work on a chromosome-by-chromosome basis. I am using the hg19 assembly available from the UCSC Genome Browser. I suspect the problem is something in the formatting of the combined FASTA file (hg19.fa). Martin, if you have any ideas, let me know; for now I am managing by running each chromosome manually.
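
The chromosome-by-chromosome workaround can also be scripted rather than run by hand. The sketch below is plain Ruby, not part of Biopieces; `each_fasta_record` is a hypothetical helper that streams a multi-record FASTA file and yields one record at a time, so only one chromosome's sequence is held in memory. What is done with each record (here just printing its length) is a placeholder.

```ruby
require "stringio"

# Hypothetical helper (not part of Biopieces): yield [name, sequence] for
# each FASTA record read from `io`, keeping one record in memory at a time.
def each_fasta_record(io)
  name = nil
  chunks = []
  io.each_line do |line|
    line = line.chomp
    if line.start_with?(">")
      yield name, chunks.join if name # emit the previous record
      name = line[1..-1]
      chunks = []
    elsif name
      chunks << line
    end
  end
  yield name, chunks.join if name # emit the final record
end

# Small in-memory demo; for a real file, something like
#   File.open("hg19.fa") { |f| each_fasta_record(f) { |name, seq| ... } }
demo = StringIO.new(">chr1\nACGT\nAC\n>chr2\nGGTT\n")
each_fasta_record(demo) { |name, seq| puts "#{name}\t#{seq.length}" }
```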

Martin Asser Hansen

Jan 13, 2014, 7:33:13 AM

to biop...@googlegroups.com

Hi Evan,

I was able to reproduce this error, and I managed to make it disappear by upgrading some old code, but I never found out what specifically caused it. That said, digest_seq is very slow for large sequences such as eukaryotic genomes. It can be improved, but that will require more work on my part.

Run bp_update && bp_test

Cheers,

Martin

Martin Asser Hansen

Jan 13, 2014, 7:36:19 AM

to biop...@googlegroups.com

By the way, to get basic stats on something like digests, you could do something like:

read_fasta -n 1 -i hg19.fa | digest_seq -p AGCT -c 2 | grab -p DIGEST | analyze_vals -x | write_tab -cCpx

+----------+------------+-------+-----+-----+-------+--------+
| KEY      | TYPE       | COUNT | MIN | MAX |   SUM |   MEAN |
+----------+------------+-------+-----+-----+-------+--------+
| SEQ      | Alphabetic |     4 |  32 | 502 | 1,000 |  250.0 |
| REC_TYPE | Alphabetic |     4 |   6 |   6 |    24 |    6.0 |
| SEQ_LEN  | Numeric    |     4 |  32 | 502 | 1,000 |  250.0 |
| SEQ_NAME | Alphabetic |     4 |  12 |  14 |    53 |  13.25 |
| S_BEG    | Numeric    |     4 |   0 | 861 | 2,017 | 504.25 |
| S_END    | Numeric    |     4 |  31 | 999 | 1,857 | 464.25 |
+----------+------------+-------+-----+-----+-------+--------+
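
For reference, the COUNT, MIN, MAX, SUM and MEAN columns for a numeric key such as SEQ_LEN are simple aggregates over the records. A minimal plain-Ruby sketch follows, with `length_stats` as a hypothetical helper and made-up fragments that are not taken from the table above:

```ruby
# Hypothetical helper (not part of Biopieces): summary statistics over a
# list of fragment lengths. Returns nil for an empty list.
def length_stats(lengths)
  return nil if lengths.empty?

  sum = lengths.sum
  {
    count: lengths.size,
    min: lengths.min,
    max: lengths.max,
    sum: sum,
    mean: sum.to_f / lengths.size
  }
end

# Made-up example fragments, used only to exercise the helper.
fragments = %w[TTAG CTAAAG CTGGA]
p length_stats(fragments.map(&:length))
```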
