wondering how everybody operates the GA pipeline

Aparna P

Aug 9, 2010, 9:36:49 AM
to solexa
Hi,

I am curious to know how everybody archives GA runs. Right now I am
using rsync to copy gigabytes of data to an archiving location, and it
takes a few days.

Thx.

Jaemyun Lyu

Aug 9, 2010, 9:41:42 AM
to solexa
How about using a gigabit local network? It won't take such a long time.

David Dooling

Aug 9, 2010, 9:46:13 AM
to solexa

rsync, especially when using ssh, can incur significant overhead. If
you are on a private network, you can disable encryption (--rsh=rsh)
and get a better speedup.

What all are you archiving? The whole run directory? Does the run
directory contain images? cifs? bcls? qseqs? Just the GERALD
directory? If you are archiving the whole run directory, you might
want to think about archiving less, e.g., a BAM with all the sequences,
qualities, and alignments.
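
A minimal sketch of the two suggestions above, assuming a trusted private network where rsh is available on both hosts. It only builds the rsync command line rather than running it. The paths, the host name, and the exclude pattern are hypothetical placeholders, not names taken from an actual run folder.

```python
def build_rsync_command(source, dest, remote_shell="rsh", excludes=()):
    """Return an rsync argument list; nothing is executed here.

    remote_shell="rsh" mirrors the --rsh=rsh suggestion (no encryption),
    so it is only appropriate on a private network.
    """
    cmd = ["rsync", "--archive", f"--rsh={remote_shell}"]
    # Each pattern is passed through to rsync's --exclude option so that
    # bulky files you do not want to archive are skipped.
    for pattern in excludes:
        cmd.append(f"--exclude={pattern}")
    cmd += [source, dest]
    return cmd


if __name__ == "__main__":
    # Hypothetical run folder, archive host, and exclude pattern.
    cmd = build_rsync_command(
        "/data/runs/run_folder/",
        "archivehost:/archive/run_folder/",
        excludes=["Images/"],
    )
    print(" ".join(cmd))
```

Keeping the command as a list makes it easy to inspect before handing it to something like subprocess.run.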

--
David Dooling
The Genome Center at Washington University
http://genome.wustl.edu/

Jesse Becker

Aug 9, 2010, 10:08:33 AM
to sol...@googlegroups.com
On Mon, Aug 9, 2010 at 09:46, David Dooling <ddoo...@wustl.edu> wrote:
> On Mon, Aug 09, 2010 at 06:41:42AM -0700, Jaemyun Lyu wrote:
>> How about using a gigabit local network? It won't take such a long time.
>>
>> On Aug 9, 10:36 pm, Aparna P <aparna.pal...@gmail.com> wrote:
>> > I am curious to know how everybody archives GA runs. Right now I am
>> > using rsync to copy gigabytes of data to an archiving location, and it
>> > takes a few days.
>
> rsync, especially when using ssh, can incur significant overhead.  If
> you are on a private network, you can disable encryption (--rsh=rsh)
> and get a better speedup.

Agreed. If you have to use ssh, you can get slightly better data
rates by using the arcfour cipher instead of the default AES or 3DES.
There's also HPN-SSH[1] which does a number of things in order to
speed up SSH transfers, including adding a "no-encryption" cipher
(authentication is *ALWAYS* encrypted; passwords are never sent in the
clear).

[1] http://www.psc.edu/networking/projects/hpn-ssh/
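
As a rough sketch of the cipher suggestion, the helper below builds the --rsh value that tells rsync to run ssh with a specific cipher. The function name is hypothetical. Whether a cipher such as arcfour is accepted depends on the ssh build on both ends, so the cipher is left as a parameter.

```python
def ssh_transport_option(cipher=None):
    """Build an rsync --rsh option that runs ssh, optionally with -c <cipher>.

    rsync splits the --rsh value on whitespace, so the whole ssh command
    is passed as a single argument.
    """
    parts = ["ssh"]
    if cipher:
        parts += ["-c", cipher]
    return "--rsh=" + " ".join(parts)


if __name__ == "__main__":
    # The cipher name comes from the message above; availability is an assumption.
    print(ssh_transport_option("arcfour"))
    print(ssh_transport_option())
```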

> What all are you archiving?  The whole run directory?  Does the run
> directory contain images? cifs? bcls? qseqs?  Just the GERALD
> directory?  If you are archiving the whole run directory, you might
> want to think about archiving less, e.g., a BAM with all the sequences,
> qualities, and alignments.

If it takes more than a few hours, you may have network or disk IO
problems. The most recent analysis directory we have is about 73G.
You should be able to transfer this on a clean network quickly. I
just did a quick test using a 3,511MB file, reading from an rsync
export (not using SSH for tunneling), and maxed out at about 40MB/sec,
and averaged around 30MB/s (both systems involved were busy doing
other things; this was not a "clean" benchmarking environment). For a
73G directory, you are looking at about 2,400 seconds (roughly 40
minutes) for pure data transfer, plus a few minutes for rsync to walk
the directory tree.
All told, well under an hour. :)
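
A quick way to redo this estimate for your own numbers, assuming decimal units (1 GB = 1000 MB) and a sustained rate equal to the measured one. At the 30 MB/s average, 73 GB comes to roughly 2,400 seconds, or about 40 minutes. The function name is just for illustration.

```python
def transfer_seconds(size_gb, rate_mb_per_s):
    """Seconds of pure data transfer, assuming 1 GB = 1000 MB."""
    return size_gb * 1000 / rate_mb_per_s


if __name__ == "__main__":
    # 73 GB directory at the average and peak rates from the test above.
    for rate in (30, 40):
        seconds = transfer_seconds(73, rate)
        print(f"{rate} MB/s: {seconds:.0f} s ({seconds / 60:.1f} min)")
```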


--
Jesse Becker
Every cloud has a silver lining, except for the mushroom-shaped ones,
which come lined with strontium-90.

Aparna P

Aug 9, 2010, 6:04:48 PM
to solexa

I delete the images and intensities and sync the rest.
We do have gigabit links, and I ssh to that server to sync the files.
It still takes 2 days to sync.

David Dooling

Aug 9, 2010, 10:55:05 PM
to solexa
On Mon, Aug 09, 2010 at 03:04:48PM -0700, Aparna P wrote:
> I delete the images and intensities and sync the rest.
> We do have gigabit links, and I ssh to that server to sync the files.
> It still takes 2 days to sync.

For a 50G run, the Basecalls/Bustard directory should be about 500 GB.
If it is taking 2 days, you are getting about 24 Mbps from your 1 Gbps
link. You have a bottleneck somewhere and the most likely culprits
have already been posted on this thread. Someone at your site will need
to do some sleuthing.
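
The throughput figure can be checked with a few lines, assuming decimal units (1 GB = 10^9 bytes) and a nominal 1000 Mbps link. With 500 GB over 2 days this gives roughly 23 Mbps, in line with the figure above and only a few percent of the link. The function names are illustrative.

```python
def effective_mbps(size_gb, days):
    """Average throughput in megabits per second, assuming 1 GB = 1e9 bytes."""
    bits = size_gb * 1e9 * 8
    seconds = days * 86400
    return bits / seconds / 1e6


def link_utilization(mbps, link_mbps=1000):
    """Fraction of the nominal link speed actually achieved."""
    return mbps / link_mbps


if __name__ == "__main__":
    mbps = effective_mbps(500, 2)
    print(f"{mbps:.1f} Mbps, {link_utilization(mbps):.1%} of a 1 Gbps link")
```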

Aparna P

Aug 10, 2010, 10:59:27 AM
to solexa
Thanks very much for all the expert suggestions. I will try changing
the cipher options and see what happens.
Once again, thanks all.

hemant kelkar

Aug 10, 2010, 11:34:50 AM
to sol...@googlegroups.com
Aparna,

If you are transferring the data across VLANs/routers, it is possible that an intrusion prevention device (a hardware device used to scan for viruses/malware on networks) is throttling your data transfer. You may wish to talk with your network folks to see if that is a possibility. 2 days for an rsync sounds very long indeed.

- Hemant


--
You received this message because you are subscribed to the Google Groups "solexa" group.
To post to this group, send email to sol...@googlegroups.com.
To unsubscribe from this group, send email to solexa+un...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/solexa?hl=en.


Jesse Becker

Aug 10, 2010, 11:48:31 AM
to sol...@googlegroups.com
Switching ciphers will help, but only a little (a few %, at most).
You're better off using a direct rsync connection if possible,
avoiding tunneling entirely (and also looking for other bottlenecks...)
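
A minimal sketch of what a direct rsync connection looks like on the command line, assuming an rsync daemon is already running on the source machine and exports a module. The host, module, and path names are hypothetical, and the command is only built, not run.

```python
def rsync_daemon_url(host, module, path=""):
    """Build an rsync:// URL, which talks to an rsync daemon with no ssh tunnel."""
    return f"rsync://{host}/{module}/{path}"


def build_daemon_pull(host, module, path, dest):
    """Return an rsync argument list that pulls from a daemon export."""
    return ["rsync", "--archive", rsync_daemon_url(host, module, path), dest]


if __name__ == "__main__":
    # Hypothetical daemon host "seqhost" exporting a module named "runs".
    print(" ".join(build_daemon_pull("seqhost", "runs", "run_folder/", "/archive/run_folder/")))
```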
