How much memory do I need when dealing with large data?

yexiay...@gmail.com

Dec 17, 2015, 9:52:49 PM
to AftrRAD
Hi Mike,

I have paired-end (PE), double-digest RAD data, about 950 GB. How much memory do I need to analyze these data? Also, my sequences are 140 bp long. Can this pipeline handle that?

Also, how does AftrRAD's memory usage compare with PyRAD's: larger or smaller?

Mike Sovic

Dec 18, 2015, 1:42:30 PM
to AftrRAD
Hi,

I don't think the sequence length should matter - as long as all sequences are the same length.

Note that AftrRAD is not set up to run paired-end data. We've been thinking about how best to do this, but I haven't come up with an approach I'm really comfortable with yet. For now, I recommend running only single-end data with AftrRAD.

I really don't know much about the memory requirements for PyRAD, but if anyone else can speak to this, please do.


In terms of memory requirements for AftrRAD, they depend on a few factors. I'll try to give some guidelines below for estimating how much you might need (though I'm not making any promises with any of this; these are just general rules of thumb), plus a short sketch after the list that puts all the cases in one place…

For RAM...

If running demultiplexed data on a single processor (not in parallel)…
             
        Current version (4.1): add up all of the individual file sizes (fastq files), divide this number by 3, and that's approximately how much RAM you might need. It's been brought to my attention that this is problematic for some large datasets, so I have a fix for the next version…

       Upcoming version (v5 - should be available soon):  Take the size of the largest file, divide by 3.

If running demultiplexed data in parallel…

       Current version (4.1): same as the single-processor case above (sum all of the individual file sizes and divide by 3).

       Upcoming version: take the sizes of the N largest data files, where N is the number of processors you will use. Add these together and divide by 3.


If running undemultiplexed data (default in AftrRAD) on a single processor (not in parallel)…

       Current version (4.1):  Take the size of the largest datafile and divide by 3.

       Upcoming version:  Take the size of the largest datafile and divide by 3.


If running undemultiplexed data (default in AftrRAD) in parallel…

       Current version (4.1):  Take the size of the largest datafile and divide by 3.
  
       Upcoming version:  Take the size of the largest datafile and divide by 3.
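
To put these rules of thumb in one place, here's a minimal sketch in Python (my own summary for this thread, not part of AftrRAD itself; the function name, the .fastq extension, and the "Data" directory are just illustrative assumptions):

import glob
import os

# Rule-of-thumb RAM estimate, per the guidelines above. Hypothetical helper,
# not part of AftrRAD; assumes your data files end in .fastq.
def estimate_ram_bytes(fastq_dir, demultiplexed, parallel, version, n_procs=1):
    sizes = sorted((os.path.getsize(f)
                    for f in glob.glob(os.path.join(fastq_dir, "*.fastq"))),
                   reverse=True)
    if not sizes:
        raise ValueError("no .fastq files found in " + fastq_dir)
    if demultiplexed:
        if version >= 5 and parallel:
            # Upcoming version, parallel: sum of the N largest files / 3,
            # where N is the number of processors used.
            return sum(sizes[:n_procs]) / 3
        if version >= 5:
            # Upcoming version, single processor: largest file / 3.
            return sizes[0] / 3
        # Current version (4.1), single processor or parallel:
        # sum of all individual file sizes / 3.
        return sum(sizes) / 3
    # Undemultiplexed data (the AftrRAD default): largest file / 3,
    # whether single processor or parallel, in both versions.
    return sizes[0] / 3

# Example: demultiplexed data, current version (4.1), single processor.
# "Data" here is just a placeholder directory name.
gb = estimate_ram_bytes("Data", demultiplexed=True, parallel=False, version=4) / 1e9
print("~%.1f GB RAM suggested" % gb)

(As a rough sanity check against your numbers: a single 950 GB undemultiplexed file would suggest something on the order of 950/3, roughly 317 GB of RAM, under these rules.)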



For Hard Drive/Disk Space:

This is a little more difficult to predict.  It will depend on a number of factors, including the number of loci (enzymes that are common cutters tend to produce more loci, and therefore larger files, than less frequent cutters), the number of individuals, sequencing depth, etc.  Average runs we do generally require something in the range of 5-20 GB of free space for the temporary files created during the run.  I'd say if you have 100-200 GB of free space, you're probably safe, even with your large dataset, but again, no promises.
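
If you want to confirm the free space up front before starting a run, a quick generic Python 3 check (nothing AftrRAD-specific) is:

import shutil

# Free disk space on the drive where the run's temporary files will be written.
# The 100 GB threshold follows the rough guideline above; adjust to taste.
free_gb = shutil.disk_usage(".").free / 1e9
print("%.0f GB free" % free_gb)
if free_gb < 100:
    print("Warning: less than 100 GB free; a large run may fail partway through.")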

Note that if you run short on RAM, you will likely know because the computer will get very slow/freeze, and the run won't finish (at least not in an amount of time you'll be happy with).  In contrast, if you run out of hard disk space during a run, this will likely lead to errors/warnings popping up that are difficult to explain otherwise.

  
Mike