Too many files in read_partitions, exceeding disk quota


Matt Stata

Mar 13, 2015, 12:30:58 PM
to trinityrn...@googlegroups.com
Hi everyone,

I am trying to run Trinity on a server and assemble a few transcriptomes at once.  However, I have a file-number limit of 1 million, and Trinity keeps exceeding it with the massive number of files written to read_partitions.  I could archive and delete these files after each Trinity run completes, but that means running one assembly at a time rather than in parallel, which would take quite a while.  Is there some way to suppress or modify this output behavior so that Trinity can run on this server?
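For what it's worth, the archive-and-delete step can be scripted so that it runs the moment each assembly finishes; here is a minimal sketch (not Trinity's own tooling, and all paths are illustrative stand-ins for a real run's output):

```shell
# Sketch: collapse the read_partitions directory (often hundreds of
# thousands of small files) into a single tarball to recover inodes.
# The directory and file names below are illustrative stand-ins.
RUN_DIR=trinity_out_dir
mkdir -p "$RUN_DIR/read_partitions/Fb_0/CBin_0"
touch "$RUN_DIR/read_partitions/Fb_0/CBin_0/c0.trinity.reads.fa"

# Archive, then remove the originals only if the tar succeeded.
tar -czf "$RUN_DIR/read_partitions.tar.gz" -C "$RUN_DIR" read_partitions \
  && rm -rf "$RUN_DIR/read_partitions"
```

Run from a wrapper script, this could fire right after each assembly's Trinity process exits, so parallel runs release their inodes as they finish.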

Thanks for any help you can offer.

Dan Browne

Mar 13, 2015, 12:40:04 PM
to trinityrn...@googlegroups.com
How convenient that you would post this - I ran into the exact same problem this morning! My disk quota is 50,000 files and I just put in a request to bump it up to 500,000. But clearly, if you have a limit of 1,000,000, then my requesting 500,000 will probably be insufficient.

I am wondering, did you try the --grid_conf option? Or is the read_partition process independent of --grid_conf? I used --grid_conf myself, though I'm not sure it's even doing anything: I haven't gotten any specific errors from it, but I also haven't seen any jobs created by it. Who knows, though; maybe the jobs just don't appear in the "bjobs -u" list.

Matt Stata

Mar 13, 2015, 1:18:08 PM
to trinityrn...@googlegroups.com
Hi Dan,

I only exceed the disk quota because I am trying to run a few (~6) Trinity assemblies at once, so my quota is hit before any post-processing steps (tarballing the read_partitions, for example) get a chance to run.  For a single assembly, 500,000 would probably be okay for you.

I have not tried the --grid_conf option; I will look into it now.

If anyone else knows of an option to reduce the number of output files, please help us! :-)

Cheers,
Matt

Ken Field

Mar 13, 2015, 1:51:18 PM
to Matt Stata, trinityrn...@googlegroups.com
I was able to get around this limitation by running the Trinity process within the scratch directory, which does not have any file limit on my system. I'm not sure if this option is available to you.
Ken

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.



--
Ken Field, Ph.D.
Associate Professor of Biology
Program in Cell Biology/Biochemistry
Bucknell University
Room 203A Biology Building

Matt Stata

Mar 13, 2015, 1:53:34 PM
to trinityrn...@googlegroups.com, matt...@gmail.com
Hi Ken,

On the server I use, the scratch folder IS the one with the 1M-file limitation, so unfortunately that's not an option.

Matt



Brian Haas

Mar 13, 2015, 2:08:39 PM
to Matt Stata, trinityrn...@googlegroups.com
Hi all, 

As we know, Trinity generates a boatload of files, and this is something that clearly needs to be properly dealt with soon.

In the meantime, the latest release (2.0.6) and a few earlier patches should be better at file management than earlier releases.  You'll still generate a large number of targeted read sets, but the Trinity run on each read set should be cleaning up after itself and generating just a single output file.  Of course, if you have a million read partitions, then you'll still end up with a million or more intermediate files being generated.

The number of partitioned read sets can grow exponentially as you reduce the minimum contig length.  The default min contig length is 200. If you're going below 100, the issue will be greatly exacerbated.

You can counteract this by increasing both the minimum contig length and the minimum k-mer coverage (--min_contig_length and --min_kmer_cov). However, both will reduce sensitivity for reconstructing parts of lowly covered (lowly expressed) transcripts.  Of course, this also depends on how deeply you've sequenced.

My final bit of advice is to *always* run --normalize_reads when you have a lot of reads (a hundred million or more). It'll run faster, generate fewer files, and cause fewer headaches. ;)
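Putting those flags together, here is a minimal sketch of an invocation (the read-file names and resource flags are placeholders, and exact flag availability varies by Trinity release, so check your version's usage output):

```shell
#!/bin/sh
# Sketch of a Trinity invocation combining the suggestions in this thread.
# reads_1.fq / reads_2.fq and the resource flags are placeholders.
CMD="Trinity --seqType fq --left reads_1.fq --right reads_2.fq \
 --max_memory 50G --CPU 8 \
 --min_contig_length 300 \
 --min_kmer_cov 2 \
 --normalize_reads"

if command -v Trinity >/dev/null 2>&1; then
    eval "$CMD"
else
    # Trinity not on PATH here; just show what would be run.
    echo "would run: $CMD"
fi
```

Raising --min_contig_length above the default of 200 and --min_kmer_cov above the default of 1 trades some sensitivity on low-coverage transcripts for far fewer read partitions, as described above.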

best,

~brian






--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Matt Stata

Mar 13, 2015, 2:11:27 PM
to trinityrn...@googlegroups.com, matt...@gmail.com
Hi Brian,

Thank you!  This is excellent advice; I will give it a shot right now.  One quick question: what would you recommend for the minimum k-mer coverage?

Cheers,
Matt
Brian Haas

Mar 13, 2015, 2:14:29 PM
to Matt Stata, trinityrn...@googlegroups.com
The default is 1.  You can go to 2, and that should have a big impact.  I'd hesitate to go much higher unless you have a very deep data set, and in that case you'd be better off doing the normalization anyway.

best,

~b


Matt Stata

Mar 16, 2015, 8:54:47 AM
to trinityrn...@googlegroups.com
Thanks so much for your help, Brian.  The advice you gave got me down to more like 100-200k files and allowed me to run a few assemblies in parallel without exceeding my 1M-file limit.  Thanks!

Brian Haas

Mar 16, 2015, 9:38:47 AM
to Matt Stata, trinityrn...@googlegroups.com
excellent!  thanks for the update,

~b


Martin Smith

Jun 15, 2015, 5:32:44 AM
to trinityrn...@googlegroups.com, matt...@gmail.com
On Saturday, 14 March 2015 05:08:39 UTC+11, Brian Haas wrote:
[...] You'll still generate a large number of targeted read sets, but the Trinity run on each read set should be cleaning up after itself and generating just a single output file. [...]

Must the "--full_cleanup" option be set for this to occur? I'm interested in comparing the effect of different Butterfly parameters on my assembly, so I would like to keep the output of Chrysalis. However, I keep running into a file-quota limit, where each run generates millions of files.
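While waiting on an answer, a quick way to watch how close a run is getting to the quota is to count files periodically and check inode headroom (the directory name below is a placeholder):

```shell
# Sketch: check how many files a run has produced so far, and the inode
# usage of the filesystem holding it. trinity_out_dir is a placeholder.
RUN_DIR=trinity_out_dir
mkdir -p "$RUN_DIR"              # stand-in; point at your real output dir

n_files=$(find "$RUN_DIR" -type f | wc -l)
echo "files under $RUN_DIR: $n_files"

# Inode usage (file-count quota) for the filesystem holding the run.
df -i "$RUN_DIR" | tail -n 1
```

This can be dropped into a cron job or watch loop so a run can be paused before the quota is actually hit.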

Brian Haas

Jun 15, 2015, 9:26:09 AM
to Martin Smith, trinityrn...@googlegroups.com, Matt Stata
Hi Martin,

It's mostly useful to experiment with the Butterfly parameters when exploring very specific genes/graphs.  If you're looking at bulk statistics, you're not going to see much difference, so I'm thinking it's probably not worth your while.  Others might comment here as well based on any experiences they've had.

best,

~brian

