Assembling huge fastq files


Samuel Abalde

Aug 19, 2016, 7:45:52 AM
to trinityrnaseq-users
Hi all,

I think this issue has been raised and solved many times, but after reading all those threads I wasn't able to fix my problem.

The situation: I'm running an experiment with 13 individuals, 2 tissues each. I want to analyse differential expression (DE), so following several papers and threads I decided to merge all the read files and assemble them together. The result is two paired-end files of 230G each, and assembling them takes far too long (beyond my deadline). I thought some of you might have advice on improving this.

Statistics:
===========
Trinity Version:      trinityrnaseq_r20140717
Compiler:             GCC
Trinity Parameters:   --seqType fq --JM 40G --CPU 32 --inchworm_cpu 32 --bflyCPU 16 --left /gpfs/csic_users/saabalde/scratch/merged_reads/all_reads_1.fastq.gz --right /gpfs/csic_users/saabalde/scratch/merged_reads/all_reads_2.fastq.gz --output /gpfs/csic_users/saabalde/scratch/merged_reads/trinity/


I know I should normalize my reads to improve memory usage, but that process returns an error message (I read in another thread that one of the files may be corrupted, but it works perfectly), and I thought it would finish in time anyway.

I thought it was running better than it really is. I'm stuck in the Inchworm step, reading the k-mers. Here is a short time table:

kmers    running time
454M     0-03:27:00
1610M    0-05:38:48
1610M    0-07:52:48
1610M    0-10:23:42
1610M    0-21:00:20
1621M    3-20:55:42
1625M    4-01:07:23

My time limit arrives after 30 days (26 days left).

Does anybody have any suggestions? Do you think it will finish in time?

Many thanks in advance, and I'm sorry for repeating an issue that has been solved so many times.
Samu

Samuel Abalde

Aug 19, 2016, 7:52:11 AM
to trinityrnaseq-users
By the way, in case this information is useful: the jellyfish.kmers.fa file contains 4,527,767,535 sequences.

samu

Brian Haas

Aug 19, 2016, 8:08:24 AM
to Samuel Abalde, trinityrnaseq-users
Hi

Setting --min_kmer_cov 2 should help dramatically in this case.  I only recommend it here given the performance issues you're encountering.
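For concreteness, a sketch of what the adjusted invocation would look like (the command is echoed here rather than executed; the long /gpfs paths from the first post are shortened, so treat the file names as placeholders):

```shell
# The poster's original parameters plus --min_kmer_cov 2, which discards
# singleton k-mers (mostly sequencing errors) before Inchworm builds
# contigs. Echoed only -- adjust paths before actually running it.
TRINITY_CMD="Trinity --seqType fq --JM 40G --CPU 32 --inchworm_cpu 32 --bflyCPU 16 \
--min_kmer_cov 2 \
--left all_reads_1.fastq.gz --right all_reads_2.fastq.gz \
--output trinity_out/"
echo "$TRINITY_CMD"
```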

-Brian
(by iPhone)

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Samuel Abalde

Aug 19, 2016, 8:24:15 AM
to trinityrnaseq-users

Ok, I'll do that and I'll let you know. Many thanks, Brian.

Samu

Mark Chapman

Aug 19, 2016, 11:21:11 AM
to Samuel Abalde, trinityrnaseq-users

Hi Samu,
You can normalise each of your libraries individually and then normalise all of these together. If one file is corrupt, its reads simply won't go into your final assembly; but are you expecting anything unique in that library? Not ideal, just suggesting a workaround.
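A dry-run sketch of that per-library workaround (commands are echoed, not executed): normalize each of the 13 libraries on its own, then run one final pass over the pooled normalized reads. The sampleNN file names are illustrative; insilico_read_normalization.pl ships in Trinity's util/ directory, but double-check its flags against your version's usage output.

```shell
# Echo one normalization command per library, then a final combined pass.
NORM="insilico_read_normalization.pl"
N=0
for i in $(seq -w 1 13); do
    echo "$NORM --seqType fq --JM 40G --max_cov 50 \
--left sample${i}_1.fastq.gz --right sample${i}_2.fastq.gz \
--pairs_together --output norm_sample${i}/"
    N=$((N + 1))
done
# Final pass over the concatenated per-library normalized reads:
echo "$NORM --seqType fq --JM 40G --max_cov 50 \
--left all.norm_1.fq --right all.norm_2.fq --pairs_together"
```

A side benefit, as it turned out later in this thread: running the libraries one at a time also pinpoints whether any single file is actually corrupted.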
Best, Mark



Samuel Abalde

Aug 19, 2016, 12:07:46 PM
to trinityrnaseq-users
Thanks Mark, that's actually a good idea, and I'm a bit embarrassed for not thinking of it before. Trinity is now running after setting --min_kmer_cov, so I'm going to let it finish and see what happens. Anyway, I have more analyses to do with other data, so I'll use your approach too.

Thanks again,
Samu

Samuel Abalde

Aug 20, 2016, 7:37:06 AM
to trinityrnaseq-users
Update

I've solved the k-mers issue by setting --min_kmer_cov, but now I have another problem:

 done parsing 1363620151 Kmers, 1363620151 added, taking 2356 seconds.


TIMING KMER_DB_BUILDING
2356 s.
Pruning kmers (min_kmer_count=1 min_any_entropy=0 min_ratio_non_error=0.05)
Pruned 31493835 kmers from catalog.
       
Pruning time: 2978 seconds = 49.6333 minutes.


TIMING PRUNING
2978 s.
-populating the kmer seed candidate list.
Kcounter hash size: 1363620151
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
sh: line 1: 20180 Aborted                 /gpfs/res_projects/apps/TRINITY_RNA_SEQ/r2014-07-17/Inchworm/bin/inchworm --kmers jellyfish.kmers.fa --run_inchworm -K 25 -L 25 --monitor 1 --DS --keep_tmp_files --num_threads 64 --PARALLEL_IWORM > /gpfs/res_scratch/cvcv/saabalde/merged_reads/trinity/inchworm.K25.L25.DS.fa.tmp
Error, cmd: /gpfs/res_projects/apps/TRINITY_RNA_SEQ/r2014-07-17/Inchworm/bin/inchworm --kmers jellyfish.kmers.fa --run_inchworm -K 25 -L 25 --monitor 1  --DS  --keep_tmp_files  --num_threads 64  --PARALLEL_IWORM  > /gpfs/res_scratch/cvcv/saabalde/merged_reads/trinity/inchworm.K25.L25.DS.fa.tmp died with ret 34304 at /gpfs/res_apps/TRINITY_RNA_SEQ/r2014-07-17/Trinity line 1990.






If it indicates bad_alloc(), then Inchworm ran out of memory.  You'll need to either reduce the size of your data set or run Trinity on a server with more memory available.


** The inchworm process failed.

I'm running out of memory even after increasing the number of available CPUs to 64 across 4 nodes. I guess I have no choice other than normalizing my reads. Do you think that would fix it? And do you know how much memory I might need (so I can see whether I can get it)?

Thanks again,
Samu

Brian Haas

Aug 20, 2016, 7:43:47 AM
to Samuel Abalde, trinityrnaseq-users
Hi Samu,

We recommend having ~1G RAM per ~1M PE reads.  Normalization is definitely recommended and will be a default setting in the next trinity release.
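That rule of thumb translates into a quick back-of-the-envelope check (a sketch; the helper function name is mine, and the read count fed to it is the one reported later in this thread after normalization):

```shell
# ~1 GB of RAM per ~1 million paired-end reads, rounded up to the next GB.
ram_needed_gb() {
    reads=$1
    echo $(( (reads + 999999) / 1000000 ))
}
ram_needed_gb 61589535    # ~61.6M PE reads -> 62 (GB)
```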

If you want, you can try our Galaxy service:

Just upload your data, run Trinity, and download your resulting assembly. 

best,

~b




--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Samuel Abalde

Aug 25, 2016, 4:38:53 AM
to trinityrnaseq-users
Update.

Good morning Brian and Mark,
I wanted to update you after some days of work and tests.

I've followed Mark's advice and normalized all the read files individually, which allowed me to see that none of my files is corrupted. I've merged them again and now normalization works perfectly.
I also decided to run Trimmomatic in order to reduce the number of reads, but it made no difference because I had already filtered them and no new reads were removed.

After doing this, I have 61,589,535 paired-end reads (~61.6M) and 256G of RAM, which matches Brian's recommendation of ~1G RAM per ~1M PE reads.


I still have 1,059,214,700 k-mers, so I may still have to use --min_kmer_cov 2, but now I expect to be able to handle my sequences.

Many thanks for your help, I'll let you know how things are doing.
Samu

Brian Haas

Aug 25, 2016, 9:24:39 AM
to Samuel Abalde, trinityrnaseq-users
Thanks for the update. I'm glad to hear that it's going better now.

Note, there is a parameter:

 #  --normalize_by_read_set              run normalization separate for each pair of fastq files,
#                                       then one final normalization that combines the individual normalized reads.
#                                       Consider using this if RAM limitations are a consideration.

that's useful for automating the per-library normalization.  You'll only see this advanced parameter if you use the Trinity --show_full_usage_info flag.

If inchworm does crash with your normalized data, then you could go with --min_kmer_cov 2.  If you get past the inchworm stage, then hopefully it's smooth sailing thereafter.
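Putting the pieces together, an automated per-library-normalization run might look like the sketch below (echoed, not executed). The file names are illustrative; Trinity accepts comma-separated lists for --left and --right, which is what keeps the libraries distinct for --normalize_by_read_set, but verify the flag with Trinity --show_full_usage_info on your version.

```shell
# Echo a Trinity command where normalization runs per read set first,
# then once over the combined normalized reads. Truncated to 3 of the
# 13 libraries for brevity; extend the lists as needed.
CMD="Trinity --seqType fq --JM 40G --CPU 32 \
--normalize_by_read_set \
--left s01_1.fq.gz,s02_1.fq.gz,s03_1.fq.gz \
--right s01_2.fq.gz,s02_2.fq.gz,s03_2.fq.gz \
--output trinity_norm/"
echo "$CMD"
```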


best,

~brian




Samuel Abalde

Aug 30, 2016, 11:11:03 AM
to trinityrnaseq-users

Hi!

Now Trinity is running, but I don't know if this is normal. It looks kind of stuck here:


###################################


---------------------------------------------------------------
------ Quality Trimming Via Trimmomatic  ---------------------
<< LEADING:5 TRAILING:5 MINLEN:36 >>
---------------------------------------------------------------


###############################################################################
#### Trimmomatic  process was previously completed. Skipping it and using existing qual-trimmed files: all_reads_1.fastq.gz.PwU.qtrim.fq, all_reads_2.fastq.gz.PwU.qtrim.fq
###############################################################################
---------------------------------------------------------------
------------ In silico Read Normalization ---------------------
-- (Removing Excess Reads Beyond 50 Coverage --
-- /gpfs/res_scratch/cvcv/saabalde/merged_reads/trinity/insilico_read_normalization --
---------------------------------------------------------------


###############################################################################
#### Normalization process was previously completed. Skipping it and using existing normalized files: /gpfs/res_scratch/cvcv/saabalde/merged_reads/trinity/insilico_read_normalization/left.norm.fq /gpfs/res_scratch/cvcv/saabalde/merged_reads/trinity/insilico_read_normalization/right.norm.fq
###############################################################################
-------------------------------------------
----------- Jellyfish  --------------------
-- (building a k-mer catalog from reads) --
-------------------------------------------


Saturday, August 27, 2016: 12:27:28     CMD: /gpfs/res_projects/apps/TRINITY_RNA_SEQ/r2014-07-17/trinity-plugins/jellyfish/bin/jellyfish count -t 64 -m 25 -s 9792174900  --canonical  both.fa


It's been on this step for more than 3 days already. Is this normal, or is it stuck? It had never taken this long before...

Thanks,
Samu

Brian Haas

Aug 30, 2016, 2:58:01 PM
to Samuel Abalde, trinityrnaseq-users
You'll want to look at what memory is currently being used. If it's gone into swap space, it could be in trouble.
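One quick way to check, assuming the compute node runs Linux (/proc/meminfo is Linux-specific):

```shell
# Print memory and swap figures from /proc/meminfo, converted to GB.
# If SwapTotal minus SwapFree is large and growing while MemAvailable is
# near zero, jellyfish has likely spilled into swap and will crawl.
awk '/^(MemTotal|MemAvailable|SwapTotal|SwapFree):/ {
    printf "%-13s %8.1f GB\n", $1, $2 / 1048576
}' /proc/meminfo
```

Watching these numbers over a few minutes (e.g. with `watch`) distinguishes a slow-but-progressing job from one thrashing in swap.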

~b


Samuel Abalde

Aug 30, 2016, 3:03:36 PM
to trinityrnaseq-users
But how can the RAM be full if I'm using more memory than in previous runs?

Brian Haas

Aug 30, 2016, 3:07:47 PM
to Samuel Abalde, trinityrnaseq-users
I don't know what the specific issue is here. If you decide to run it on our galaxy server, then we can directly look into any issues if they should arise.

Samuel Abalde

Sep 28, 2016, 4:43:40 AM
to trinityrnaseq-users
Update.

Sorry for the late answer. I've been trying different things, but always with the same result. Trinity gets stuck at:

-------------------------------------------
----------- Jellyfish  --------------------
-- (building a k-mer catalog from reads) --
-------------------------------------------


Saturday, August 27, 2016: 12:27:28     CMD: /gpfs/res_projects/apps/TRINITY_RNA_SEQ/r2014-07-17/trinity-plugins/jellyfish/bin/jellyfish count -t 64 -m 25 -s 9792174900  --canonical  both.fa

I'm going to follow your advice; I've already signed up on the Galaxy server and am currently waiting for an answer. Is there another place where we can discuss the issues we encounter there?

Thanks for everything
Samu

Brian Haas

Sep 28, 2016, 8:45:51 AM
to Samuel Abalde, trinityrnaseq-users
Try upgrading Trinity, too.  That's an old version you're using; maybe a newer one will work better for you.

-Brian
(by iPhone)
