rRNA-removal runs a very long time

wouter...@gmail.com

Jan 5, 2017, 9:35:37 AM
to Metatrans Forum
Dear,

First of all, thank you for helping me with the errors MetaTrans gave me at the beginning. With the solutions you offered I managed to get the pipeline to run to completion.
However, I have a new question about the second step (rRNA removal), in which sortmerna is used. When I tested the pipeline a few months ago with one of my samples (paired-end, two fastq files of ca. 12 GB each), the whole pipeline (without differential expression analysis) completed in less than a day (with 20 cores and 58 GB RAM).
Recently I started working with my real samples and found that, with the same settings as before, a paired-end sample (fastq files of ca. 8 GB) takes 8 days or more to complete, with by far the most time spent on rRNA removal. The files with the assigned reads in the folder "interleaved_rrna" grow quickly at the beginning, but this slows down tremendously after some time.
While running my test sample I used the standard sortmerna setting in MetaTrans (5 GB), but when I started on my real samples I tried SORTMERNA_MEMORY=MAX. Since then, rRNA removal has always taken much longer, as mentioned above, even after I put the memory limit back to 5 GB (I also tried lowering it to 1 GB, with the same result).
Do you have any idea why this step takes so much longer now, even though I'm using smaller fastq files than before?
Any help would be greatly appreciated, as it would save me a huge amount of time with all the samples I would like to run.

Thanks in advance.

Best regards,
Wouter

metatr...@gmail.com

Jan 8, 2017, 6:08:36 PM
to Metatrans Forum, wouter...@gmail.com
Hi Wouter,

Great to hear that we could help you. I'm sorry to say that we have never run into this problem, so I don't know exactly how to
help you here; the source of the problem seems difficult to track down. What we can give you are some figures:

Sample1_1.fastq,Sample1_2.fastq -> ~12 GB -> ~24M seq -> ~24% rRNA -> time: ~45 min
Sample2_1.fastq,Sample2_2.fastq -> ~9 GB  -> ~29M seq -> ~5% rRNA  -> time: ~1 h

Configuration in the settings.txt file (run under Ubuntu 14.04):
THREADS=30              # Max threads used.
SORTMERNA_MEMORY=MAX    # uses at most ~12 GB

These are two examples of large files analyzed by MetaTrans. As you can see, they were even bigger than yours and
took at most 1 h. However, we don't know whether files containing >25% rRNA behave very differently.
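
To compare your own samples with the figures above, the read count and the rRNA share can be estimated with a few lines of shell. This is only a rough sketch: the function names are made up for illustration, and it assumes uncompressed fastq files with exactly four lines per read.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of MetaTrans or SortMeRNA.

# Count the reads in an uncompressed fastq file, assuming four lines
# per record (no wrapped sequence lines).
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Percentage of reads assigned to rRNA, given the rRNA read count ($1)
# and the total read count ($2). Prints "NA" if the total is zero.
rrna_percent() {
    awk -v r="$1" -v t="$2" \
        'BEGIN { if (t > 0) printf "%.1f\n", 100 * r / t; else print "NA" }'
}
```

For example, `count_fastq_reads Sample1_1.fastq` gives the number of reads in one mate file, and `rrna_percent <rRNA reads> <total reads>` turns two such counts into a percentage.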

Not sure if this will help, but you could also check the SortMeRNA issues to see whether you can find some help there: https://github.com/biocore/sortmerna/issues

Sorry we cannot provide much more help, greetings

wouter...@gmail.com

Feb 10, 2017, 10:09:59 AM
to Metatrans Forum, wouter...@gmail.com
Hi,

Thank you for this information, and sorry for the late reply. Unfortunately, so far we have not been able to speed up sortmerna.
Looking into it in more detail, we see that (with THREADS=20 in settings.txt) 20 threads are indeed present, but only one is really active. Have you ever noticed something similar?
We tried different memory and thread settings, moving the databases to a different location for better I/O, and compiling our own build of sortmerna v1.9, but none of this made a difference.
Any additional info would be really appreciated, since a runtime of 7+ days, with 6 samples still to go, is a problem with deadlines coming up.

Thanks in advance.

Best regards,
Wouter


On Monday, 9 January 2017 at 00:08:36 UTC+1, metatr...@gmail.com wrote:

metatr...@gmail.com

Feb 13, 2017, 8:30:09 AM
to Metatrans Forum, wouter...@gmail.com
Hi Wouter,

I'm afraid we have never seen that behaviour from SortMeRNA. We ran a small test to check, and using
"htop" (or a task manager, for instance) we could see that all the threads selected in "settings.txt" were in use.
If you check the file "m2-time.tsv", you will find the command launched by the pipeline, and you can "play" with
it independently of the pipeline. This is the command that our pipeline ran for the test:

sortmerna -n 5 \
--db <fullPath>/0-Databases/1-SILVA-23S-28S-LSURef_115_tax_silva.fasta.trimmedwhitespaces.inDNA.fasta \
<fullPath>/0-Databases/1-SILVA-16S-18S-SSURef_115_NR99_tax_silva.trimmedwhitespaces.inDNA.fasta \
<fullPath>/0-Databases/rfam-5s-database-id98.fasta \
<fullPath>/0-Databases/trna_db.fasta \
<fullPath>/0-Databases/phix_db.fasta \
--I 1-PROCESSED_SAMPLES/<sampleName>/2-rRNA-removal/m2-temp/<sampleName>_m2_interleaved.fastq \
--accept 1-PROCESSED_SAMPLES/<sampleName>/2-rRNA-removal/m2-output/interleaved_rrna/irrna \
--other 1-PROCESSED_SAMPLES/<sampleName>/2-rRNA-removal/m2-output/interleaved_mrna/imrna \
--log 1-PROCESSED_SAMPLES/<sampleName>/2-rRNA-removal/m2-log/log_sortmerna \
--bydbs \
-m 3086031 \
--paired-out \
-a 30 \
-v
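
To check from the command line whether all threads are really busy while a command like the one above is running, the per-thread CPU usage reported by `ps` can be counted. The sketch below is only an illustration: `count_busy_threads` is a made-up helper name, the 5% threshold is arbitrary, and the example assumes a single running process called "sortmerna". Note that `ps` reports CPU usage averaged over the lifetime of each thread; `top -H -p <PID>` shows the live per-thread load instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of MetaTrans or SortMeRNA.

# Count the threads whose CPU usage is above a threshold (default 5 %).
# Expects "TID %CPU" pairs on stdin, one thread per line.
count_busy_threads() {
    awk -v min="${1:-5}" '$2 + 0 > min { n++ } END { print n + 0 }'
}

# Example (assumes exactly one process named "sortmerna" is running):
#   ps -L -o tid= -o pcpu= -p "$(pgrep -n sortmerna)" | count_busy_threads 5
```

If this prints 1 while 20 or 30 threads were requested, only one thread is doing most of the work.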

Another thing you could try is running the same command on another OS (such as the one we used, Ubuntu 14.04) in
a virtual machine or Docker. You could also take a subsample of your fastq files (many ways are proposed here: https://www.biostars.org/p/6544/;
remember that the paired order must be maintained) and interleave them with:

merge-paired-reads.sh \
<fullPath>/1-QC/m1-output/<sampleName>_1_m1.fastq \
<fullPath>/1-QC/m1-output/<sampleName>_2_m1.fastq \
<fullPath>/1-PROCESSED_SAMPLES/<sampleName>/2-rRNA-removal/m2-temp/<sampleName>_m2_interleaved.fastq

and use the output as input for sortmerna to see whether it works with the subsample.
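
One simple way to get a subsample that keeps the paired order is to take the same number of reads from the top of both mate files. This is a minimal sketch, not the only option: `subsample_pairs` is a made-up name, it assumes uncompressed fastq files with four lines per read, and the first reads of a run are not necessarily representative of the whole sample (the Biostars thread covers random sampling).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of MetaTrans or SortMeRNA.

# Write the first N read pairs of two mate files to two new files.
# Taking the same number of records from the top of both files keeps
# the mates in the same order.
# Usage: subsample_pairs N in_1.fastq in_2.fastq out_1.fastq out_2.fastq
subsample_pairs() {
    local n="$1" in1="$2" in2="$3" out1="$4" out2="$5"
    head -n "$((n * 4))" "$in1" > "$out1"
    head -n "$((n * 4))" "$in2" > "$out2"
}
```

The two output files can then be interleaved with merge-paired-reads.sh as shown above and passed to sortmerna.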

Hope it helps