SOAPdenovo-Trans crashing on very large RNA-Seq data set


Joe Carl

Aug 20, 2013, 2:10:14 PM
to bgi-...@googlegroups.com

Background:
I'm working on a large transcriptome project with very high coverage. We have gathered samples at many time points, from many tissues, all from the same genome. We ran one sample per lane (TruSeq LT libraries). The reads are paired-end, and each pair's fastq file is on the order of 40 GB per time point. The reference for this species is suspected to contain many errors, so we would like to perform a de novo transcriptome assembly of all RNA species (not just mRNA).

GOAL:
Assemble a de novo transcriptome for a single time point from paired-end reads, each fastq file being ~40 GB (80 GB total).

Problem:
After adapter trimming and quality filtering, I executed the following command:

SOAPdenovo-Trans-127mer all -s config.insitu-u -K 35 -o Out_Insitu

where config.insitu-u looks like:

#maximal read length
max_rd_len=101
[LIB]
#use only the first 75 bp of each read
rd_len_cutof=75
#average insert size of the library
avg_ins=280
#0 = forward-reverse read orientation (do not reverse-complement)
reverse_seq=0
#3 = use the reads for both contig and scaffold assembly
asm_flags=3
#minimum aligned length for a reliable read location
map_len=32
q1=in_situ_R1.fastq
q2=in_situ_R2.fastq

I am using an Amazon EC2 instance of type m2.4xlarge, which should give me 68.4 GB of RAM, 8 CPUs, and 700 GB of disk.

The program starts the pregraph construction successfully but fails when it reaches the 100,000,000th read.

Questions:
What do I need to do to make it complete its processing? Is there a strategy to partition the reads, run SOAPdenovo-Trans on each partition, and then join the results back up? Is there a workflow that explains how to do this?

Ultimate Question:
I would like to do a de novo assembly using the entire sample set. Is there some way to parallelize the processing?

Joe


Ruibang Luo

Aug 22, 2013, 12:19:26 AM
to <bgi-soap@googlegroups.com>
1. Please try the 63mer version instead of the 127mer version. With -K 35 you don't need the 127mer build, and the 63mer build uses less memory per k-mer.
2. Considering your data volume, I don't think you have enough memory; see the rough estimate below. What is the exact error message you got when the program terminated?
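
As a very rough back-of-envelope (every number below is an illustrative assumption, not a measured SOAPdenovo figure), the erroneous k-mers alone can demand far more than 68 GB:

    # Back-of-envelope k-mer memory estimate; all numbers are illustrative guesses.
    read_len = 101        # read length from the config
    k = 35                # chosen k-mer size
    err_rate = 0.01       # assumed per-base sequencing error rate
    bases = 40e9          # ~80 GB of fastq holds very roughly 40 Gbp of bases

    reads = bases / read_len
    # Each sequencing error can create up to k novel k-mers, and at this scale
    # novel erroneous k-mers dominate the de Bruijn graph.
    error_kmers = reads * read_len * err_rate * k
    bytes_per_node = 32   # assumed per-node graph overhead, order of magnitude
    print(f"~{error_kmers:.1e} erroneous k-mers, "
          f"~{error_kmers * bytes_per_node / 2**30:.0f} GiB for them alone")

Under these assumptions the graph would need hundreds of GiB, far beyond an m2.4xlarge.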

Ruibang

Sent from my iPhone

Manoj Samanta

Aug 22, 2013, 12:46:38 AM
to bgi-...@googlegroups.com
On 8/20/13, Joe Carl <joseph.w...@gmail.com> wrote:
>
>
> Questions:
> What do I need to do to make it complete its processing? Is there a
> strategy to partition the reads, run SOAPdenovo-Trans on each partition,
> and then join the results back up?
>


Partitioning reads is not a good strategy for assembly. It reduces the
coverage within each partition and therefore produces poor assemblies of
low-coverage genes, as the toy calculation below illustrates.
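
To make that concrete (the coverage numbers here are hypothetical):

    # Toy illustration: splitting reads into partitions dilutes per-partition
    # coverage; all numbers are hypothetical.
    total_coverage = 12.0    # overall coverage of a low-expression transcript
    min_needed = 5.0         # assumed minimum coverage an assembler needs

    for n_parts in (1, 2, 4, 8):
        per_part = total_coverage / n_parts
        verdict = "assembles" if per_part >= min_needed else "too low"
        print(f"{n_parts} partition(s): ~{per_part:.1f}x each -> {verdict}")

A transcript at 12x overall assembles fine, but at four partitions each sees only ~3x and drops out.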

Joe Carl

Aug 22, 2013, 9:27:56 AM
to bgi-...@googlegroups.com
If I don't have enough memory, and I'm already using a lot, then I may have reached the memory limit that SOAPdenovo-Trans can handle. That seems inevitable for extremely large data sets, especially ones from the same individual that keep growing: the transcriptomes of cell type 1, 2, 3, and so on are all subsets of the full transcriptome no matter what time point you sample, so the data set can get bigger and bigger.

Meaning sooner or later you will hit the upper limit of SOAPdenovo-Trans. What are you supposed to do when you hit that limit?

If you can't use smaller partitions of the data set to keep the size within SOAPdenovo-Trans's limits, how do you deal with the problem?

I believe a solution that allows partitioning must be feasible and should be developed. Does anyone know how to parallelize this process so we don't hit the application's upper limit?

Manoj Samanta

Aug 22, 2013, 1:12:16 PM
to bgi-...@googlegroups.com
Here is how these assemblers work. The program builds a giant graph of
k-mers and then gradually resolves that graph into contiguous sequences.
When you have a very large amount of data from one organism, only a small
fraction of the k-mers come from the organism itself; a large fraction
come from sequencing errors. The number of k-mers from the genome (the
real k-mers) saturates as you add more reads, because the genome is of
fixed size, while the number of erroneous k-mers keeps growing with every
additional read. The toy model below illustrates the difference.
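
A toy model of the two growth curves (every parameter here is made up for illustration):

    import math

    # Toy model: genuine k-mers saturate, erroneous k-mers grow linearly.
    genuine_total = 1e8      # assume ~100 Mbp of distinct transcript sequence
    k, read_len = 35, 101
    err_rate = 0.01          # assumed per-base error rate

    for reads in (1e6, 1e7, 1e8, 1e9):
        kmers_sampled = reads * (read_len - k + 1)
        # Coupon-collector-style saturation: once every genuine k-mer has
        # been seen, more reads add nothing new.
        genuine_seen = genuine_total * (1 - math.exp(-kmers_sampled / genuine_total))
        # Each error can spawn up to k novel k-mers, so errors grow linearly.
        error_kmers = reads * read_len * err_rate * k
        print(f"{reads:.0e} reads: genuine ~{genuine_seen:.2e}, "
              f"erroneous ~{error_kmers:.2e}")

Past a few tens of millions of reads, the erroneous k-mers overtake the genuine ones and keep growing, which is why more data makes the graph bigger, not better.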

There are several ways to handle the problem -

(i) SOAPdenovo2 has a sparse pregraph module that performs the memory-
intensive stage of the assembly with much less RAM. I do not know whether
BGI has ported it to SOAPdenovo-Trans, but that would be the first place
to look.

(ii) Titus Brown's group wrote a pre-filter (digital normalization, in
their khmer package) that removes redundant and erroneous reads and feeds
the assembler a much smaller data set. That could be a better way of
"partitioning" than arbitrarily removing reads; a sketch of the idea
follows this list.

(iii) Rayan Chikhi wrote a low-memory pregraph assembler (Minia) that
works with a very small amount of RAM, essentially borrowing ideas from
(ii). Rayan and his collaborators are writing a transcriptome assembler
based on it, but I do not think it will be ready this week.
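
For reference, the core idea behind (ii) fits in a few lines. This is only a sketch: the real khmer tool uses probabilistic counting to keep memory bounded, whereas this toy uses an exact Counter, and the function names here are mine, not khmer's.

    from collections import Counter

    def kmers(seq, k):
        # All k-length substrings of a read.
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def digital_normalization(reads, k=20, cutoff=20):
        # Keep a read only while the median abundance of its k-mers,
        # counted over the reads kept so far, is below the cutoff; once a
        # region is covered ~cutoff times, further redundant reads are
        # discarded.
        counts = Counter()
        for read in reads:
            kms = kmers(read, k)
            if not kms:
                continue
            median = sorted(counts[km] for km in kms)[len(kms) // 2]
            if median < cutoff:
                for km in kms:
                    counts[km] += 1
                yield read

    # usage: survivors = list(digital_normalization(["ACGTACGT...", ...]))

Feeding only the surviving reads to SOAPdenovo-Trans shrinks the graph without the coverage loss that arbitrary partitioning causes.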