Here is how these assemblers work. The program builds a giant graph of
k-mers and then gradually resolves that graph into contiguous sequences.
When you have a very large amount of data from an organism, only a small
part of the distinct k-mers come from the organism, while a large part
come from sequencing errors. The number of k-mers from the genome (the
real k-mers) saturates with more reads, because the genome is of fixed
size; the number of erroneous k-mers keeps going up with more reads.
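You can see the effect in a toy simulation (the genome size, read
length, and error rate below are made-up illustrations, not numbers
from any real dataset or assembler):

import random

random.seed(0)
K = 21
READ_LEN = 100
ERROR_RATE = 0.01      # hypothetical per-base substitution rate
GENOME_SIZE = 10_000   # hypothetical toy genome

genome = "".join(random.choice("ACGT") for _ in range(GENOME_SIZE))
true_kmers = {genome[i:i + K] for i in range(len(genome) - K + 1)}

def read_with_errors():
    # Sample a read from a random position and sprinkle in substitutions.
    start = random.randrange(len(genome) - READ_LEN + 1)
    bases = list(genome[start:start + READ_LEN])
    for i, b in enumerate(bases):
        if random.random() < ERROR_RATE:
            bases[i] = random.choice([c for c in "ACGT" if c != b])
    return "".join(bases)

seen = set()
for n in range(1, 50_001):
    read = read_with_errors()
    seen.update(read[i:i + K] for i in range(READ_LEN - K + 1))
    if n % 10_000 == 0:
        real = len(seen & true_kmers)
        print(f"{n:6d} reads: {real:5d}/{len(true_kmers)} real k-mers, "
              f"{len(seen) - real:7d} erroneous k-mers")

With these settings the count of real k-mers levels off near the genome
size, while the erroneous k-mers keep growing roughly linearly with
coverage - which is exactly why the graph-building stage runs out of
RAM on deep datasets.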
There are several ways to handle the problem:
(i) SOAPdenovo2 has a sparse pregraph module that does the
memory-intensive stage of the assembly with less RAM. I do not know
whether BGI ported it to SOAPdenovo2-trans, but that would be the first
place to look.
(ii) Titus Brown's group wrote a pre-filter (digital normalization, in
their khmer package) that discards redundant and erroneous reads and
feeds the assembler a smaller amount of data. That could be a better
way of partitioning the data than arbitrarily removing reads - see the
sketch after this list.
(iii) Rayan Chikhi wrote a pregraph assembler (Minia) that works with a
very small amount of RAM, essentially borrowing ideas from (ii). Rayan
and his collaborators are building a transcriptome assembler on top of
it, but I do not think it will be ready by this week.
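For a concrete picture of what the pre-filter in (ii) does, here is a
minimal sketch of the digital-normalization idea. The K and CUTOFF
values are illustrative, and the exact Counter stands in for the
probabilistic count-min sketch that the real khmer tools use:

from collections import Counter
from statistics import median

K = 21
CUTOFF = 20   # illustrative coverage cutoff, not khmer's default

counts = Counter()

def keep_read(seq):
    # Digital-normalization rule: if the read's k-mers have already
    # been seen about CUTOFF times (by median), the read adds little
    # new information, so drop it.
    if len(seq) < K:
        return False
    kmers = [seq[i:i + K] for i in range(len(seq) - K + 1)]
    if median(counts[k] for k in kmers) >= CUTOFF:
        return False
    for k in kmers:
        counts[k] += 1
    return True

def normalize(reads):
    # Stream reads through the filter; what survives goes to the assembler.
    return (r for r in reads if keep_read(r))

Because high-coverage regions stop contributing reads once they hit the
cutoff, the data volume (and with it the erroneous k-mer count) shrinks
a lot while the real k-mers are still all represented.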