Some trouble with SOAPdenovo, my assembly and the BGI output

320 views
Skip to first unread message

Andrea Strazzulli

unread,
Jul 15, 2013, 6:36:41 AM7/15/13
to bgi-...@googlegroups.com
Hi there,
this is my first post in BGI-SOAP Group but I looked the community from long time.
I started these last 6 month to work with metagenomics using the sequencing service and analysis of BGI of some samples (formally I'm a biochemist with good computer skills) .

My problem is that to learn about the assembly strategies i tried to reproduce the outpout of BGI on my first sample using SOAPdenovo with the k-mer values that are indicated in the final report to check the best N50 and N90 score, the best length etc.

So I prepared my SOAP_config_file with the illumina_clean_reads:

Quote:
---------
#maximal read length
max_rd_len=90
[LIB]
#average insert size
avg_ins=170
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 100 bps of each read
rd_len_cutoff=90
#in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#a pair of fastq file, read 1 file should always be followed by read 2 file
q1=clean_read_1.fq
q2=clean_read_2.fq
---------

BGI (testing 21<kmer<63) report that for this sample the best was kmer=33 with 
Quote:
sequence n°: 1543
total length: 6169928
max length: 126110
min length: 500
N50: 15863
N90: 1181
Then I run SOAPdenovo-63mer with 21<kmer<63.

My statistics however for the same samples with kmer=33 showed:
Quote:
sequence n°: 188926
total length: 16591907
max length: 15246
min length: 34
N50: 74
N90: 42
Then I applied a filter to remove the sequences with less than 500bp and the result was:

Quote:
sequence n°: 2529
total length: 3094882
max length: 15246
min length: 500
N50: 1384
N90: 584
Somebody know why these results are so different?

I tried to run SOAPdenovo both in step-by-step mode and in single command but the result does not change and the same differences are present with the other kmer values if compared with the BGI comparison of assembly result on different kmer.

Ruibang Luo

unread,
Jul 15, 2013, 11:03:34 PM7/15/13
to bgi-...@googlegroups.com
Did you run GapCloser on the genome?


Ruibang


--
You received this message because you are subscribed to the Google Groups "BGI-SOAP" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bgi-soap+u...@googlegroups.com.
To post to this group, send email to bgi-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bgi-soap/0fb6aeea-c5b8-4543-a5e2-b0e602aa5b65%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Andrea Strazzulli

unread,
Jul 16, 2013, 6:08:07 AM7/16/13
to bgi-...@googlegroups.com, luoru...@genomics.org.cn
My problem is at contigs level, but I understand that Gapcloser improve just the scaffolding output.

Right?

Ruibang Luo

unread,
Jul 16, 2013, 6:27:23 AM7/16/13
to Andrea Strazzulli, bgi-...@googlegroups.com
After Gapcloser, many adjacent contigs gapped with ambiguous base N were connected into a singled contig.

So the contig length you've mentioned should be evaluated after gapcloser.

In detail, divide each scaffold into one or more contigs by remove character N amid.
Then, for all the resulting contigs, remove those length under 100bp.

Ruibang

On 2013-7-16, at 下午6:08, Andrea Strazzulli <andreast...@gmail.com> wrote:

> My problem is at contigs level, but I understand that Gapcloser improve
> just the scaffolding output.
>
> Right?
>
> Il giorno martedì 16 luglio 2013 05:03:34 UTC+2, Ruibang Luo ha scritto:
>>
>> Did you run GapCloser on the genome?
>>
>>
>> Ruibang
>>
>>
>> On 2013-7-15, at 下午6:36, Andrea Strazzulli <andreast...@gmail.com<javascript:>>
>> email to bgi-soap+u...@googlegroups.com <javascript:>.
>> To post to this group, send email to bgi-...@googlegroups.com<javascript:>
>> .

Andrea Strazzulli

unread,
Jul 17, 2013, 3:58:00 AM7/17/13
to bgi-...@googlegroups.com
The statistics by BGI (my reference values) are referred to "contigs", then when I check about my statistic values of the assembly performed with -K 33 I look the N50 and N90 of the .contig file.

Moreover yesterday I tried to use an other option in SOAPdenovo "-d 2" [quote](KmerFreqCutoff: kmers with frequency no larger than KmerFreqCutoff will be deleted)[/quote] and my N50/ N90 have passed from 1384/584 to 3531/733 that maybe is a nice result but far from the value reported from BGI 15863/1181.

There is in general a "conventional value" used for  "-d"?

Cheers.
Andrea

Manoj Samanta

unread,
Jul 17, 2013, 9:44:12 AM7/17/13
to bgi-...@googlegroups.com
Whether you take two ends of paired end reads as two unrelated reads
or one long read with a small gap in between can change N50, because
you are effectively dealing with 225 nt gapped read instead of two 100
nt reads. Longer reads --> longer contigs.

I do not know whether that is the source of the difference in your
case, but it is worth checking. The difference will come at the contig
level, because the PE gap can be easily filled up to make contigs
longer. Possibly Ruibang is suggesting that, when he mentioned
gapcloser.
> --
> You received this message because you are subscribed to the Google Groups
> "BGI-SOAP" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to bgi-soap+u...@googlegroups.com.
> To post to this group, send email to bgi-...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/bgi-soap/61b1e5ff-3c0a-46aa-95a5-ac5b9ce8d067%40googlegroups.com.

Andrea Strazzulli

unread,
Jul 18, 2013, 7:02:20 AM7/18/13
to bgi-...@googlegroups.com
I agree, in fact the fasta file from BGI shows several  "NNNNN". Moreover, when I applied GapCloser the N50 increased but always less than BGI value.

In addition, because I had previously changed the ask-flag value to 1, I tried to apply the scaffolding step using the value "ask-flag=3" instead of 1 whit:

$SOAPdenovo-63mer all -s config -R -d 4 -K 33 -o prefix 1>ass.log 2>ass.err

and the N50 value passed from 6627 (ask-flag=1) up to 20600 (ask-flag=3).

Now I'm running SOAPdenovo with "-d 3 -K 33" and ask-flag=3.
Reply all
Reply to author
Forward
0 new messages