a question about -unitigs.fa and -scaffolds.fa

Xiaofei Wang

Aug 18, 2015, 2:09:27 PM
to ABySS
Dear folks,

I see "The *-unitigs is the assembly without using any paired end or mate pair information, and the *-contigs.fa is the next step, using paired end information. The *-scaffolds.fa is your scaffolded contigs".

The stats for my *-unitigs.fa, *-contigs.fa, and *-scaffolds.fa are below. As I understand it, the difference in total length between *-unitigs.fa and *-scaffolds.fa should equal the number of Ns in *-scaffolds.fa. Am I right? But when I counted the Ns in *-scaffolds.fa, there were only 509. What am I missing?

Also, as I understand it, the scaffolding step uses paired-end/mate-pair information to join the contigs together, right? If so, there should be 99 gaps filled with Ns in my *-scaffolds.fa (316 unitigs - 217 scaffolds = 99), but I don't think there are that many. Where am I going wrong?

*-unitigs.fa
sequence #: 316 total length: 4678712   max length: 236245      N50: 34091      N90: 10761
*-contigs.fa
sequence #: 221 total length: 4776303   max length: 236245      N50: 40151      N90: 15786
*-scaffolds.fa
sequence #: 217 total length: 4776366   max length: 236245      N50: 40489      N90: 16196
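
For anyone who wants to check the N and gap counts asked about above, here is a minimal Python sketch. It is not part of ABySS; the function names (`read_fasta`, `gap_summary`) are made up for illustration, and it assumes a gap is any unbroken run of `N` or `n`. Note that, going by the stats above, the sequence count only drops from 221 to 217 between the contig and scaffold files, a net reduction of 4, while the total length grows by 63 bases.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gap_summary(lines: Iterable[str]) -> Dict[str, int]:
    """Count sequences, total bases, N bases, and runs of N (assumed gaps)."""
    summary = {"sequences": 0, "total_length": 0, "n_bases": 0, "gaps": 0}
    for _, seq in read_fasta(lines):
        runs = re.findall(r"[Nn]+", seq)  # each unbroken run counts as one gap
        summary["sequences"] += 1
        summary["total_length"] += len(seq)
        summary["n_bases"] += sum(len(run) for run in runs)
        summary["gaps"] += len(runs)
    return summary


# Toy input standing in for a scaffolds file (not real assembly data).
toy = [">s1", "ACGTNNNNACGT", ">s2", "ACNNGTNACG"]
print(gap_summary(toy))
# {'sequences': 2, 'total_length': 22, 'n_bases': 7, 'gaps': 3}
```

To run it on a real file, pass an open file handle instead of the toy list, e.g. `gap_summary(open(path))` with `path` pointing at your scaffolds FASTA.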

Thank you so much!

Best,

Xiaofei

Ben Vandervalk

Aug 24, 2015, 11:50:07 AM
to abyss...@googlegroups.com
Cross-posting my response from BioStar: https://www.biostars.org/p/154894/

Hi Xiaofei,

Consider that each perfectly repeated sequence is represented only once in the unitigs file, even though it may occur many times in the genome. It is my understanding that ABySS will output multiple copies of such repetitive sequences during the contig and scaffold stages. So, for example, if unitigs A and B both overlap a repeat unitig C, C will be merged onto the ends of both A and B when building contigs. I suspect that is the reason for the unexpected growth in total length.
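
The A/B/C example can be put into numbers with a toy Python sketch. This is only an illustration of the bookkeeping, not ABySS's actual merging code: the sequences are invented, the helper names are hypothetical, and it ignores the overlap between adjacent unitigs, so real growth per copy would be somewhat smaller than the full repeat length.

```python
from typing import Dict


def total_length(seqs: Dict[str, str]) -> int:
    """Sum of sequence lengths, as reported in the 'total length' column."""
    return sum(len(seq) for seq in seqs.values())


def merge_repeat_onto_ends(unique: Dict[str, str], repeat: str) -> Dict[str, str]:
    """Toy contig stage: append one copy of the repeat to every unique unitig."""
    return {f"{name}+C": seq + repeat for name, seq in unique.items()}


# Invented unitigs: A and B are unique sequence, C is a repeat stored once.
unique_unitigs = {"A": "ACGTACGTAC", "B": "TTGGCCAATT"}
repeat_c = "GAGAGA"

unitigs = dict(unique_unitigs, C=repeat_c)
contigs = merge_repeat_onto_ends(unique_unitigs, repeat_c)

print(len(unitigs), total_length(unitigs))   # 3 26
print(len(contigs), total_length(contigs))   # 2 32
print(total_length(contigs) - total_length(unitigs))  # 6
```

Fewer sequences but more total bases, which is the same pattern as in the unitig and contig stats: each extra copy of the repeat adds its length to the total without adding any Ns.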

As an aside, there are other things that happen during assembly that affect your total sequence length in the opposite direction (making it unexpectedly smaller):

* removing "shim" contigs (abyss-filtergraph program)

* second bubble popping stage (PopBubbles program)

* merging sequences that overlap (Overlap and PathOverlap programs)

* merging alternate paths into a consensus sequence (PathConsensus program)

If you're really curious what is happening during assembly, you can look at each step of the pipeline in the `abyss-pe` Makefile and also the output in the corresponding FASTA file (*-1.fa, *-2.fa, etc.).
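
One way to follow the intermediate files is to tabulate the same stats for each of them. Below is a rough Python sketch under some assumptions: `myassembly` is a hypothetical assembly name, the helper names are made up, and N50 is computed with the common definition (the length of the sequence at which the running total, longest first, reaches half of all bases), which may not match exactly how the numbers in the original post were produced.

```python
import glob
from typing import Dict, Iterable, List


def fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of every sequence in an iterable of FASTA lines."""
    lengths: List[int] = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)          # start a new record
        elif line and lengths:
            lengths[-1] += len(line)   # extend the current record
    return lengths


def n50(lengths: List[int]) -> int:
    """Length at which the longest-first running total reaches half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def stage_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Sequence count, total length, max length and N50 for one FASTA."""
    lengths = fasta_lengths(lines)
    return {
        "sequences": len(lengths),
        "total_length": sum(lengths),
        "max_length": max(lengths, default=0),
        "N50": n50(lengths),
    }


# "myassembly" is a placeholder; substitute the name you gave your assembly.
for path in sorted(glob.glob("myassembly-*.fa")):
    with open(path) as handle:
        print(path, stage_stats(handle))
```

Comparing consecutive rows shows at which step the total length goes up (repeat copies being added) or down (the filtering and merging steps listed above).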

- Ben
