Hi Xiaofei,
Consider that each perfectly repeated sequence is represented only once in the unitigs file, even though it may occur many times in the genome. It is my understanding that ABySS will output multiple copies of such repetitive sequences during the contig and scaffold stages. So for example, if unitigs A and B both overlap a repeat unitig C, C will be merged onto the ends of both A and B when building contigs. I suspect that is the reason for the unexpected growth in total length.
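To make the arithmetic concrete, here is a toy sketch in Python. It is not ABySS code: the unitig names, lengths, and contig paths are made up for illustration, and it ignores the overlap between adjacent unitigs.

```python
# Toy model only, not ABySS code. Names and lengths are hypothetical.
unitig_lengths = {"A": 1000, "B": 1200, "C": 300}  # C is the repeat

# Hypothetical contig paths: the repeat C is merged onto both A and B.
contig_paths = [["A", "C"], ["B", "C"]]


def total_length(paths, lengths):
    """Sum unitig lengths along every path (overlaps between unitigs ignored)."""
    return sum(lengths[name] for path in paths for name in path)


# In the unitigs file, C is counted once.
unitig_total = sum(unitig_lengths.values())

# In the contigs, C is counted once per path that contains it.
contig_total = total_length(contig_paths, unitig_lengths)

print("unitigs:", unitig_total)  # 2500
print("contigs:", contig_total)  # 2800
```

The 300 bp difference is exactly one extra copy of C.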
As an aside, there are other things that happen during assembly that affect your total sequence length in the opposite direction (making it unexpectedly smaller):
* removing "shim" contigs (abyss-filtergraph program)
* second bubble popping stage (PopBubbles program)
* merging sequences that overlap (Overlap and PathOverlap programs)
* merging alternate paths into a consensus sequence (PathConsensus program)
If you're really curious what is happening during assembly, you can look at each step of the pipeline in the `abyss-pe` Makefile and also the output in the corresponding FASTA file (*-1.fa, *-2.fa, etc.).
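If it helps, a small script along these lines could report the sequence count and total length of each intermediate FASTA file, so you can see at which stage the total changes. This is only a sketch: the function name `fasta_total_length` is my own, and it assumes plain uncompressed FASTA input.

```python
import sys


def fasta_total_length(path):
    """Return (number of sequences, total bases) for a plain FASTA file."""
    n_seqs = 0
    total = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                n_seqs += 1          # header line starts a new sequence
            else:
                total += len(line)   # sequence line: count its bases
    return n_seqs, total


if __name__ == "__main__":
    # Print one tab-separated row per file given on the command line.
    for path in sys.argv[1:]:
        n_seqs, total = fasta_total_length(path)
        print(f"{path}\t{n_seqs}\t{total}")
```

Saved under a name of your choosing, you would run it with the FASTA files from your assembly directory as arguments (for example, everything matching `*-1.fa`, `*-2.fa`, and so on) and compare the totals between stages.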
- Ben