Re: duplicate contigs?

Shaun Jackman

Feb 15, 2011, 5:12:26 PM

to Brian C. Thomas, ABySS

Hi Brian, (cc'ed to abyss-users)

It is possible for two contigs to overlap by a large amount -- and this
may be more likely in a metagenomics assembly due to the large number of
sequence variants. These large overlaps should be reported in the
${name}-contigs.dot file. Two contigs should not, however, be 100%
identical from end to end. Do you have an example of this?

Cheers,
Shaun

On Tue, 2011-02-15 at 06:10 -0800, Brian C. Thomas wrote:
> Hi Shaun,
>
> I've been using ABySS (1.2.5 and now 1.2.6) and have been really happy with how it's performing on our metagenomics data. However, we're seeing one anomaly that I can't figure out. A large percentage of the contigs have an exact 100% duplicate contig (or a 99% identical duplicate). I can't figure out why the assembler would create two identical contigs.
>
> The way we uncovered this was by mapping reads back to the ABySS contigs. We were using Newbler (gsMapper/runMapping), which doesn't map reads to either sequence when it finds two identical ones. Then I verified the presence of the duplicate contigs in the contigs file by doing an all-vs-all BLAST.
>
> Any ideas? Could I be missing a parameter? The assemblies themselves are standard abyss-pe runs with paired-end reads (not trimmed). I can send you any output if you have time or need to look into it.
>
> Thanks in advance.
>
>
> Sincerely,
>
> bct

Brian C. Thomas

Feb 15, 2011, 6:38:17 PM

to Shaun Jackman, ABySS

Hi Shaun,

I looked into the *.dot file. It's 4.5 million lines. Is there something I could parse out of it to find the large overlaps? I doubt I could open it in a Graphviz app at this size (but I'm open to suggestions if you know of one).

I have 100% matches, but you are correct: they are not end to end. For example, a 5000 bp contig may be found within a different, larger contig with perfect identity. I guess I'm confused about how that can happen.

Thanks!

bct

Shaun Jackman

Feb 15, 2011, 6:49:13 PM

to Brian C. Thomas, ABySS

Hi Brian,

If the 5000-bp contig is contained entirely within a larger contig, that
is also an error. Do you have an example of that?

For the dot file, the lines containing `d=' indicate two contigs that
overlap by the specified amount (a negative value indicates overlap, d
stands for distance). You can use grep to find the large overlaps. For
example, to find all overlaps of at least 1000 bp:
egrep 'd=-[0-9]{4}' ${name}-contigs.dot
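
The pattern above matches d=- followed by four or more digits. To rank the overlaps by size rather than only filter them, the same idea can be sketched with awk. This is a minimal sketch, assuming each overlap is written on its edge line as d=-N, as described above. The function name large_overlaps and the default threshold of 1000 are illustrative and not part of ABySS.

```shell
# Hypothetical helper (not part of ABySS): print edges that overlap by at
# least MIN bp, largest overlap first, with the overlap as the first column.
# Usage: large_overlaps FILE.dot [MIN]
# Assumes the overlap is written as d=-N on the edge line.
large_overlaps() {
    awk -v min="${2:-1000}" '
        match($0, /d=-[0-9]+/) {
            # Skip the 3 characters of "d=-" to get the overlap length.
            overlap = substr($0, RSTART + 3, RLENGTH - 3) + 0
            if (overlap >= min) print overlap "\t" $0
        }
    ' "$1" | sort -k1,1nr
}
```

For example, large_overlaps ${name}-contigs.dot 10000 would list only the edges that overlap by 10 kb or more.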

Cheers,
Shaun

Shaun Jackman

Feb 16, 2011, 3:51:22 PM

to Brian C. Thomas, ABySS

Hi Brian, (cc'ed to the ABySS mailing list)

You can set the parameters e and c to remove sequence with lower
coverage. If you did not specify these parameters, ABySS will choose
values for these parameters, and the chosen values will be specified in
the log file. There are no options to assemble the less-represented
sequence.

Cheers,
Shaun

On Wed, 2011-02-16 at 12:45 -0800, Brian C. Thomas wrote:
> OK, I see that now too. Is there a way to tell ABySS about the coverage? For example, if I wanted to assemble only the lower-coverage representatives in the dataset? In Velvet there is -exp_cov. In general, how would you expect ABySS to perform on samples with varying levels of coverage?
>
> thanks
>
> bct

>
> On Feb 16, 2011, at 11:51 AM, Shaun Jackman wrote:
>
> > Hi Brian,
> >

> > The two sequences differ by 6 bp in 35889 bp. That's not a lot granted,
> > but they are different!
> >
> > Cheers,
> > Shaun

Nathaniel Street

Feb 17, 2011, 8:27:40 AM

to Shaun Jackman, Brian C. Thomas, ABySS

Hi Shaun

Sorry to ask about 'over assembly' again but it's the main issue I'd like to get a grip on at the moment (that and optimal use of 454 data in a hybrid assembly).

If we assume that the sequenced sample referred to in this message was a single individual but a diploid with some heterozygosity, what's the best way to tweak the parameters to control whether such large contig overlaps get merged/collapsed? I would definitely prefer to have a single contig rather than two 35 Kb contigs that differ by only 6 bp. If I look at my own assembly, I have 235 contigs that overlap by more than 10 Kb and the ones that I've looked at have about the number of differences between them that I would expect from the level of polymorphism I have.

Maybe the best thing is to pass the contigs through something like CAP3 if it's not possible to control this by assembly parameter tweaking?

Many thanks

Nathaniel

--
Nathaniel Street
Umeå Plant Science Centre
Department of Plant Physiology
Umeå University
SE901-87 Umeå
SWEDEN

email: nathanie...@plantphys.umu.se
skype: nathaniel_street
tel: 0046907865473
fax: 0046907866676

www.popgenie.org

Shaun Jackman

Feb 17, 2011, 1:19:08 PM

to Nathaniel Street, Brian C. Thomas, ABySS

Hi Nathaniel,

`Over assembly' is a problem that I'm working on at this very moment.
Sequences with identity more than p are popped. The default is 0.9 and
it would be reasonable to decrease p to somewhere between 0.6 and 0.8.
You can also increase the seed contig size parameter, s. The default is
100. You could try values of 200, 500 and 1000 (or anything in between).

The sequences that are not popped and are at least s bp are then
extended using their paired-end information, which is when the extremely
large duplication appears, because two 200 bp contigs that are similar
but below the identity threshold can then be extended to thousands of
bp.
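
One way to try the values suggested above is to generate one abyss-pe command per seed size and compare the resulting assemblies. This is a minimal sketch: only the parameters p and s come from this message, while the function name, k=31, p=0.8, the read file names and the assembly names are placeholders rather than recommendations. Add whatever other parameters your runs already use. The function only prints the commands so they can be reviewed before running.

```shell
# Hypothetical helper: print one abyss-pe command per seed size s.
# k, p, the read files and the name prefix are placeholder values.
sweep_commands() {
    for s in 200 500 1000; do
        printf "abyss-pe name=asm-s%s k=31 p=0.8 s=%s in='reads_1.fq reads_2.fq'\n" "$s" "$s"
    done
}
```

Running sweep_commands prints the three commands; piping the output to sh would execute them one after another.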

Cheers,
Shaun
