Abyss-pe assembler produces contigs that contain the same sequence


Anthony

Jun 21, 2011, 1:06:58 PM
to ABySS
I think I may be assembling some paired-end Illumina data incorrectly,
since there are often contigs that are essentially a duplication of a
significant block (up to 100kb) from another contig. These are regions
I know not to be repeat regions in the genome, so the reads should have
been assembled into a single contig.

The command I'm running is
abyss-pe k=31 n=14 in='reads_1.fastq reads_2.fastq' name=abyss_assembly


Any tips as to how I can get these contigs to merge? Is it
something to do with popping bubbles?

Thanks

Anthony

Anthony

Jun 21, 2011, 3:07:54 PM
to ABySS
To follow up, this has occurred twice with 2 bacterial genomes, one with
few repeats and one which is more repetitive. This caused
misinterpretation of data until we realised what the problem was and
removed the duplicated sequences.

I am very keen to solve this problem since ABySS gives better N50
scores than Velvet, but this for me is a show-stopper at the moment.


Anthony

Shaun Jackman

Jun 21, 2011, 4:52:51 PM
to Anthony, ABySS
Hi Anthony,

What version of ABySS are you running? Is the duplicated region the full
length of the contig or a portion of the contig? Is the duplicated
region perfectly 100% identical or are there any mismatches?

Cheers,
Shaun

Anthony

Jun 21, 2011, 5:02:11 PM
to ABySS
Hi Shaun

I am running the latest version of ABySS 1.2.7.
The duplicated regions are portions of larger contigs. The duplicated
region is highly similar to the larger contig. It may be identical but
I don't have access to the data now being at home. Is there other
output from ABySS that may help debug the problem?

Thanks Anthony

Shaun Jackman

Jun 21, 2011, 5:06:44 PM
to Anthony, ABySS
Hi Anthony,

It will affect the troubleshooting a great deal depending on whether the
duplicated regions are perfectly 100% identical or very nearly
identical. I'll wait to hear back from you.

What value are you using for the parameter s (seed contig size)? If you
haven't specified a value, the default is 100.

Cheers,
Shaun

Anthony

Jun 21, 2011, 5:22:09 PM
to ABySS
OK, I will get back to you with that data.
By the way, I mapped the same reads used to build the contigs back to
the contigs themselves using bowtie, set so that reads mapping to more
than one location are not mapped at all. I get no coverage in the
matching region of either the larger contig or the small contig that
is the duplicate, which I suspect means they are identical. I am
running another mapping overnight where reads will map randomly if
they could map in more than 1 place. I expect to see a halving of the
average coverage in the duplicated region.

I don't specify the s parameter - is this something worth playing
with?

Cheers Anthony

Anthony

Jun 22, 2011, 5:56:42 AM
to ABySS
OK so here's the update on what's going on

I have 1 larger contig (280077bp) and a smaller contig (143116bp).
The first 130089 bases of the smaller contig seem to be an exact match
(100% identity) to bases 149989 to 280077 in the larger contig, with a
small break in the middle. See the schematic in this picture
http://cl.ly/3s1V0b322O2P2501211P and the blast result here
http://1.usa.gov/jmdnYe

Mapping the reads back to the contigs gives no coverage in the
locations listed above (large contig: 149989 to 280077; small contig:
1 to 130089) when the mapper (bowtie) is told to exclude reads that
map to more than 1 location. When this option is turned off, the
coverage in these regions is half that of the adjacent sequence (e.g.
bases 1 to 149988 in the large contig).

This convinces me that this is a false duplicate and the two contigs
should have been merged into 1 the length of the larger contig.

Any help in solving this worrying occurrence would be appreciated.
Cheers Anthony


Shaun Jackman

Jun 22, 2011, 3:08:26 PM
to Anthony, ABySS
Hi Anthony,

In the file ${name}-5.path, the first column is the contig ID, and the
rest of the line is the IDs of the subsequences that compose that
contig. Can you find the two lines for your two contigs and report them
here?
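
For anyone wanting to pull those lines out programmatically, here is a minimal Python sketch. It assumes only the layout described above (contig ID first, then the IDs of the subsequences); the example lines and contig IDs are made up for illustration, and `read_paths` is a hypothetical helper name, not part of ABySS.

```python
from typing import Dict, Iterable, List


def read_paths(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each contig ID to the list of oriented sub-contig IDs on its line."""
    paths = {}
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            paths[fields[0]] = fields[1:]
    return paths


# Hypothetical excerpt in the layout described above: contig ID, then its path.
example = [
    "100 7+ 12- 3+",
    "101 3- 12+ 9+",
]
paths = read_paths(example)
for contig_id in ("100", "101"):
    print(contig_id, " ".join(paths[contig_id]))
```

With a real assembly the same function could be fed an open file handle, for example `read_paths(open("abyss_assembly-5.path"))`, given the `name=abyss_assembly` used earlier in the thread.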

Cheers,
Shaun

Anthony

Jun 22, 2011, 5:23:32 PM
to ABySS
Hi Shaun

The larger contig is 3661 and the corresponding line in the -5.path
file is:
3661 3490- 1260+ 3435+ 55+ 2602- 2207+ 1529- 3118+ 1565+ 99+ 2051+ 2626- 597+ 721+ 395+ 1033- 2770- 16+ 2456+ 253- 2137- 1430- 1794- 3539+ 2166+ 3542+ 3005- 2634- 1448- 986- 1427- 93N 1056- 734- 754- 2457- 658- 2072+ 1273- 247- 1097+ 1099+ 342+ 2295+ 3190- 1874+ 920+ 2601- 3190- 1874+ 1338- 1749- 3316-

The smaller contig is 3649 and the corresponding line in the -5.path
file is:
3649 3316+ 1749+ 1338+ 1874- 3190+ 2601+ 920- 1874- 3190+ 2295- 342- 1099- 1097- 247+ 1273+ 2072- 658+ 2457+ 754+ 734+ 1056+ 93N 1427+ 986+ 1448+ 2634+ 3005+ 3542- 2166- 3539- 1794+ 1430+ 2137+ 253+ 2456- 1591+ 3537+ 3171- 119+ 1526+

Thanks for looking into this

Anthony



Shaun Jackman

Jun 22, 2011, 5:35:26 PM
to Anthony, ABySS
Hi Anthony,

The two contigs have this sequence in common:


2456+ 253- 2137- 1430- 1794- 3539+ 2166+ 3542+ 3005- 2634- 1448- 986- 1427- 93N 1056- 734- 754- 2457- 658- 2072+ 1273- 247- 1097+ 1099+ 342+ 2295+ 3190- 1874+ 920+ 2601- 3190- 1874+ 1338- 1749- 3316-

This sequence is unique to 3661:


3490- 1260+ 3435+ 55+ 2602- 2207+ 1529- 3118+ 1565+ 99+ 2051+ 2626- 597+ 721+ 395+ 1033- 2770- 16+

and this sequence is unique to 3649 (reverse complement):
1526- 119- 3171+ 3537- 1591-
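
The comparison above can be reproduced with a short Python sketch: reverse-complement the path of 3649 (reverse the order and flip each orientation), then count how many trailing elements it shares with the path of 3661. This is only an illustration of the comparison, not how ABySS itself works; the function names are hypothetical, and treating the `93N` token as an unoriented gap that is kept as-is is an assumption.

```python
from typing import List


def reverse_complement(path: List[str]) -> List[str]:
    """Reverse a path and flip each orientation; tokens such as 93N are kept."""
    flip = {"+": "-", "-": "+"}
    return [
        node[:-1] + flip[node[-1]] if node[-1] in flip else node
        for node in reversed(path)
    ]


def shared_suffix(a: List[str], b: List[str]) -> int:
    """Number of trailing elements that two paths have in common."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


# Paths for contigs 3661 and 3649 as posted earlier in the thread.
path_3661 = (
    "3490- 1260+ 3435+ 55+ 2602- 2207+ 1529- 3118+ 1565+ 99+ 2051+ 2626- "
    "597+ 721+ 395+ 1033- 2770- 16+ 2456+ 253- 2137- 1430- 1794- 3539+ "
    "2166+ 3542+ 3005- 2634- 1448- 986- 1427- 93N 1056- 734- 754- 2457- "
    "658- 2072+ 1273- 247- 1097+ 1099+ 342+ 2295+ 3190- 1874+ 920+ 2601- "
    "3190- 1874+ 1338- 1749- 3316-"
).split()
path_3649 = (
    "3316+ 1749+ 1338+ 1874- 3190+ 2601+ 920- 1874- 3190+ 2295- 342- 1099- "
    "1097- 247+ 1273+ 2072- 658+ 2457+ 754+ 734+ 1056+ 93N 1427+ 986+ "
    "1448+ 2634+ 3005+ 3542- 2166- 3539- 1794+ 1430+ 2137+ 253+ 2456- "
    "1591+ 3537+ 3171- 119+ 1526+"
).split()

rc_3649 = reverse_complement(path_3649)
n = shared_suffix(path_3661, rc_3649)
shared = path_3661[len(path_3661) - n:]
unique_3661 = path_3661[:len(path_3661) - n]
unique_3649_rc = rc_3649[:len(rc_3649) - n]

print(len(shared), "shared elements")
print("unique to 3661:", " ".join(unique_3661))
print("unique to 3649 (reverse complement):", " ".join(unique_3649_rc))
```

This sketch only handles a shared block at the end of both paths, which is the situation here; a shared block in the middle of a path would need a more general search.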

What are the sizes of the contigs unique to 3649? You can find these
contigs in the ${name}-[345].fa files.

As a last resort, you could edit the ${name}-5.path file manually to
remove the duplicated sequence and run MergeContigs to generate a new
${name}-contigs.fa file.

Cheers,
Shaun

Anthony

Jun 22, 2011, 5:49:51 PM
to ABySS
Hi Shaun

Contig sizes are
1526: 1250bp
119: 31bp
3171: 5753bp
3537: 81bp
1591: 6062bp
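
As a rough consistency check, these sizes can be compared with the lengths reported earlier in the thread. The Python sketch below assumes that each sub-contig overlaps its neighbour in the path by k - 1 = 30 bases (including the join between 1591 and the duplicated block); that overlap is an assumption about how the contigs are joined, not something stated in this thread.

```python
K = 31  # k from the abyss-pe command earlier in the thread

# Sizes reported above for the sub-contigs unique to contig 3649.
unique_sizes = {"1526": 1250, "119": 31, "3171": 5753, "3537": 81, "1591": 6062}

# Assumption: each sub-contig overlaps the next element of the path by
# k - 1 bases, including the join between 1591 and the shared block.
overlap = K - 1
unique_bases = sum(unique_sizes.values()) - overlap * len(unique_sizes)

# Lengths reported earlier: contig 3649 is 143116 bp, and its first part
# duplicates bases 149989..280077 of contig 3661.
duplicated = 280077 - 149989 + 1
expected_unique = 143116 - duplicated

print(unique_bases, expected_unique)
```

Under that assumption both values come out to 13027, so the five sub-contigs account for all of contig 3649 that lies outside the 130089 bp duplicated block.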

I can't really afford to manually search for these errors, since I have
tens, sometimes hundreds, of genomes to assemble and I only stumbled on
this error by careful examination. Why do 2 contigs end up having the
same contigs within them? Is there not an automated way to detect and
remove, or at least report, these? I love ABySS because it's fast, not
memory hungry and produces great N50 scores, but now that I have seen
this error twice I am worried about its accuracy and about other errors
I may have missed in the past. For the moment I have to go back to using
Velvet, but I would like to be able to use ABySS for the reasons listed.

Thanks again for your rapid responses

Anthony

Alejandro Sanchez

Jun 22, 2011, 7:14:13 PM
to Anthony, ABySS
Hi,

I've seen this behaviour in ABySS before. As Shaun said, there are
specific parts of the contigs that are unique and differ in the
path, I guess. I've seen some cases where you have almost the same
contig but in reverse complement. Since there is no strand-specific
sequencing, shouldn't these contigs be the same? Of course, there are
differences, but they could be due to sequencing errors or maybe true
SNPs...

Here we have a team for manual curation, but like Anthony said, it
would be great to have these things reported, or to have a tool/script
to perform the filtering.

Cheers.

Shaun Jackman

Jun 22, 2011, 7:25:04 PM
to Anthony, ABySS
Hi Anthony,

The paired-end reads from the smaller contig 1591 to a larger contig
resulted in that smaller contig being extended and the duplication of
the larger contig. Reducing this sort of duplication is an active area
of development for ABySS.

The file ${name}-contigs.dot lists overlapping contigs. For example in
the following line, the contigs 10140 and 10455 overlap by 2516 bp.
"10140+" -> "10455+" [d=-2516]

How long are your reads, and what is your coverage depth?
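
As a starting point for reporting suspicious overlaps automatically, a Python sketch along these lines could scan the `${name}-contigs.dot` file for edges with a large negative `d`. It assumes edge lines look exactly like the example line above; real files may carry additional attributes that this regular expression would not match. The `min_overlap` threshold of 1000 and the second example edge are made up for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Matches edge lines of the form shown above: "10140+" -> "10455+" [d=-2516]
EDGE = re.compile(r'"(\d+[+-])"\s*->\s*"(\d+[+-])"\s*\[d=(-?\d+)\]')


def large_overlaps(
    lines: Iterable[str], min_overlap: int
) -> List[Tuple[str, str, int]]:
    """Return (source, target, overlap) for edges overlapping by >= min_overlap bp."""
    hits = []
    for line in lines:
        match = EDGE.search(line)
        if match and int(match.group(3)) <= -min_overlap:
            hits.append((match.group(1), match.group(2), -int(match.group(3))))
    return hits


example = [
    '"10140+" -> "10455+" [d=-2516]',
    '"7+" -> "8-" [d=-30]',  # hypothetical edge with a short overlap
]
print(large_overlaps(example, min_overlap=1000))
```

A suitable threshold would depend on the data; anything much longer than the read length is probably worth a closer look.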

Cheers,
Shaun

Anthony

Jun 23, 2011, 9:11:01 AM
to ABySS
The reads are 75bp Illumina reads with a coverage of approximately
25x. This was halved to about 12x in the duplicated regions.

Cheers Anthony

Shaun Jackman

Jun 23, 2011, 3:25:07 PM
to Anthony, ABySS
Hi Anthony,

k=31 is quite small for 75-bp reads with 25x coverage. Have you tried
larger values of k? Say around 50?
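
One way to try several values of k is to generate one command per value and keep the outputs apart by name. The Python sketch below only builds and prints the command lines, reusing the parameters from the command posted at the start of the thread. The k values other than 31 and the `abyss_k<k>` output names are illustrative choices, not recommendations.

```python
import shlex

reads = "reads_1.fastq reads_2.fastq"


def abyss_command(k: int) -> str:
    """Build an abyss-pe command line like the original, with a per-k output name."""
    args = ["abyss-pe", f"k={k}", "n=14", f"in={reads}", f"name=abyss_k{k}"]
    return " ".join(shlex.quote(arg) for arg in args)


# Illustrative k values only: 31 is the original run, the others are larger.
for k in (31, 40, 50):
    print(abyss_command(k))
```

The resulting assemblies could then be compared on N50 and on whether the duplicated block is still present.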

Cheers,
Shaun
