Contig size produced by ABySS

86 views
Skip to first unread message

Steve Taylor

unread,
Jul 13, 2011, 8:42:27 AM7/13/11
to abyss...@googlegroups.com
Hi,

I am trying to assemble 171194090 reads of 100bp paired-end Illumina of a particular mouse strain (~2.8 x 10^9bp). I
realised the coverage will be quite low but I am interested in getting at least some larger contigs :-).

I ran

abyss-pe \
ABYSS_OPTIONS='--illumina-quality' \
k=49 n=10 \
np=$MPI_NPROCS \
in='xxx.fastq' \
name=abysstest2

on a cluster, which generated:

abysstest2-1.adj
abysstest2-1.fa
abysstest2-1.path
abysstest2-3.adj
abysstest2-3.fa
abysstest2-3.sam.gz
abysstest2-bubbles.fa
abysstest2.e55246
abysstest2-indel.fa
abysstest2.o55246

I get the following output to STDERR

The minimum coverage of single-end contigs is 2.
The minimum coverage of merged contigs is 2.
abysstest2-3.hist: No such file or directory
make: *** [abysstest2-3.dist] Error 1
make: *** Deleting file `abysstest2-3.dist'


Is abysstest2-3.fa the final assembled contigs? I am puzzled because some of the contigs are as small as 48bp, which I
can understand if this is a intermediate file (so I guess it would be related to kmer size)but not if this is the final
assembly because surely the minimum contig length should be 100bp?

Thanks for any advice,

Steve

Tomoaki NISHIYAMA

unread,
Jul 13, 2011, 9:11:10 AM7/13/11
to ABySS, Steve Taylor, Tomoaki NISHIYAMA
Hi Steve,

Did you check error logs (abysstest2.e55246)?
The file list you presented looks as if single end assembly finished but,
the procedure aborted before paired-end assembly phase.
Since the genome size is quite large, you need a lot of memory in a single node
that is used to align the reads to the single end contigs (-3.fa) with KAligner.
The single assembly phase (ABYSS-P) works with distributed memory over the
cluster through MPI, but the alignment does not work in such way.

If you don't have sufficient memory for KAligner, you might try bwa or bowtie
in stead of KAligner. You will be able to find that information in this group.

Best regards,
--
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

Steve Taylor

unread,
Jul 13, 2011, 9:52:29 AM7/13/11
to abyss...@googlegroups.com, Steve Taylor, Tomoaki NISHIYAMA
Hi,

Thanks for your reply.

abysstest2.e55246 is what I posted i.e.


The minimum coverage of single-end contigs is 2.
> The minimum coverage of merged contigs is 2.
> abysstest2-3.hist: No such file or directory
> make: *** [abysstest2-3.dist] Error 1
> make: *** Deleting file `abysstest2-3.dist'

How reliable is the bowtie alignment? What is the downside?

Steve


Eskeatnaf Mulugeta

unread,
Jul 13, 2011, 9:59:08 AM7/13/11
to Tomoaki NISHIYAMA, ABySS, Steve Taylor, Tomoaki NISHIYAMA
Hi Steve,
As Tomoki, explained it, your assembly stopped after the first stage,
after the ABYSS assembly, which is a parallel one. You can check the log
file "abysstest2.o55246" to see where it hangs. I think the it stopped
because KALigner does not have enough RAM to calculate and after that make
the dist and hist files.

You can try the following:
1 you can to do the quality control before you the assembly, which is very
important
2 try to run every lib in alone and align them independently and run
abyss-pe after that
3 try to test different kmers, in your case increase it to something above
50.

Happy assembly :-)
cheers
=====================================
Eskeatnaf Mulugeta

Erasmus university medical center, Rotterdam, The Netherlands

Matthew MacManes

unread,
Jul 13, 2011, 10:05:24 AM7/13/11
to abyss...@googlegroups.com, abyss...@googlegroups.com, Steve Taylor, Tomoaki NISHIYAMA
Interesting question: Has anybody looked at the difference in assembly quality between ABYSS-P + KAligner, BWA or Bowtie?

Maybe different depending on dataset, but not sure.

Matt

Kevin Karplus

unread,
Jul 13, 2011, 11:42:52 AM7/13/11
to abyss...@googlegroups.com

Steve,

For a 2x assembly to mouse, you are best off doing a mapping assembly,
not a de novo assembly. No de novo method can do much with only 2x data.

Unless you are specifically looking for structural variation (in which
case 2x paired-end data won't tell you very much), you should
porobably try using BWA and samtools mpileup to map your reads to a
known mouse genome.

------------------------------------------------------------
Kevin Karplus kar...@soe.ucsc.edu http://www.soe.ucsc.edu/~karplus
Professor of Biomolecular Engineering, University of California, Santa Cruz
Editorial Board, Bioinformatics (Oxford University Press)
Affiliations for identification only.

Alejandro Sanchez

unread,
Jul 13, 2011, 11:52:54 AM7/13/11
to abyss...@googlegroups.com
Hi Steve,

In addition to Kevin's comment, I would say as well that not even 2x
coverage would be enough for mapping... In our experience you want at
least 10x for a decent mapping, comparative assembly or SNP call.

Cheers.


--
Alejandro Sanchez-Flores
Team133 Parasite Genomics
Wellcome Trust Sanger Institute
Cambridge, UK.

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

Steve Taylor

unread,
Jul 13, 2011, 1:19:39 PM7/13/11
to abyss...@googlegroups.com
Hi,

Thanks for all your helpful comments.

I am not actually trying to completely assemble the genome. I trying to locate a transgenic sequence in a strain of mouse. I am actually trying to just get some contigs that are large enough so I can detect mouse and a section of transgene in a single
contig and then take the mouse section and map that back to mouse genome. I have been trying a few other ways including looking at mapping one end of the pair to the transgenic sequence and the other to the mouse sequence but it isn't revealing much. Sadly, I don't have a complete assembly of the mouse strain or the insert (don't ask! :-))  but I have some 454 reads (for the transgene, which are partially assembled) and Illumina reads for the mouse.

If anyone on this list has any further suggestions that'd be great!

Steve

Rod Docking

unread,
Jul 13, 2011, 1:57:26 PM7/13/11
to abyss...@googlegroups.com, Steve Taylor
Hi Steve et. al:

First, thanks everyone in the group for your comments!

> abysstest2-3.hist: No such file or directory
> make: *** [abysstest2-3.dist] Error 1
> make: *** Deleting file `abysstest2-3.dist'
>
>

> Is abysstest2-3.fa the final assembled contigs? I am puzzled because some of the contigs are as small as 48bp, which I
> can understand if this is a intermediate file (so I guess it would be related to kmer size)but not if this is the final
> assembly because surely the minimum contig length should be 100bp?

As others have noted, abysstest2-3.fa is the output of the single-end assembly stage. It may be the case that KAligner failed due to memory issues, but with coverage at 2x, it's more likely that KAligner | ParseAligns failed to produce a proper histogram file (i.e., there was insufficient data).

If you'd like to work around this, you can create that file yourself, and give it the expected insert size distribution for your library - the format of the file is tab-separated text, first column insert size, second column count. Once you create that file, you can re-run abyss-pe and it should finish fairly quickly.

> Interesting question: Has anybody looked at the difference in assembly quality between ABYSS-P + KAligner, BWA or Bowtie?


Shaun can comment on this a bit more, but in our own testing, we prefer the KAligner results. Mostly because KAligner operates at the k value of the assembly, while the others use the full reads. E.g., for a k48 assembly using 100bp reads, bwa and bowtie will have difficulty aligning to the (many) 48bp contigs that are potentially resolvable repeats. For very large assemblies though, KAligner doesn't scale as well.

Again, as others have mentioned, 2x coverage of a full genome will make assembly very difficult. You could try treating it like a transcriptome assembly: assemble many k-values, and screen all the contigs for your transgene sequence. Some other ideas would be:

- Find all the reads that align just to your transgene sequence, and assemble just those reads with their mates
- Try to find reads spanning the mouse-transgene breakpoint in your read-to-genome alignments (i.e. search for the known transgene sequence - ideally it would be found in reads that were partially aligned to the genome, and partially soft-clipped).

Good luck,
Rod

Kevin Karplus

unread,
Jul 13, 2011, 2:06:06 PM7/13/11
to abyss...@googlegroups.com

Steve,

In a 2x set, you may not have any read pairs that span the transgene-mouse boundaries.

Your best bet is probably to map all the reads (individually) to the
454 reads of the transgene, and then look to see if one or two have a
paired end that maps to mouse.

If your coverage were higher, I'd recommend iteratively searching and
assembling, as I've done to assemble the banana slug mitochondrion
[ https://banana-slug.soe.ucsc.edu/computer_resources:assemblies:mitochondrion ],
but with only 2x coverage you won't be able to assemble the transgene,
and you only expect one or two reads to cross the boundary.

You are probably better off doing a couple Sanger sequencing reactions
with the starting primers pointing the reads out from the ends of the
transgene. High-throughput sequencing is a powerful technology for
getting lots of data, but it is not always the best for answering
questions about specific loci.

------------------------------------------------------------
Kevin Karplus kar...@soe.ucsc.edu http://www.soe.ucsc.edu/~karplus
Professor of Biomolecular Engineering, University of California, Santa Cruz

ex-Graduate Director, Biomolecular Engineering and Bioinformatics

Kevin Karplus

unread,
Jul 13, 2011, 2:20:23 PM7/13/11
to sta...@molbiol.ox.ac.uk, abyss...@googlegroups.com
Following up on Rod's comment. If you have 454 data and need a
histogram of insert size, try mapping the read pairs with BWA to the
454 reads.

We did that fairly successfully with our banana slug data, where we
had 10x of paired ends and 0.1x of 454.
https://banana-slug.soe.ucsc.edu/bioinformatic_tools:bwa#determining_paired-end_insert_size

Steve Taylor

unread,
Jul 14, 2011, 4:17:15 AM7/14/11
to abyss...@googlegroups.com
Some great suggestions. Thank you all.

Just a minor point...I thought that a coverage would be

num reads * length of read/expected genome size

So for my data:

171194090*100/2716965481
=~6.3

?

Thanks,

Steve
Reply all
Reply to author
Forward
0 new messages