Impact of high-coverage contaminating sequences on ABySS assembly

Julie Hussin

Jun 19, 2014, 7:28:47 AM

to abyss...@googlegroups.com

Hi,

I have been using ABySS to reconstruct contigs missing from the platypus reference genome.
I have around 30 samples, and I am assembling contigs from the reads that did not map to the reference genome.
In most samples ABySS gives great results, but I have noticed a specific issue that I would like to understand better.

In some individuals, the samples are contaminated with viral or bacterial DNA. When I run ABySS on these samples, I can reconstruct the entire genome of these micro-organisms, but nothing else! I then have to add a cleaning step: I realign all the unmapped reads to these reconstructed genomes, discard the reads that align, and relaunch ABySS on the remainder. In this second round, I get many contigs that are likely platypus DNA.
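The cleaning step described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: it replaces the real read aligner with an exact shared k-mer test, and the function names (`remove_contaminant_reads`, `build_index`) and the choice of `k=8` are hypothetical, not part of ABySS or of any aligner.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def kmers(seq, k):
    """Set of all k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(contigs, k):
    """Collect k-mers from both strands of the contaminant contigs."""
    index = set()
    for contig in contigs:
        index |= kmers(contig, k)
        index |= kmers(revcomp(contig), k)
    return index

def remove_contaminant_reads(reads, contaminant_contigs, k=8):
    """Keep only reads that share no k-mer with the contaminant contigs.

    Stand-in for "realign and discard reads that map"; a real pipeline
    would use an aligner rather than exact k-mer matching.
    """
    index = build_index(contaminant_contigs, k)
    return [read for read in reads if not (kmers(read, k) & index)]

# Hypothetical data: one contaminant contig from the first assembly round.
contaminant = "ACGTTGCAAGGCTTAACCGGATCG"
reads = [
    contaminant[3:15],            # read from the contaminant, forward strand
    revcomp(contaminant)[4:15],   # read from the contaminant, reverse strand
    "TTTTTTTTTTTTAAAA",           # unrelated read, kept for the second round
]
cleaned = remove_contaminant_reads(reads, [contaminant])
```

The reads left in `cleaned` are what would go into the second assembly round.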

Although this iterative approach (mostly) solves the problem, I am wondering why ABySS is not able to pick up the platypus contigs in the first place. I notice that this happens when the contaminating genomes are highly covered. Is there something in the ABySS algorithm that treats only the bacterial/viral k-mers as signal and the others as noise, because of the difference in coverage between the sequences? This seems a likely explanation to me, but I am not sure at which step in the ABySS algorithm this would happen, and I am curious to understand the issue further.

Many thanks and best wishes,

Julie

--
Julie Hussin, PhD
Human Frontiers Postdoctoral Fellow
WTCHG, University of Oxford

Ben Vandervalk

Jun 19, 2014, 2:12:42 PM

to Julie Hussin, abyss...@googlegroups.com

Hi Julie,

Yes, your hypothesis about higher read coverage of the contaminating genomes is the most likely explanation.

During the first stage of assembly (the de Bruijn graph stage), ABySS automatically calculates a minimum coverage cutoff 'c' based on where the peak lies in the k-mer coverage histogram. Contigs that have a mean coverage lower than 'c' are then assumed to be erroneous and are excluded from the assembly.
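To make the effect concrete, here is a toy Python sketch of how a peak-derived cutoff can discard low-coverage contigs. It is not ABySS's actual heuristic: the rule "cutoff = 10% of the peak coverage", the `min_coverage` floor, and all data and names are assumptions made purely for illustration.

```python
from collections import Counter

def peak_coverage(kmer_coverages, min_coverage=3):
    """Most common k-mer coverage, ignoring the low-coverage error tail."""
    hist = Counter(c for c in kmer_coverages if c >= min_coverage)
    # Ties are broken in favour of the higher coverage value.
    return max(hist, key=lambda c: (hist[c], c))

def filter_contigs(contig_mean_cov, cutoff):
    """Keep contigs whose mean k-mer coverage is at least the cutoff."""
    return {name: cov for name, cov in contig_mean_cov.items() if cov >= cutoff}

# Hypothetical sample: a contaminant at ~400x dominates the histogram,
# host k-mers sit at ~8x, and sequencing errors at 1x.
kmer_coverages = [400] * 50 + [399] * 30 + [401] * 30 + [8] * 20 + [7] * 10 + [1] * 40
contigs = {"viral_1": 400.0, "platypus_1": 8.0, "platypus_2": 7.5, "error_1": 1.2}

peak = peak_coverage(kmer_coverages)
cutoff_auto = 0.1 * peak                            # assumed rule, for illustration only
kept_auto = filter_contigs(contigs, cutoff_auto)    # only the contaminant survives
kept_manual = filter_contigs(contigs, 2)            # a manual cutoff of 2 keeps the host contigs
```

In this toy example the peak sits on the contaminant, so the derived cutoff lands far above the host coverage and the platypus contigs are dropped along with the true errors. A low manual cutoff keeps them.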

Ordinarily 'c' is calculated automatically, but you can override it by specifying 'c' on the command line, e.g.:

$ abyss-pe c=2 ...

- Ben
