hybrid Illumina and 454 assembly

d f

Jun 12, 2012, 4:08:16 PM

to sga-users

A group at Cofactor Genomics studied different methods of de novo
assembly combining Illumina and 454 reads. They concluded that, for
genomes > 10 Mb, it is better to first pre-assemble the 454 reads, and
then combine the 454 pre-assembled fragments with the Illumina reads
for the final assembly.

For details on their work, see their website and accompanying poster
presented at AGBT 2012:

http://www.cofactorgenomics.com/blog/2012/strategies-de-novo-assembly-genomes-and-transcriptomes-using-combined-illumina-and-roche-4

Do you have any thoughts on whether this might work with SGA?

Since 454 contigs have more indel errors due to homopolymers, would
this be a problem for SGA assembly?

If I were to include pre-assembled 454 fragments (or contigs) in a
hybrid assembly, where would they get included in the SGA pipeline?
With the merged Illumina reads (the output of sga fm-merge,
merged.k$CK.fa), prior to the final indexing, computing the string
graph structure, and assembly? Or elsewhere in the pipeline?

Thanks for any thoughts you might have on this!

d f

Jared Simpson

Jun 13, 2012, 8:27:37 AM

to sga-...@googlegroups.com

Hello d f,

Thanks for the link to the poster; it is an interesting idea. It may work with SGA, but it's hard for me to predict how well it will go. If you are interested in giving it a shot, here are my thoughts.

I suggest you introduce the 454 contigs before the sga-filter stage. You can either concatenate the Illumina reads file and the 454 contig file and then index the result, or build an index of the Illumina reads and an index of the 454 contigs separately and then merge the two indices. I would do the latter.

By introducing the 454 contigs before sga-filter, most of the Illumina reads will be eliminated by sga-filter's substring check, vastly reducing the size of the data. Ideally, you will be left with just the original 454 contigs and the Illumina reads that are in the gaps between 454 contigs. You might be able to use sga-overlap to directly build the graph from this data set instead of fm-merge/index/overlap.
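To make the substring check concrete, here is a toy sketch in Python. It is not how SGA implements the check (SGA works on an FM-index); the function names are made up for illustration, and it only handles exact containment on either strand.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T, N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def drop_contained(seqs):
    """Keep only sequences that are not exact substrings of another
    sequence on either strand. Exact duplicates collapse to one copy."""
    kept = []
    unique = list(dict.fromkeys(seqs))  # dedupe, preserving input order
    # Longest first: anything contained in a dropped sequence is also
    # contained in the longer sequence that caused the drop.
    for s in sorted(unique, key=len, reverse=True):
        rc = reverse_complement(s)
        if any(s in k or rc in k for k in kept):
            continue
        kept.append(s)
    return kept


contig = "ACGTACGTTAGC"                 # stands in for a 454 contig
reads = ["GTACG", "GCTAAC", "TTTTGGGG"]  # stand in for Illumina reads
survivors = drop_contained([contig] + reads)
```

In this toy input the first two reads lie inside the contig (the second on the reverse strand), so only the contig and the read from outside it survive, which is the "contigs plus gap reads" outcome described above.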

I think the biggest problem will be errors in the 454 contigs. Ideally, they will form bubbles in the graph which will be resolved by sga-assemble. I am not sure how well this will work. If most 454 contigs end due to coverage breaks, the ends of the contigs might have poor-quality sequence. You might consider trimming the ends of the 454 contigs (with quality scores if possible, otherwise just get rid of the last X bp). You could also try to correct the 454 contigs using the Illumina reads but this would take some work.
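If quality scores are not available, the "get rid of the last X bp" option could be sketched as below. This is a minimal Python sketch; the helper names, the trim length, and the minimum-length cutoff are all assumptions to adjust for your data.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_ends(records, trim_bp, min_len):
    """Remove trim_bp bases from both ends of every contig and drop
    contigs that end up shorter than min_len."""
    for header, seq in records:
        trimmed = seq[trim_bp:len(seq) - trim_bp]
        if len(trimmed) >= min_len:
            yield header, trimmed


example = [">c1", "AAACCC", "GGGTTT", ">c2", "ACGT"]
trimmed_contigs = list(trim_ends(read_fasta(example), trim_bp=3, min_len=4))
```

For real data you would pass an open file handle to `read_fasta` and write the surviving records back out as FASTA.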

Finally, you might want to try the "-r" parameter of sga-assemble. Since you have long contigs in the graph, you can use this parameter to help resolve short repeats using longer overlaps. I suggest starting with "-r 20".

I hope that helps. Let me know how it goes or if you have more questions.

Jared

d f

Jul 11, 2012, 4:32:49 PM

to sga-users

Hi Jared,

Here's the status on adding 454/Newbler contigs to create a hybrid
454-Illumina assembly.

I quality-trimmed the 454 contigs before combining with the Illumina
reads, but I didn't use the Illumina reads to correct them. (I would
have tried that if the preliminary hybrid assembly, below, were
promising.)

When I added the trimmed 454 contigs before the filter step, a lot of
the reads were filtered out. The discard fasta file was about 40x
larger in size than the filter.pass fasta file. (In comparison, with
Illumina only, discard was 6x larger than filter.pass.) About 40% of
the 454 contigs were discarded, even when I set the --kmer-threshold
to 1. But that still left a lot of 454 contigs.

Upon assembly, retaining only contigs >= 200bp, the total size of the
hybrid genome was about 1/3 the size of the Illumina-only assembly
(and much smaller than the expected size of this genome). The contig
N50 was also an order of magnitude or two smaller (depending on
assembly parameters) than that of the Illumina-only assembly. Most of
the hybrid assembly consisted of extremely short contigs, < 200bp.

-----

I also tried adding the 454 contigs before the fm-merge step, but sga
fm-merge caused a core dump.

So creating a 454/Illumina hybrid assembly didn't work out as well as
I had hoped.

d f

Lee Mendelowitz

Jul 11, 2012, 6:03:55 PM

to sga-...@googlegroups.com

Hi d f,

My guess is that running the k-mer filter on the 454 contigs and the Illumina reads together hurt your assembly in two ways: the filter threw out a good number of your 454 contigs, and it forced you to choose a low k-mer threshold, which allowed some Illumina reads with errors into the assembly graph. It only takes one untrusted k-mer to throw out a read, which is why it is pretty easy for a long 454 contig to get thrown out.
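A toy Python sketch of why length hurts here. It is not SGA's implementation: it ignores strand, the names are made up, and the values of k and the per-k-mer failure rate are purely illustrative. The second function assumes k-mers fail independently, which is a simplification.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping k-mers of seq (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def passes_kmer_check(seq, counts, k, threshold):
    """A sequence passes only if every one of its k-mers is seen at
    least `threshold` times; a single rare k-mer rejects it."""
    return all(counts[km] >= threshold for km in kmers(seq, k))


def pass_probability(length, k, p_bad):
    """Chance that none of the (length - k + 1) k-mers is untrusted,
    assuming each fails independently with probability p_bad."""
    return (1.0 - p_bad) ** max(length - k + 1, 0)


reads = ["ACGTAC", "ACGTAC", "CGTACG", "CGTACG"]
counts = Counter(km for r in reads for km in kmers(r, 4))

good_contig = "ACGTACG"    # agrees with the reads
indel_contig = "ACGTTACG"  # one extra T, as in a homopolymer error

read_p = pass_probability(100, 27, 0.001)
contig_p = pass_probability(5000, 27, 0.001)
```

With these inputs `good_contig` passes at threshold 2 and `indel_contig` does not, and `contig_p` comes out far smaller than `read_p` because the contig has thousands of k-mers that all have to pass.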

One strategy might be to filter only the Illumina reads with the k-mer check, then merge the index of 454 contigs with the filtered Illumina reads, and then filter the merged dataset without the k-mer check to remove any contained Illumina reads ('sga filter --no-kmer-check'). It sounds like this may be similar to what you tried when you got the core dump. The "sga fm-merge --help" output says that you cannot have duplicates in your read set before calling fm-merge; maybe this was the source of your crash. Filtering with '--no-kmer-check' will take care of this.
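The order of operations in this strategy can be written down as a small Python sketch that only builds the command strings and runs nothing. It uses the concatenate-then-index route rather than merging two indices. The file names and the ".filter.pass.fa" output naming are assumptions, preprocessing and error correction are left out, and the options for each step should be taken from "sga <subcommand> --help" for your SGA version.

```python
import os


def hybrid_plan(illumina="illumina.fa", contigs="454.trimmed.fa",
                combined="hybrid.fa"):
    """Return the shell commands for the hybrid route, in order.
    Nothing is executed; this only documents the sequence of steps."""

    def passed(path):
        # Assumed naming of the filter output; check what your version writes.
        return os.path.splitext(path)[0] + ".filter.pass.fa"

    return [
        f"sga index {illumina}",
        f"sga filter {illumina}",                  # k-mer check on Illumina only
        f"cat {passed(illumina)} {contigs} > {combined}",
        f"sga index {combined}",
        f"sga filter --no-kmer-check {combined}",  # duplicates/substrings only
        f"sga fm-merge {passed(combined)}",
    ]


plan = hybrid_plan()
```

The point of the ordering is that the 454 contigs never go through the k-mer check, while the second filter pass still removes duplicates and contained reads before fm-merge.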

Best,

Lee

d f

Aug 3, 2012, 5:15:39 PM

to sga-...@googlegroups.com

Hi Lee,

I took your suggestion: I (i) combined the 454 contigs
with the filtered Illumina reads and indexed, (ii) filtered
the merged dataset without the k-mer check, (iii) ran
fm-merge, and (iv) ran the usual steps after that. Everything
worked smoothly.

The hybrid assembly is the expected size :) .

The hybrid assembly appears to be more complete than either
of my two single-platform assemblies (454/Newbler and
Illumina/SGA). To compare the single-platform vs hybrid
assemblies, I BLAT aligned both sets of single-platform
contigs as queries against the hybrid assembly (scaffolds +
unplaced) as reference. Examining random scaffolds from the
hybrid assembly and looking at the BLAT alignments of the
single-platform contigs, the hybrid filled in gaps in both
single-platform assemblies, especially gaps in the
454/Newbler assembly.

The hybrid isn't "missing" parts of the genome that the
single-platform assemblies had found. Nearly every
single-platform contig aligned at full length to the hybrid
assembly. I took the longest BLAT alignment of each
single-platform contig to the hybrid reference. The total
number of bases in those longest alignments makes up 98.0%
and 97.5% of the 454/Newbler and Illumina/SGA assemblies,
respectively.
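For readers who want to repeat this kind of check, the percentage could be computed along these lines. This is a Python sketch with made-up names; it assumes the alignments have already been parsed into (contig name, aligned bases) pairs, and how "aligned bases" is derived from the BLAT output is left open.

```python
def longest_alignment_fraction(contig_lengths, alignments):
    """Fraction of all contig bases covered by each contig's single
    longest alignment.

    contig_lengths: dict mapping contig name -> contig length
    alignments: iterable of (contig name, aligned bases) pairs
    """
    longest = {}
    for name, aligned in alignments:
        longest[name] = max(longest.get(name, 0), aligned)
    total = sum(contig_lengths.values())
    covered = sum(min(longest.get(name, 0), length)
                  for name, length in contig_lengths.items())
    return covered / total


lengths = {"a": 100, "b": 50}
hits = [("a", 100), ("a", 40), ("b", 47)]
fraction = longest_alignment_fraction(lengths, hits)
```

Contigs with no alignment at all count as zero covered bases, so they pull the fraction down rather than being skipped.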

To compare contiguity, I calculated the NG50s of all three
assemblies. (NG50 is N50 with a fixed genome size, which I
set to the expected genome size. All three assemblies have
different sizes, so it is not fair to compare their N50s.
See the Assemblathon 1 paper
(http://genome.cshlp.org/content/early/2011/09/16/gr.126599.111.abstract)
for more details on NG50.) Contig NG50s for the hybrid,
454/Newbler, and Illumina/SGA assemblies were 92909, 77353,
and 18383 bp, respectively.
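The NG50 calculation itself is short. Below is a minimal Python sketch, with the function name chosen here for illustration: sort contigs from longest to shortest and report the length at which the running total first reaches half of the fixed genome size.

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which the cumulative length, counted
    from the longest contig down, first reaches half of genome_size.
    Returns 0 if the assembly covers less than half the genome size."""
    target = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


lengths = [50, 40, 30, 20, 10]
n50_value = ng50(lengths, sum(lengths))  # N50: genome size = assembly size
ng50_value = ng50(lengths, 200)          # NG50 with a fixed genome size
```

Passing the assembly's own total length gives the ordinary N50, which is why the two numbers diverge when an assembly is smaller or larger than the expected genome.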

At least in terms of contiguity, completeness and agreement
with the single-platform assemblies, the hybrid assembly
looks good!

d f

Jared Simpson

Aug 3, 2012, 5:28:25 PM

to sga-...@googlegroups.com

Thanks for the great report; this will be very useful for other people trying hybrid assembly.

Jared
