Hi Lee,
I took your suggestion: I (i) combined the 454 contigs
with the filtered Illumina reads and indexed them, (ii)
filtered the merged dataset without the k-mer check, (iii)
ran fm-merge, and (iv) ran the usual steps after that.
Everything worked smoothly.
The hybrid assembly is the expected size :)
The hybrid assembly appears to be more complete than either
of my two single-platform assemblies (454/Newbler and
Illumina/SGA). To compare the single-platform and hybrid
assemblies, I used BLAT to align both sets of
single-platform contigs as queries against the hybrid
assembly (scaffolds + unplaced) as reference. I examined
random scaffolds from the hybrid assembly and looked at the
BLAT alignments of the single-platform contigs to them. The
hybrid filled in gaps in both single-platform assemblies,
especially gaps in the 454/Newbler assembly.
The hybrid isn't "missing" parts of the genome that the
single-platform assemblies had found. Nearly every
single-platform contig aligned at full length to the hybrid
assembly. I took the longest BLAT alignment of each
single-platform contig to the hybrid reference. The total
number of bases in those longest alignments makes up 98.0%
and 97.5% of the 454/Newbler and Illumina/SGA assemblies,
respectively.
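
In case it is useful, here is a rough sketch of how that
longest-alignment fraction could be computed from BLAT's PSL
output. It is not the exact script I ran. The function name
and the optional assembly_size argument are made up for
illustration, and it assumes that "bases in an alignment"
means the aligned query span (qEnd - qStart); counting the
PSL "matches" column instead would be an equally reasonable
choice.

```python
def longest_alignment_fraction(psl_lines, assembly_size=None):
    """Fraction of query bases covered by each query's longest alignment.

    psl_lines: iterable of lines from a BLAT PSL file (header lines are skipped).
    assembly_size: total bases in the query assembly (hypothetical parameter).
        If omitted, only queries that appear in the PSL are counted in the
        denominator, which overestimates the fraction when some contigs
        have no alignment at all.
    """
    best = {}  # qName -> (longest aligned query span, qSize)
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # A PSL alignment row has 21 tab-separated columns and starts with
        # the integer "matches" column; anything else is header text.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        q_name = fields[9]
        q_size = int(fields[10])
        span = int(fields[12]) - int(fields[11])  # qEnd - qStart
        if q_name not in best or span > best[q_name][0]:
            best[q_name] = (span, q_size)

    aligned = sum(span for span, _ in best.values())
    total = assembly_size if assembly_size is not None else sum(
        q_size for _, q_size in best.values()
    )
    return aligned / total if total else 0.0
```

Passing the full assembly size as the denominator matters
when some contigs have no alignment at all, since those
contigs never appear in the PSL file.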
To compare contiguity, I calculated the NG50s of all three
assemblies. (NG50 is N50 computed against a fixed genome
size, which I set to the expected genome size. The three
assemblies have different sizes, so comparing their N50s
would not be fair. See the Assemblathon 1 paper
(http://genome.cshlp.org/content/early/2011/09/16/gr.126599.111.abstract)
for more details on NG50.) Contig NG50s for the hybrid,
454/Newbler, and Illumina/SGA assemblies were 92909, 77353,
and 18383 bp, respectively.
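
For anyone who wants to reproduce the comparison, here is a
minimal sketch of an NG50 calculation. The function name and
inputs are my own choices for illustration: contig lengths
as a list of integers, plus the expected genome size in bp.

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which the cumulative sum of contig lengths,
    taken from longest to shortest, first reaches half of genome_size.

    Returns 0 if the assembly covers less than half of genome_size.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


def n50(contig_lengths):
    """Ordinary N50: NG50 with the assembly's own total size as genome size."""
    return ng50(contig_lengths, sum(contig_lengths))
```

For example, with contigs of 10, 8, 5, 3 and 2 bp, the N50
is 8, but the NG50 against a 40 bp genome is 5, because the
assembly is smaller than the genome.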
At least in terms of contiguity, completeness and agreement
with the single-platform assemblies, the hybrid assembly
looks good!
d f