Output files and post-processing before scaffolding

Taj Azarian


Aug 17, 2015, 9:10:52 AM

to Zorro - The masked assembler

Hello,

Can someone point me to better documentation that explains all of the output files? In addition to the ZORRO.contigs file, a number of other files are produced, and I would like to know which of them may help in assessing the merge.

Additionally, is there any post-processing that should be done before scaffolding with SSPACE? For example, should singlets, or contigs shorter than 100 bp, be removed?

I hope some people are still monitoring the Google group!

Thank you!

Gustavo Lacerda


Aug 17, 2015, 10:21:26 AM

to zorro-a...@googlegroups.com

Hi Taj,

Most of the files are intermediate files produced by the Zorro pipeline. If Zorro finishes without errors or warnings, you can safely ignore or delete them. They are mainly useful for debugging a failed Zorro run. In the next version, I will add a parameter to clean up those files after Zorro finishes successfully.

Zorro writes the final assembly to <prefix>.ZORRO.fasta. To assess the merge, look at this file. It contains three kinds of contigs:

->HybridContigs are contigs that were formed using contigs from both input assemblies. Even if the original input contigs contain repetitive regions, there was enough overlap in unique (non-repeat) regions to merge them unambiguously. HybridContigs are the most reliable contigs. If both assemblies are of good quality, each of them covers most of the genome, and the genome is not ultra-repetitive, HybridContigs will be large and contain most of the assembly bases. That is what a good merge looks like.

->RepHybridContigs are contigs that were formed using contigs from both input assemblies, but the overlaps between the input contigs are in repetitive regions. These contigs are suspicious and could be misassembled.

->Singlets are contigs that appear in only one of the input assemblies. If assembly1 and assembly2 were produced using different sequencing technologies, each will potentially have a different coverage bias, so each assembly could cover parts of the genome that the other one missed. Singlets could also be the result of errors in the input assemblies.

The IDs of the sequences in <prefix>.ZORRO.fasta can be used to distinguish HybridContigs, RepHybridContigs and Singlets. I usually remove contigs shorter than 300 bp because, empirically, I have found that many of them are misassembled and cause problems for scaffolding tools. However, it's up to you to decide whether to discard short contigs and which length cutoff to use.
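As a rough illustration of that filtering step, here is a minimal Python sketch. It assumes that the category name (HybridContig, RepHybridContig or Singlet) appears somewhere in each sequence ID; I have not spelled out the exact ID format here, so check your own <prefix>.ZORRO.fasta and adjust the matching if needed. The function names and the 300 bp default are only for illustration and are not part of Zorro.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def classify(header):
    """Guess the contig category from its ID.

    Assumption: the category name is a substring of the ID.
    RepHybridContig is tested first because "HybridContig" is a
    substring of "RepHybridContig".
    """
    if "RepHybridContig" in header:
        return "RepHybridContig"
    if "HybridContig" in header:
        return "HybridContig"
    if "Singlet" in header:
        return "Singlet"
    return "Unknown"


def filter_and_summarize(records, min_len=300):
    """Drop contigs shorter than min_len and count the rest by category."""
    kept = []
    counts = Counter()
    for header, seq in records:
        if len(seq) < min_len:
            continue
        kept.append((header, seq))
        counts[classify(header)] += 1
    return kept, counts
```

To use it, open the FASTA file and pass the file handle to read_fasta, then pass the result to filter_and_summarize. The counts give a quick view of how much of the filtered assembly falls into each category, and the kept list can be written back out as the input for REAPR, PILON or the scaffolder.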

After running Zorro, and before scaffolding, I suggest running REAPR or PILON on the Zorro contigs. These tools align the input reads against the assembly and look for misassemblies and suspect regions. These regions can then be broken, corrected or masked.

After that, you can run your scaffolding tool.

After scaffolding, I suggest using a gap-closing tool (if your scaffolding tool hasn't already done this). After that, check and correct your assembly with either REAPR or PILON again.

I hope this message is useful to you and other Zorro users until Zorro is better documented.

Best regards,

Gustavo Gilson Lacerda Costa

Bioinformatician at State University of Campinas, Brazil


Taj Azarian


Aug 17, 2015, 11:23:48 AM

to Zorro - The masked assembler

Gustavo,

This is tremendously helpful. Also, I should have provided some background. I have two de novo assemblies of a Streptococcus pneumoniae genome that were created with Velvet and SPAdes. They both assembled well. For example, after filtering, Velvet produced 21 contigs with an N50 of 318226 and an average contig length of 99181. I subsampled the raw reads to 15x and used Zorro to merge the assemblies, followed by SSPACE to scaffold and GapFiller to close gaps. I then ordered the contigs against a related reference genome using Mauve Contig Mover. Next, I was planning to look at the Sanger PAGIT tools to close the genome as far as possible, as my main goal is to create a reference for reference-based assembly of closely related isolates. I will take a look at REAPR and PILON as you advised and hope that this moves me closer to my goal. Any additional advice is greatly appreciated!