Hi Paritt. Thank you for testing Zorro.
1. Is there a limit/recommended range on the k-mer size?
Zorro is an assembly merger. It takes pre-assembled contigs as input (for example, you can use Newbler, MIRA or another assembler for the 454 reads, and Velvet, SOAPdenovo, ABySS or another assembler for the Illumina reads). The reads file that Zorro requires is used only for AB INITIO REPEAT DETECTION, not for assembly. That is why we recommend subsampling the reads. The k-mer size in Zorro therefore has a completely different meaning and does not need to be optimized (as it does in most assembly software). The k-mer only has to be large enough that a given k-length word is not expected to occur in the genome by chance. k=22 is enough for E. coli, since 4^22 (about 1.8 x 10^13) is much larger than the roughly 5 Mb genome.
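As a rough illustration of that argument (this is not code from Zorro, and the function name is made up for this example), the Python sketch below estimates how often one fixed k-mer would be expected to appear by chance. It assumes a genome of uniformly random, independent bases, which real genomes are not, so treat the numbers as an order-of-magnitude guide only.

```python
def expected_chance_occurrences(k: int, genome_size: int) -> float:
    """Expected chance hits of one fixed k-mer in a random genome.

    Assumption (not from Zorro): bases are uniform and independent, so each
    of the ~genome_size positions matches with probability 1 / 4**k.
    """
    return genome_size / 4 ** k


if __name__ == "__main__":
    genome_size = 5_000_000  # roughly E. coli, as in the text above
    for k in (10, 12, 16, 22):
        hits = expected_chance_occurrences(k, genome_size)
        print(f"k={k:2d}  4^k={4 ** k:>16,d}  expected chance hits={hits:.3g}")
```

With small k the expected number of chance hits is above one, so almost every word would look repeated; at k=22 it is far below one, so a k-mer seen many times in the reads is more plausibly a real repeat.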
2. Illumina data is typically a lot more than 10x so do I need to sample out reads from illumina data beforehand?
Not required, but subsampling will speed things up and use fewer resources. Because the reads are not used for assembly (only for ab initio repeat detection), we do not need all of them. Since you have both 454 and Illumina data, in your case you could use only 10x coverage of 454 reads, for example (454 typically provides less biased coverage of the genome).
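If it helps, here is a minimal Python sketch of how one might pick the fraction of reads to keep for a target coverage. It is not part of Zorro: the function names, the 10x default and the in-memory list of reads are assumptions made for illustration, and a real script would stream a FASTA/FASTQ file instead.

```python
import random


def sampling_fraction(total_read_bases: int, genome_size: int,
                      target_coverage: float = 10.0) -> float:
    """Fraction of reads to keep so that coverage is about target_coverage."""
    current_coverage = total_read_bases / genome_size
    # Never ask for more reads than are available.
    return min(1.0, target_coverage / current_coverage)


def subsample_reads(reads, fraction: float, seed: int = 0):
    """Keep each read independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return [read for read in reads if rng.random() < fraction]


if __name__ == "__main__":
    # Hypothetical numbers: 500 Mb of reads on a 5 Mb genome is 100x coverage.
    fraction = sampling_fraction(500_000_000, 5_000_000, target_coverage=10.0)
    reads = ["ACGT" * 25] * 1000  # placeholder reads, 100 bases each
    kept = subsample_reads(reads, fraction)
    print(f"keeping a fraction of {fraction:.2f}: {len(kept)} of {len(reads)} reads")
```

Random sampling like this keeps the coverage roughly even along the genome, which is what matters for repeat detection.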
I will update the documentation to clarify those aspects.
Best regards,
Gustavo
--
Gustavo Gilson Lacerda Costa
Bioinformatician at State University of Campinas (UNICAMP)
Work:(19)3521-6651 Cell:(19)9243-1559 Skype:gustavo.unicamp
www.researcherid.com/rid/B-6312-2009