Hello Shaun,
Thanks for the quick reply. From your email, I understand that one way of getting started would be to convert the two pairs of *.csfasta/*.qual files into two FASTQ files, and then convert the FASTQ files into base space in FASTA, before feeding the final pair of FASTA files into ABySS.
So I could simply convert to base space before feeding the reads into ABySS; I have already written a program for other work that does this conversion.
And my planned initial ABySS invocation would look like this:
abyss-pe k=25 n=10 in="B_subtilis_F3.fasta B_subtilis_R3.fasta" name=B_subtilis_DSM10
Does that look reasonable for a first try? (68 million reads, mate pair, from SOLiD 4) I figure I can then try altering k and n and get a better feel of where a sweet spot is.
However, Masa also replied to my email (see his reply below), and he made the point that I might lose a lot of info if I convert to base space before feeding into ABySS:
> The general approach is that it is a really bad idea to convert colorspace to basespace before assembly because of what single color errors do to the rest of the read. Having said that, ABySS does assemble colorspace, but it does not convert the colorspace to nucleotides after it is done. You will need to translate it to all 4 possible nucleotide sequences and then pick the right one once you finish the assembly.
I take his point; I have run into this before. Whenever a read contains an error in color space, translating it to base space forces me to fill in all the remaining bases as unknown “N”s, because there is no way to recover from an unknown in color space. So a lot of information would be thrown away by all the unknown "N"s fed into ABySS after translation to base space, and I would therefore like to keep the input in color space for ABySS.
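For illustration, here is a minimal Python sketch of the colour-to-base translation discussed above. It assumes the usual SOLiD convention (A=0, C=1, G=2, T=3, each colour being the XOR of two adjacent base codes) and a read written as a primer base followed by colour digits; the function name `decode_colorspace` is hypothetical and is not part of ABySS.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; a colour is the XOR of adjacent base codes

def decode_colorspace(read):
    """Translate a primer base plus colours (e.g. 'T0123') into bases.

    The primer base itself is not emitted. Once an unknown colour (anything
    other than 0-3, typically '.') is met, the rest of the read becomes N.
    """
    prev = BASES.index(read[0])
    out = []
    for i, colour in enumerate(read[1:]):
        if colour not in "0123":
            # No way to recover: every remaining position is unknown.
            out.append("N" * (len(read) - 1 - i))
            break
        prev ^= int(colour)
        out.append(BASES[prev])
    return "".join(out)

print(decode_colorspace("T0123"))  # correct read
print(decode_colorspace("T0023"))  # one miscalled colour changes every later base
print(decode_colorspace("T01.3"))  # unknown colour: the remainder becomes N
```

The second call shows the effect Masa describes: a single wrong colour silently alters every base after it, whereas an unknown colour at least marks the damage with Ns.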
But Masa also says something about getting four possible seqs in base space as output from ABySS if I give it color space input, and having to select one of the four somehow. I did not really understand that comment. It sounds complicated, and I am not sure how I would process the ABySS output. Why would the output in color space from ABySS give me multiple choices? Couldn’t I simply convert the output set of contigs in color space to base space contigs? (Of course, if ABySS keeps unknown colors in the output contigs, then I’ll have the same problem as with the short reads: all color-pairs after the unknown color will translate into unknown bases, with no way to recover.) Can you give me some guidance as to what color space output would look like from ABySS, and why there would be choices in translating it into base space? I tried looking for “SOLiD” in the email list archive and did not see anything directly relevant.
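One plausible reading of the "four possible sequences" remark, sketched in Python below: a colour only records the transition between two adjacent bases, so a colour-space contig that carries no known leading base is consistent with four nucleotide sequences, one per choice of first base. The assumption that the contigs are plain strings of colour digits without a primer base is mine and is not confirmed in this thread; `candidate_translations` is a hypothetical helper.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; a colour is the XOR of adjacent base codes

def candidate_translations(colours):
    """Return the four base-space sequences consistent with a colour string.

    `colours` is assumed to contain only the digits 0-3, with no leading base.
    The result maps each possible first base to the sequence it implies.
    """
    results = {}
    for first in BASES:
        code = BASES.index(first)
        seq = [first]
        for colour in colours:
            code ^= int(colour)
            seq.append(BASES[code])
        results[first] = "".join(seq)
    return results

for first, seq in candidate_translations("0123").items():
    print(first, seq)
```

Choosing among the four would need outside evidence, for example checking which candidate agrees with reads decoded from their known primer base.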
One final question on the aligner usage: Masa also says:
> What you are also missing from the command line is aligner=kaligner, as the default does not actually support SOLiD.
From what I read, kaligner is the default aligner for FASTA input, so I should not have to specify using kaligner if I convert to base space in FASTA format first, before using ABySS on the data. Do I have that correct? But if I do stay with color space input, then I should explicitly specify “aligner=kaligner”? Also: I read that there are other possibilities for use as the aligner (map, bwa, bowtie, bwasw). Is kaligner what I should stick with? When do people make other choices?
Cheers,
Ron
Hi Ron,
You’ll have to convert the .csfasta and .qual files to .fastq format.
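As a rough sketch of that conversion in Python, assuming the common layout in which each .csfasta record is a `>name` line followed by a primer base plus colour digits, and the matching .qual record is a `>name` line followed by space-separated integer qualities. The exact FASTQ layout ABySS expects for colour-space reads is not stated here, so keeping the primer base in the sequence line and using Phred+33 qualities are assumptions; `csfasta_qual_to_fastq` is a hypothetical name.

```python
def csfasta_qual_to_fastq(csfasta_lines, qual_lines):
    """Merge parallel .csfasta and .qual records into colour-space FASTQ text."""

    def records(lines):
        # Yield (name, payload) pairs, skipping blank lines and '#' comments.
        name = None
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if line.startswith(">"):
                name = line[1:]
            else:
                yield name, line

    out = []
    for (name, seq), (qname, quals) in zip(records(csfasta_lines), records(qual_lines)):
        if name != qname:
            raise ValueError(f"record mismatch: {name} vs {qname}")
        # Assumption: Phred+33 encoding; negative (missing) qualities become 0.
        qstr = "".join(chr(max(int(q), 0) + 33) for q in quals.split())
        out.append(f"@{name}\n{seq}\n+\n{qstr}\n")
    return "".join(out)

print(csfasta_qual_to_fastq(
    ["# example", ">1_1_1_F3", "T0123"],
    [">1_1_1_F3", "20 30 -1 5"],
), end="")
```

For real files, the two arguments can be open file handles, since the function only iterates over lines.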
The FASTQ files should remain in colour space, but the program that converts the colour-space assembly to nucleotides is unmaintained and no longer works. You’ll have to find some other method to convert the colour-space assembly to nucleotides. Alternatively, you could convert the colour-space reads to nucleotides before assembly as you suggested. ABySS does not use the quality values other than to trim bases from the ends of the reads with quality less than some threshold (q<3 by default).
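To make the quality-trimming behaviour concrete, here is a small Python sketch of trimming low-quality positions from both ends of a read. It illustrates the idea only and is not ABySS's actual implementation; the function name `trim_ends` is hypothetical, and the default threshold of 3 is taken from the "q<3" mentioned above.

```python
def trim_ends(seq, quals, threshold=3):
    """Drop positions from both ends whose quality is below `threshold`.

    `seq` is the read string and `quals` a list of integer qualities of the
    same length. Low-quality positions in the interior are left untouched.
    """
    start, end = 0, len(quals)
    while start < end and quals[start] < threshold:
        start += 1
    while end > start and quals[end - 1] < threshold:
        end -= 1
    return seq[start:end], quals[start:end]

print(trim_ends("ACGTAC", [1, 2, 30, 30, 2, 40]))
```

Note that the low-quality position inside the read is kept, because only the ends are trimmed.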
Cheers,
Shaun