newbie question - first ABySS run on mate pair SOLiD data

Ron Taylor

May 23, 2012, 7:43:08 PM

to ABySS, sjac...@bcgsc.ca, ronald...@pnnl.gov

Hello Shaun, everybody,

I spoke to Shaun a couple months ago about using ABySS. Shaun was
encouraging and said that ABySS would work on our SOLiD data, so -
computer support here has now got the parallelized version of ABySS
1.3.2 installed on a big Linux cluster, and I'm about to do my first
run, on SOLiD data for an assembly of a strain of B. subtilis - a
bacterium about 4 megabases in size.

However, when I read the documentation, I don't see instructions on
how to enter the two *.csfasta/*.qual file pairs on the command line.
I have four input files (instead of the one or two used in the ABySS
documentation examples):

[rtaylor@olympus B_subtilis_DSM10_Mate_Pair_June_2011]$ ls -lh B_sub*
-rw-r--r-- 1 rtaylor denovo 42G May 22 16:59 B_subtilis_DSM10_2011_05_26_F3.csfasta
-rw-r--r-- 1 rtaylor denovo 93G May 22 17:41 B_subtilis_DSM10_2011_05_26_QV_F3.qual
-rw-r--r-- 1 rtaylor denovo 94G Mar 6 13:01 B_subtilis_DSM10_2011_05_26_QV_R3.qual
-rw-r--r-- 1 rtaylor denovo 42G Mar 6 12:51 B_subtilis_DSM10_2011_05_26_R3.csfasta

Do I enter a command like this:

abyss-pe k=25 n=10 in="<all four file names>" name=B_subtilis_DSM10 <ret>

I thought Shaun mentioned on the phone that *.csfasta and *.qual files
are acceptable as input, but I don't see any examples in the
documentation. So - wanted to double-check before starting a huge run
on the cluster.

Or must I convert each *.csfasta/*.qual file pair to a single FASTQ
file first, and then list the two FASTQ files on the command line?

Also: the FASTQ files would remain in color space, like the original
SOLiD output files. Is that acceptable, if indeed I should be using
FASTQ formatted files? If color space is not acceptable, then I can
convert the FASTQ files to base space, but I'll have to use an
artificial qual score for all the bases - any recommendations there
for use in ABySS?

Cheers,
Ron Taylor

Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory (U.S. Dept of Energy/Battelle)
Richland, WA 99352
phone: (509) 372-6568
email: ronald...@pnnl.gov

Shaun Jackman

May 25, 2012, 6:52:18 PM

to Ron Taylor, ABySS, ronald...@pnnl.gov

Hi Ron,

You’ll have to convert the .csfasta and .qual files to .fastq format.
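For readers who want to script that conversion, here is a minimal Python sketch, not an official ABySS or SOLiD tool. It assumes the two files list reads in the same order, one data line per record, with `#` header comments, and that the .qual file holds space-separated integers where -1 marks an uncalled colour. The function names are made up for illustration. The output keeps the primer base in the sequence line and writes one quality character per colour, which is the layout of the example read quoted later in this thread; other tools may expect a different layout.

```python
from typing import Iterable, Iterator, Tuple


def read_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, data) pairs from .csfasta or .qual lines."""
    read_id = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#' header comments
        if line.startswith(">"):
            read_id = line[1:]
        else:
            yield read_id, line  # assumes one data line per record


def to_csfastq(csfasta: Iterable[str], qual: Iterable[str]) -> Iterator[str]:
    """Merge matching .csfasta/.qual records into colour-space FASTQ text."""
    pairs = zip(read_records(csfasta), read_records(qual))
    for (seq_id, seq), (qual_id, quals) in pairs:
        if seq_id != qual_id:
            raise ValueError(f"record order differs: {seq_id} vs {qual_id}")
        # Phred+33 encoding; an uncalled colour (-1, assumed) is clamped to 0.
        encoded = "".join(chr(max(int(q), 0) + 33) for q in quals.split())
        yield f"@{seq_id}\n{seq}\n+\n{encoded}\n"
```

With files this large, the two inputs would be passed as open file handles so that the records are streamed rather than loaded into memory.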

The FASTQ files should remain in colour space, but the program that converts the colour-space assembly to nucleotides is unmaintained and no longer works. You’ll have to find some other method to convert the colour-space assembly to nucleotides. Alternatively, you could convert the colour-space reads to nucleotides before assembly as you suggested. ABySS uses the quality values only to trim bases whose quality is below a threshold (q<3 by default) from the ends of the reads.

Cheers,
Shaun

Ronald Taylor

May 25, 2012, 8:55:18 PM

to Shaun Jackman, ABySS, ronald...@pnnl.gov, Ronald Taylor

Hello Shaun,

Thanks for the quick reply. From your email, I understand that one way of getting started would be to convert the two pairs of *.csfasta/*.qual files into two FASTQ files, and then convert the FASTQ files into base space in FASTA, before feeding the final pair of FASTA files into ABySS.

So – I could just convert to base space before feeding into ABySS. I've written a program for other work that already does this.

And my planned initial ABySS invocation would look like this:

abyss-pe k=25 n=10 in="B_subtilis_F3.fasta B_subtilis_R3.fasta" name=B_subtilis_DSM10 <ret>

Does that look reasonable for a first try? (68 million reads, mate pair, from SOLiD 4) I figure I can then try altering k and n and get a better feel of where a sweet spot is.

However, Masa also replied to my email (see his reply below), and he made the point that I might lose a lot of info if I convert to base space before feeding into ABySS:

> The general approach is that it is a really bad idea to convert colorspace to basespace before assembly because of what single color errors do to the rest of the read. Having said that, ABySS does assemble colorspace, but it does not convert the colorspace to nucleotides after it is done. You will need to translate it to all 4 possible nucleotide sequences and then pick the right one once you finish the assembly.

I take his point – I have run into this before. Every time there is an error in color space, when doing a translation from color to base space I have to fill in all the remaining bases as unknown “N”s – no way to recover from an unknown in color space. Yeah, it sounds like a lot of information would get thrown away by all the unknown "N"s that would be fed into ABySS after translation to base space. And therefore I would like to keep input in color space for ABySS.
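The effect described above can be sketched in a few lines of Python. This is an illustration, not the Perl script mentioned in the thread. It assumes the standard SOLiD two-base encoding, in which each colour is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3), and it drops the primer base from the output; the function name is hypothetical.

```python
BASES = "ACGT"  # index of each base is its 2-bit code: A=0, C=1, G=2, T=3


def colour_to_base(read: str) -> str:
    """Translate a colour-space read (primer base + colours) to base space.

    Once a colour is unknown, every later base is reported as 'N',
    because each base is decoded relative to the one before it.
    """
    previous = BASES.index(read[0])  # primer base, not part of the output
    out = []
    for colour in read[1:]:
        if previous is None or colour not in "0123":
            previous = None  # context lost from here to the end of the read
            out.append("N")
        else:
            previous ^= int(colour)  # colour = XOR of adjacent base codes
            out.append(BASES[previous])
    return "".join(out)
```

For example, `colour_to_base("T0123")` gives `TGAT`, while a single unknown colour in `T01.23` turns the last three bases into `N`.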

But Masa also says something about getting four possible seqs in base space as output from ABySS if I give it color space input, and having to select one of the four somehow. I did not really understand that comment. It sounds complicated, and I am not sure how I would process the ABySS output. Why would the output in color space from ABySS give me multiple choices? Couldn’t I simply convert the output set of contigs in color space to base space contigs? (Of course, if ABySS keeps unknown colors in the output contigs, then I’ll have the same problem as with the short reads – all color-pairs after the unknown color will translate into unknown bases – no way to recover.) Can you give me some guidance as to what color space output would look like from ABySS and why there would be choices in translating it into base space? I tried looking for “SOLiD” in the email list archive and did not see anything directly pertinent.

One final question on the aligner usage: Masa also says:

> What you are also missing from the command line is aligner=kaligner, as the default does not actually support SOLiD.

From what I read, kaligner is the default aligner for FASTA input, so I should not have to specify using kaligner if I convert to base space in FASTA format first, before using ABySS on the data. Do I have that correct? But if I do stay with color space input, then I should explicitly specify “aligner=kaligner”? Also: I read that there are other possibilities for use as the aligner (map, bwa, bowtie, bwasw). Is kaligner what I should stick with? When do people make other choices?

Cheers,

Ron

---------- Forwarded message ----------
From: maša <masa.mi...@gmail.com>
Date: Fri, May 25, 2012 at 1:37 AM
Subject: Re: newbie question - first ABySS run on mate pair SOLiD data
To: Ron Taylor <ronald....@gmail.com>

Hi Ron,

I have recently started using ABySS for mate-pair SOLiD data. I will try to help you from my own experience.

Yes, the csfasta/qual file should be converted to fastq format, or better said perhaps csfastq format. There is a Galaxy tool which does this quite well.

You should leave the sequence in colorspace with the primer base. I.e. "T12234..."

The general approach is that it is a really bad idea to convert colorspace to basespace before assembly because of what single color errors do to the rest of the read. Having said that, ABySS does assemble colorspace, but it does not convert the colorspace to nucleotides after it is done. You will need to translate it to all 4 possible nucleotide sequences and then pick the right one once you finish the assembly. The conversion is what the cs=1 switch in the command line was supposed to do, but it wasn't working last I tried.

What you are also missing from the command line is aligner=kaligner, as the default does not actually support SOLiD.

These were already all on the mailing list. You can perhaps search the google group with "colorspace" or "SOLiD" as there is quite a bit of knowledge here already. Consider using saet for example before assembly, etc.

Cheers,

masa

Shaun Jackman

May 28, 2012, 1:49:41 PM

to Ronald Taylor, ABySS, ronald...@pnnl.gov

Hi Ron,

Your command line looks good.

A colour-space contig can be translated to four possible nucleotide contigs. For example:
>foo
00000
could be either AAAAAA, CCCCCC, GGGGGG or TTTTTT.

To decide which one, you can either map the reads to the colour-space contigs and call a nucleotide consensus at each position. Or, you can translate the colour-space contigs to all four possible nucleotide-space contigs and map them to a reference genome (same species or a closely-related relative) and keep the nucleotide sequences that align. The latter is the easiest if you have a reference.
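The four candidate translations described above can be enumerated with a short Python sketch. It assumes the standard SOLiD two-base encoding (colour = XOR of adjacent 2-bit base codes, A=0, C=1, G=2, T=3) and a contig made only of the colours 0 to 3; the function name is made up for illustration and is not part of ABySS.

```python
from typing import List

BASES = "ACGT"  # index of each base is its 2-bit code: A=0, C=1, G=2, T=3


def four_translations(colours: str) -> List[str]:
    """Return the four nucleotide sequences consistent with a colour contig.

    A contig of n colours has no primer base, so each of the four possible
    first bases yields one candidate sequence of n + 1 nucleotides.
    """
    translations = []
    for first in BASES:
        current = BASES.index(first)
        seq = [first]
        for colour in colours:
            current ^= int(colour)  # step to the next base via the colour
            seq.append(BASES[current])
        translations.append("".join(seq))
    return translations
```

Calling `four_translations("00000")` returns `AAAAAA`, `CCCCCC`, `GGGGGG` and `TTTTTT`, matching the example; each candidate can then be written out and mapped to a reference as suggested.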

If you assemble nucleotides, you can use the default (aligner=map); if you assemble colour-space, use aligner=kaligner.

Cheers,
Shaun

Ronald Taylor

May 28, 2012, 5:23:19 PM

to Shaun Jackman, ABySS, ronald...@pnnl.gov, Ronald Taylor

Shaun,

Thanks for the info. I get how I would have to map the base translation. I guess I was simply hoping I'd get a deterministic output for a contig from ABySS. That is, the FASTQ reads from the SOLiD software all are given starting "T"s, so there is only one possible translation to base space (presuming there are no bad colors in the read).

Here is one SOLiD read in color space, in FASTQ format:

@2_25_106
T..01...000..20.2320233..113..031...110....00...013
+
!!76!!!%-&!!/%!2%%'%1,!!*.)!!*')!!!&/)!!!!%(!!!&%-

So - was hoping the ABySS software might carry that deterministic start through somehow as it builds its contigs. Yeah, maybe I was unrealistic - but one can hope.

Anyway, no worries. I just did a FASTQ color space to FASTA base space conversion using my Perl script and found that I lose very little info. Over the 76 million 50-base-long reads, only 4.0% of the reads have at least one bad color, and hence are candidates for discarding. If I allow up to five "N"s in the translated base sequence - that is, if the bad color appears toward the end of the color seq, so it does not affect many base calls - then that decreases the number of reads I have to discard to 3.9% of the total, adding 0.1% of useful reads. Not a big change - most bad colors show up fairly early in the reads. Either way, I'm simply going to convert to base space before feeding the reads into ABySS, since information loss is minimal (at least in these runs). I'm eager to get a run done and then start trying different ABySS param settings, see how that changes my contig set.
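A tally like the one above could be reproduced with something along these lines. This is only a sketch in Python rather than the Perl script mentioned; the function name and the `max_n` parameter are hypothetical, and it assumes the reads have already been translated to base space with unknown bases written as "N".

```python
from typing import Iterable, Tuple


def count_discarded(reads: Iterable[str], max_n: int = 0) -> Tuple[int, int]:
    """Return (discarded, total) for reads with more than max_n 'N' bases."""
    discarded = 0
    total = 0
    for read in reads:
        total += 1
        if read.count("N") > max_n:
            discarded += 1
    return discarded, total
```

Running it once with `max_n=0` and once with `max_n=5` gives the two discard counts being compared.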

Again, thanks for the help on a holiday weekend.

Ron

Shaun Jackman

May 28, 2012, 5:34:10 PM

to Ronald Taylor, ABySS, ronald...@pnnl.gov

Hi Ronald,

No worries at all. I’m happy to help.

You may want to consider correcting errors in the reads before assembly using the SOLiD Accuracy Enhancer Tool (SAET):
http://solidsoftwaretools.com/gf/project/saet/

Unfortunately, this web site seems to be down now. I’m not sure where it has moved to.

Cheers,
Shaun

Ronald Taylor

May 28, 2012, 5:54:33 PM

to Shaun Jackman, ABySS, ronald...@pnnl.gov, Ronald Taylor

Shaun,

Yes, Masa mentioned SAET, too. I will look into it with our Life Tech support person. You are right, the site seems to have disappeared. Life Tech has been making some changes, what with the switch from BioScope software to LifeScope. There is a chapter (chap 16) in the LifeScope manual on using SAET. So - might simply be an integrated part of the LifeScope install now. I'll look into it.

Ron

Ronald Taylor

May 28, 2012, 8:28:27 PM

to Shaun Jackman, ABySS, ronald...@pnnl.gov, Ronald Taylor

Shaun,

Another question just came to mind on the translation to base space before entry of the reads into ABySS:

If I discard those reads having what I consider too many Ns after the translation is done, then for a run like my current mate pair run, with an F3 FASTA file and an R3 FASTA file, the lengths of the two files will not match. I am sure that there will be some reads in, say, the F3 file that I discard whose corresponding R3 reads (with the same FASTQ id) will do better (perhaps no bad colors) and will stay in. So - the two files will get out of sync. The rows won't match up - the two reads for a given ID will occur at different locations in the two files, instead of on the same rows. And some reads in one file will miss a mate pair match in the other, since its corresponding read got discarded due to a bad color.

Can ABySS handle this? Or should I simply leave all the reads in the output, but with some of the reads that I create in base space having a lot of Ns in their sequences?

Ron

Shaun Jackman

May 28, 2012, 8:58:05 PM

to Ronald Taylor, ABySS, ronald...@pnnl.gov

Hi Ron,

ABySS can handle reads with no mates, but if there’s a lot of them, it can slow down the assembly. It’s better to include the mate. You can trim it back though, and if it’s shorter than k bp, ABySS won’t use it.
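One way to follow that advice is to trim each read at its first unknown base rather than discard it, so that the F3 and R3 files keep the same records in the same order. The Python sketch below illustrates the idea under the assumption that both files are already in matching order and that reads are held as (id, sequence) pairs; the function names are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (read id, base-space sequence)


def trim_at_first_n(seq: str) -> str:
    """Keep only the bases before the first unknown base."""
    cut = seq.find("N")
    return seq if cut == -1 else seq[:cut]


def trim_pairs(
    f3: Iterable[Record], r3: Iterable[Record]
) -> Iterator[Tuple[Record, Record]]:
    """Trim both mates of each pair while keeping the two files in step."""
    for (id1, seq1), (id2, seq2) in zip(f3, r3):
        # Neither mate is dropped, so record i in F3 still pairs with
        # record i in R3 after trimming.
        yield (id1, trim_at_first_n(seq1)), (id2, trim_at_first_n(seq2))
```

A read whose first base is unknown trims to an empty string; how best to write such a record is not covered in this thread, so that case is left open here.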

Cheers,
Shaun
