On 24.02.2015 04:42, Brian Haas wrote:
> Trinity is not compatible with colorspace. David Eccles (CC'd) is
> our local colorspace expert and will likely have some advice for
> you.
[Brian, any chance this can be put onto the website as a FAQ?]
If you're trying to use colour-space at all, my first advice is to run:
run away, or rerun your samples on a base-space system. Colour-space
is *really* confusing, and trying to explain why aligners produce odd
mappings is difficult. I wrote up some information on colour-space on
seqanswers a while back, which you can read if you want to know more:
http://seqanswers.com/forums/showpost.php?p=59156&postcount=4
So, let's assume you've come back from that and still like the idea of
colour-space. Maybe there's a strong financial incentive for that.
Next up is my post to the trinityrnaseq-users group about this:
On 11.10.2012 11:06, David Eccles (gringer) wrote:
> I was trying to shoehorn colourspace into Trinity a while back.
> There are a few issues with this format that make assembly
> considerably more difficult:
>
> 1. You can't treat double-encoded colourspace as base-space and
> expect things to work properly. The reverse of colourspace is the
> reverse of a sequence, rather than reverse-complement. The assembly
> algorithm needs to be modified to accommodate this (*in addition to
> double-encoding*), and you can't distinguish between some important
> codons and/or features. For example, a polyA tail is no different
> from a polyT head, so Trinity may happily join transcripts together
> with a polyA/polyT split.
>
> 2. There are two types of sequencing errors which have different
> outcomes on the determined sequence: Phasing errors will result in
> frame shifts (as per base-space sequencing), but incorrect colour
> addition in a sequence completely changes the interpreted sequence.
> The second error means that any sufficiently long sequence
> (say > 25-30bp) has about a 1 in 4 chance of having the wrong
> encoding at the end of the sequence. This could change stop
> codons to other amino acids, or convert polyA tails to polyG tails,
> for example.
>
> My initial idea was to double-encode sequences, then derive the four
> possible base-space encodings for each sequence in the Inchworm
> output. Unfortunately, because of the stated issues you get
> unnatural chimera assemblies, and the resultant assembled sequences
> aren't particularly useful.
>
> You would be best to avoid colour-space if at all possible. If you
> can't do that, try mapping your sequences to a similar well-assembled
> transcriptome to derive base-space sequences for each read, then
> assemble in Trinity using the derived reads.
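
To make the two issues in that post concrete, here is a minimal Python sketch. It assumes the usual SOLiD-style colour code, where each colour is the XOR of the 2-bit codes (A=0, C=1, G=2, T=3) of two adjacent bases. The function names and the example sequence are made up for illustration and are not part of Trinity or any other tool:

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3

def encode(seq):
    """Base-space to colour-space: one colour per adjacent pair of bases."""
    return "".join(str(CODE[a] ^ CODE[b]) for a, b in zip(seq, seq[1:]))

def decode(first_base, colours):
    """Colour-space to base-space, given the base preceding the first colour."""
    out = [first_base]
    for colour in colours:
        out.append(BASES[CODE[out[-1]] ^ int(colour)])
    return "".join(out)

def revcomp(seq):
    """Reverse complement of a base-space sequence."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))

seq = "ATGGCTAAAA"          # hypothetical transcript end with a short polyA tail
colours = encode(seq)       # "310323000"

# Issue 1: the reverse complement in base-space is a plain reverse in
# colour-space, and every homopolymer encodes to the same run of zeros.
print(encode(revcomp(seq)) == colours[::-1])   # True
print(encode("AAAAAA") == encode("TTTTTT"))    # True

# Issue 2: one wrong colour changes every base decoded after it.
bad = colours[:2] + "2" + colours[3:]          # third colour miscalled as 2
print(decode("A", colours))                    # ATGGCTAAAA
print(decode("A", bad))                        # ATGATCGGGG (polyA became polyG)
```

The second print pair is the "polyA tails to polyG tails" case from the post: the colours after the error are all still correct, but the decoded bases are not.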
Still interested? Then here's another post I wrote about how you might
go about doing it. The first suggestion follows similar lines to how
you assemble high-error long reads (e.g. Nanopore, PacBio): use a
process to correct the reads so that their base-space errors are
similar to those of Illumina reads, then treat the output as Illumina reads:
On 18.10.2013 02:49, David Eccles (gringer) wrote:
> Okay, so it seems your general question is "how do I shoehorn
> colour-space data into an RNA assembly program?"
>
> My first recommendation, which will save you time, pain, and probably
> money, is to re-run your samples using a non-colourspace system. If
> this can't be done, you might still come out better off by telling
> people that assembly of colour-space transcripts is not possible.
>
> The biggest problem I have with colour-space is that two different
> types of errors (base switch errors and sequence errors) are [possibly
> necessarily] represented by a single number. Two additional problems
> cause issues in RNA assembly:
> * base switching due to incorrect reads of colours makes guessing
> amino acids almost impossible
> * poly-A tails cannot be differentiated from poly-T heads (their
> reverse complement), or from any other long homopolymer run, causing
> accidental chimeric transcripts
>
> Considering this, and throwing caution to the wind, there are two
> avenues that *might* work:
>
> 1. Use bowtie (version 1) to map the colour-space reads to your good
> (or not so good) reference assembly. The resulting SAM file will have
> the converted base-space sequence, with any errors due to base
> switching fixed up to match the reference. Extract out that base
> sequence into a new FASTA file, and continue as normal with Trinity.
>
> 2. Double-encode the colourspace data (by trimming off the first base
> then changing 0->A, 1->C, 2->G, 3->T), then run it through the
> Inchworm step as a strand-specific run [you are doing strand-specific
> sequencing, right?]. When that's done, convert the Inchworm
> transcripts back to colourspace, then convert to base space by
> generating the four different base-space sequences possible from each
> colourspace sequence (i.e. append A/C/G/T to the start of the numeric
> sequence). I think this might have been what I got Trinity to do a few
> moons ago. After that, run the transcripts through Chrysalis /
> Butterfly as per normal. If your colour-space sequencer has perfect
> reads, then there'll be no base switching in the middle of the
> colour-space version of the transcripts, so you might get a few good
> transcripts out the other end.
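
For the second avenue, the conversions on either side of Inchworm can be sketched roughly as below. This is only an illustration under stated assumptions: reads are csfasta-style strings (a primer base followed by colours 0-3), the helper names are hypothetical, and dropping the first colour along with the primer base is my assumption (that colour describes the primer-to-read transition, not the transcript), so it is left as a switch:

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3

TO_LETTER = str.maketrans("0123", "ACGT")   # double-encoding
TO_COLOUR = str.maketrans("ACGT", "0123")   # undo double-encoding

def double_encode(cs_read, drop_first_colour=True):
    """csfasta read (primer base + colours) to pseudo-bases for the assembler."""
    colours = cs_read[1:]                   # trim the leading primer base
    if drop_first_colour:                   # assumption, see note above
        colours = colours[1:]
    return colours.translate(TO_LETTER)

def double_decode(pseudo_bases):
    """Assembled pseudo-base contig back to a colour string."""
    return pseudo_bases.translate(TO_COLOUR)

def decode(first_base, colours):
    """Colour string to base-space, given an assumed first base."""
    out = [first_base]
    for colour in colours:
        out.append(BASES[CODE[out[-1]] ^ int(colour)])
    return "".join(out)

def candidate_sequences(colours):
    """The four base-space sequences consistent with one colour string."""
    return {first: decode(first, colours) for first in BASES}

read = "T3310323000"                        # hypothetical colour-space read
pseudo = double_encode(read)                # what would be fed to Inchworm
contig_colours = double_decode(pseudo)      # applied to the Inchworm output
for first, seq in candidate_sequences(contig_colours).items():
    print(first, seq)
```

Only one of the four candidates per contig is the real transcript, and nothing in the colour string itself says which; picking one needs outside evidence such as an open reading frame or a related reference.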
Hope this helps,
--
David Eccles
Bioinformatics Research Analyst, Gringene Bioinformatics
Room 2.10 x857
Malaghan Institute of Medical Research
http://www.malaghan.org.nz