How to use colorspace data in Trinity? Or what is the best way to use SOLiD data in Trinity?

Leandro de Mattos

Feb 23, 2015, 10:36:58 AM

to trinityrn...@googlegroups.com

Dear bioinformatics colleagues,

I would like to know how to run Trinity on SOLiD data (colorspace.fasta). Can I use this format directly? If not, which program is recommended for converting it to base space?

Thanks for any help,

Leandro

Brian Haas

Feb 23, 2015, 10:42:06 AM

to Leandro de Mattos, trinityrn...@googlegroups.com, David Eccles

Hi Leandro,

Trinity is not compatible with colorspace. David Eccles (CC'd) is our local colorspace expert and will likely have some advice for you.

best,

~brian

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

Leandro de Mattos

Feb 23, 2015, 11:15:40 AM

to Brian Haas, trinityrn...@googlegroups.com, David Eccles

Thanks Dr. Brian,

Best,

Leandro.

David Eccles (gringer)

Feb 23, 2015, 1:24:17 PM

to Leandro de Mattos, trinityrn...@googlegroups.com

On 24.02.2015 04:42, Brian Haas wrote:
> Trinity is not compatible with colorspace. David Eccles (CC'd) is
> our local colorspace expert and will likely have some advice for
> you.

[Brian, any chance this can be put onto the website as a FAQ?]

If you're trying to use colour-space at all, my first advice is to run:
run away, or rerun your samples on a base-space system. Colour-space
is *really* confusing, and trying to explain why aligners produce odd
mappings is difficult. I wrote up some information on colour-space on
seqanswers a while back, which you can read if you want to know more:

http://seqanswers.com/forums/showpost.php?p=59156&postcount=4

So, let's assume you've come back from that and still like the idea of
colour-space. Maybe there's a strong financial incentive for that.
Next up is my post to the trinityrnaseq-users group about this:

On 11.10.2012 11:06, David Eccles (gringer) wrote:
> I was trying to shoehorn colourspace into Trinity a while back.
> There are a few issues with this format that make assembly
> considerably more difficult:
>
> 1. You can't treat double-encoded colourspace as base-space and
> expect things to work properly. The reverse of colourspace is the
> reverse of a sequence, rather than reverse-complement. The assembly
> algorithm needs to be modified to accommodate this (*in addition to
> double-encoding*), and you can't distinguish between some important
> codons and/or features. For example, a polyA tail is no different
> from a polyT head, so Trinity may happily join transcripts together
> with a polyA/polyT split.
>
> 2. There are two types of sequencing errors which have different
> outcomes on the determined sequence: Phasing errors will result in
> frame shifts (as per base-space sequencing), but incorrect colour
> addition in a sequence completely changes the interpreted sequence.
> The second error means that any sufficiently long sequence
> (say > 25-30bp) has about a 1 in 4 chance of having the wrong
> encoding at the end of the sequence. This could change stop
> codons to other amino acids, or convert polyA tails to polyG tails,
> for example.
>
> My initial idea was to double-encode sequences, then derive the four
> possible base-space encodings for each sequence in the Inchworm
> output. Unfortunately, because of the stated issues you get
> unnatural chimera assemblies, and the resultant assembled sequences
> aren't particularly useful.
>
> You would be best to avoid colour-space if at all possible. If you
> can't do that, try mapping your sequences to a similar well-assembled
> transcriptome to derive base-space sequences for each read, then
> assemble in Trinity using the derived reads.
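
The mapping route in that last paragraph can be sketched in a few lines of Python. This is only an illustration, not code from the original post: it assumes the aligner writes standard tab-separated SAM with the reference-corrected base-space read in the SEQ column, and the names sam_to_fasta and write_fasta are made up for the example.

```python
import io

# Complement table used to restore the original read orientation.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_to_fasta(sam_lines):
    """Yield (read_name, sequence) for each mapped primary alignment.

    Assumes standard SAM columns: QNAME is column 1, FLAG is column 2,
    SEQ is column 10. Whether SEQ really holds corrected base-space for
    colour-space reads depends on the aligner (assumption).
    """
    for line in sam_lines:
        if line.startswith("@"):               # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:                   # malformed record
            continue
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue                           # unmapped, secondary, supplementary
        if seq == "*":                         # no sequence stored
            continue
        if flag & 0x10:                        # aligned to the reverse strand
            seq = seq.translate(_COMPLEMENT)[::-1]
        yield name, seq


def write_fasta(records, handle):
    """Write (name, sequence) pairs as two-line FASTA records."""
    for name, seq in records:
        handle.write(f">{name}\n{seq}\n")


# Hypothetical three-line SAM input, just to show the call pattern.
demo = [
    "@HD\tVN:1.0\n",
    "read1\t0\tref\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read2\t16\tref\t5\t255\t4M\t*\t0\t0\tAACC\tIIII\n",
]
out = io.StringIO()
write_fasta(sam_to_fasta(demo), out)
print(out.getvalue(), end="")
```

SAM stores reverse-strand alignments reverse-complemented, so the sketch flips them back to keep the FASTA in the original read orientation, which is probably what you want for a strand-specific run.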

Still interested? Then here's another post I wrote about how you might
go about doing it. The first suggestion follows similar lines to how
you assemble high-error long reads (e.g. Nanopore, PacBio), which is to
use a process to correct the reads so that their errors in base-space
are similar to Illumina reads, and then treat the output as Illumina reads:

On 18.10.2013 02:49, David Eccles (gringer) wrote:
> Okay, so it seems your general question is "how do I shoehorn
> colour-space data into an RNA assembly program?"
>
> My first recommendation, which will save you time, pain, and probably
> money, is to re-run your samples using a non-colourspace system. If
> this can't be done, you might still come out better off by telling
> people that assembly of colour-space transcripts is not possible.
>
> The biggest problem I have with colour-space is that two different
> types of errors (base switch errors and sequence errors) are [possibly
> necessarily] represented by a single number. Two additional problems
> cause issues in RNA assembly:
> * base switching due to incorrect reads of colours makes guessing
> amino acids almost impossible
> * poly-A tails cannot be differentiated from poly-T heads (reverse
> complement), or any other long monomer, causing accidental chimeric
> transcripts
>
> Considering this, and throwing caution to the wind, there are two
> avenues that *might* work:
>
> 1. Use bowtie (version 1) to map the colour-space reads to your good
> (or not so good) reference assembly. The resulting SAM file will have
> the converted base-space sequence, with any errors due to base
> switching fixed up to match the reference. Extract out that base
> sequence into a new FASTA file, and continue as normal with Trinity.
>
> 2. Double-encode the colourspace data (by trimming off the first base
> then changing 0->A, 1->C, 2->G, 3->T), then run it through the
> Inchworm step as a strand-specific run [you are doing strand-specific
> sequencing, right?]. When that's done, convert the Inchworm
> transcripts back to colourspace, then convert to base space by
> generating the four different base-space sequences possible from each
> colourspace sequence (i.e. append A/C/G/T to the start of the numeric
> sequence). I think this might have been what I got Trinity to do a few
> moons ago. After that, run the transcripts through Chrysalis /
> Butterfly as per normal. If your colour-space sequencer has perfect
> reads, then there'll be no base switching in the middle of the
> colour-space version of the transcripts, so you might get a few good
> transcripts out the other end.
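
The double-encoding and four-way decoding in option 2 can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code that was actually run back then: it assumes the usual SOLiD colour digits 0-3 (mapped 0->A, 1->C, 2->G, 3->T), reads that start with one primer base and contain no missing colours ('.'), and the function names are hypothetical.

```python
# Assumed SOLiD convention: colours are the digits 0-3 and each read
# starts with one known primer base, e.g. "T0120".
DOUBLE_ENCODE = str.maketrans("0123", "ACGT")
DOUBLE_DECODE = str.maketrans("ACGT", "0123")
BASES = "ACGT"


def double_encode(cs_read):
    """Drop the leading primer base and write the colours as pseudo-bases."""
    return cs_read[1:].translate(DOUBLE_ENCODE)


def decode(first_base, colours):
    """Walk a colour string from a chosen first base into base-space."""
    seq = [first_base]
    for c in colours:
        # With A, C, G, T numbered 0-3, each colour is the XOR of the
        # codes of two neighbouring bases, so XOR again gives the next base.
        seq.append(BASES[BASES.index(seq[-1]) ^ int(c)])
    return "".join(seq)


def candidate_decodings(double_encoded_contig):
    """Return the four base-space sequences one double-encoded contig allows."""
    colours = double_encoded_contig.translate(DOUBLE_DECODE)
    return [decode(b, colours) for b in BASES]


print(double_encode("T0120"))        # pseudo-base read for the assembler
print(candidate_decodings("ACGA"))   # one candidate per possible first base
```

Only one of the four candidates (or its complement partner) is the real transcript, and a single wrong colour inside the contig makes all four wrong from that point onwards.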

Hope this helps,

--
David Eccles
Bioinformatics Research Analyst, Gringene Bioinformatics
Room 2.10 x857
Malaghan Institute of Medical Research
http://www.malaghan.org.nz

Brian Haas

Feb 23, 2015, 4:10:55 PM

to David Eccles (gringer), Leandro de Mattos, trinityrn...@googlegroups.com

Good idea David. I'll see if I can link directly to this post in the google group.

many thx!

~b

David Eccles (gringer)

Feb 25, 2015, 1:21:19 PM

to trinityrn...@googlegroups.com

On 26.02.2015 03:17, Leandro de Mattos wrote:
> Dear David,
> Please, I think that you can help me. I have a question: can I use the
> Velvet assembler script solid_denovo_preprocessor.pl to obtain input for Trinity?
> Can I use the file doubleEncoded_input.de for running Trinity?

Probably, but you're going to get lots of weird chimeras from homopolymer sequences joining together, and any consistent errors will
completely alter the base-space representation of the sequence. Your assembly is only likely to be useful in colour-space for mapping other
colour-space reads.
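
Both effects are easy to reproduce in a small Python sketch. It assumes the standard two-base colour code (A, C, G, T numbered 0-3, each colour being the XOR of two neighbouring bases); the function names and the example sequence are made up for illustration.

```python
BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def encode(seq):
    """Base-space -> colour digits, one colour per adjacent base pair."""
    codes = [BASES.index(b) for b in seq]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def decode(first_base, colours):
    """Colour digits -> base-space, starting from a known first base."""
    seq = [first_base]
    for c in colours:
        seq.append(BASES[BASES.index(seq[-1]) ^ int(c)])
    return "".join(seq)


def revcomp(seq):
    """Reverse-complement of a base-space sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# 1. Homopolymers collapse: a polyA tail and a polyT head look identical.
print(encode("AAAAAA"), encode("TTTTTT"))           # 00000 00000

# 2. Reversing the colours matches both the reversed sequence and the
#    reverse-complement, so strand information is lost.
seq = "ATGGCA"                                      # hypothetical example
print(encode(seq)[::-1], encode(seq[::-1]), encode(revcomp(seq)))

# 3. One wrong colour changes every base after it.
good = encode(seq)                                  # "31031"
bad = good[:2] + "1" + good[3:]                     # third colour misread
print(decode("A", good), decode("A", bad))          # ATGGCA ATGTAC
```

The first two cases are where the chimeras come from, and the third is why a contig built from double-encoded reads is hard to trust once it is translated back to base-space.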

- David
