Question about small and/or microbial genomes and --trans_gtf (in Launch_PASA_pipeline.pl)

Kris Alavattam

Dec 11, 2022, 2:05:25 PM

to pasapipeline-users

Hi Brian and community,

Apologies for the off-topic nature of this question.

I'd like to pass a .gtf from StringTie to --trans_gtf when building a comprehensive transcriptome database, i.e., when running Launch_PASA_pipeline.pl.

For transcriptome assembly, I'm working with S. cerevisiae Illumina RNA-seq data. I'd like to know whether, in your or anyone else's experience, StringTie should be called with altered parameters for organisms with small and/or microbial genomes. For example, I know that altered parameters are recommended when running genome-free and genome-guided Trinity on such genomes. I've had no response from the StringTie developers so far, and my scans of the literature show that some authors have used default parameters in this or similar contexts.

Or do you perhaps recommend against using StringTie with small and/or microbial genomes, as you do for Cufflinks in that context?

Thanks—any input will be appreciated,
Kris

Brian Haas

Dec 11, 2022, 2:32:15 PM

to Kris Alavattam, pasapipeline-users

Hi Kris,

The main danger here is generating fusion transcripts from overlapping
transcripts (UTRs, mostly). If the data are strand-specific and you
can run stringtie in strand-specific mode, it could be fine. (If it's
not strand-specific, it'll be trouble w/ compact genomes). I'd just
suggest looking at your stringtie results first in IGV and/or running some
analyses like cuffcompare (or whatever the new version is called) to
compare your stringtie gtf to the reference gene structure annotation
and assess the level of fusion transcripts generated.
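
One way to approximate the fusion check described above is to flag each assembled transcript that overlaps two or more reference genes on its own strand. This is a minimal sketch, not a replacement for IGV or for cuffcompare (or its successor, gffcompare). The function names and toy coordinates are hypothetical; real input would be parsed from the StringTie GTF and the reference annotation, and transcripts with an undetermined strand would need separate handling.

```python
from collections import namedtuple

# Minimal genomic feature: 1-based, inclusive coordinates.
Feature = namedtuple("Feature", "name chrom start end strand")

def overlaps(a, b):
    """True if two features share at least one base on the same chromosome and strand."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)

def fusion_candidates(transcripts, genes):
    """Map transcript name -> sorted reference gene names, keeping only
    transcripts that overlap two or more reference genes."""
    hits = {}
    for t in transcripts:
        names = sorted(g.name for g in genes if overlaps(t, g))
        if len(names) >= 2:
            hits[t.name] = names
    return hits

# Toy data (hypothetical names and coordinates).
genes = [
    Feature("geneA", "chrI", 100, 900, "+"),
    Feature("geneB", "chrI", 950, 1800, "+"),
    Feature("geneC", "chrI", 1700, 2500, "-"),
]
transcripts = [
    Feature("STRG.1.1", "chrI", 90, 1750, "+"),    # spans geneA and geneB
    Feature("STRG.2.1", "chrI", 1000, 1790, "+"),  # geneB only
    Feature("STRG.3.1", "chrI", 1650, 2400, "-"),  # geneC only
]

print(fusion_candidates(transcripts, genes))
```

The fraction of assembled transcripts that land in the returned dictionary gives a rough sense of how much fusion the assembler introduced.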

If things look good, you could use PASA to merge everything, but use
the option to require sufficient overlap among alignments to assemble
to again mitigate the neighboring fusion transcript issue.

hope this helps,

~b

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

Kris Alavattam

Dec 11, 2022, 3:40:29 PM

to pasapipeline-users

Thanks, Brian—yes, that's very helpful. When you mention "the option to require sufficient overlap among alignments to assemble to again mitigate the neighboring fusion transcript issue," you mean adjusting the --stringent_alignment_overlap parameter when running Launch_PASA_pipeline.pl—is that correct?

In the trial experiments I'm running with PASA, I'm using .fasta files from genome-guided and genome-free Trinity (but nothing from StringTie/Cufflinks/etc. yet), which makes me wonder whether it would help to raise --stringent_alignment_overlap from 30.0 to something higher. (Currently, I'm calling Launch_PASA_pipeline.pl with --stringent_alignment_overlap 30.0, following the advice here.) If I understand correctly, a higher percentage for --stringent_alignment_overlap would mitigate the false fusion transcripts that arise when working with data from small, gene-dense genomes such as S. cerevisiae. Is that right?

A little experimental context could be helpful here: We're working with an S. cerevisiae knock-out model that increases global antisense transcription, and we want to accurately identify these ncRNA transcripts and use the custom annotations in downstream analyses. In our work so far, we see a lot of both fusion and (apparently) fragmentary transcripts. Do you think that adjusting the value for --stringent_alignment_overlap could be useful in this context? Or is leaving --stringent_alignment_overlap at 30.0 reasonable? This also reminds me of the --gene_overlap option available in Launch_PASA_pipeline.pl (which should be called together with the -L flag and --annots_gff3 option). Could calling Launch_PASA_pipeline.pl with --gene_overlap set to some value be useful in this context?

Thanks! And thanks for these great programs and documentation,
Kris

Brian Haas

Dec 11, 2022, 7:11:23 PM

to Kris Alavattam, pasapipeline-users

Hi Kris,

Responses below:

On Sun, Dec 11, 2022 at 3:40 PM Kris Alavattam <kalav...@gmail.com> wrote:
>
> Thanks, Brian—yes, that's very helpful. When you mention "the option to require sufficient overlap among alignments to assemble to again mitigate the neighboring fusion transcript issue," you mean adjusting the --stringent_alignment_overlap parameter when running Launch_PASA_pipeline.pl—is that correct?

Yes, that's right.

>
> In the trial experiments I'm running with PASA, I'm using .fasta files from genome-guided and genome-free Trinity (but nothing from StringTie/Cufflinks/etc. yet), which makes me wonder whether it would help to raise --stringent_alignment_overlap from 30.0 to something higher. (Currently, I'm calling Launch_PASA_pipeline.pl with --stringent_alignment_overlap 30.0, following the advice here.) If I understand correctly, a higher percentage for --stringent_alignment_overlap would mitigate the false fusion transcripts that arise when working with data from small, gene-dense genomes such as S. cerevisiae. Is that right?
>

It'll mitigate PASA contributing more to it, for sure, but it won't
address the problem for those cases where the input transcripts are
already fused. The 30% is probably fine.
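
To illustrate why a minimum-overlap requirement limits new merges but cannot repair input that is already fused, here is a minimal sketch. It assumes the overlap is measured as a percentage of the shorter alignment; that definition, like the function names and coordinates, is an assumption for illustration and may not match PASA's internal calculation.

```python
def percent_overlap(a, b):
    """Overlap between two (start, end) spans (1-based, inclusive),
    as a percentage of the shorter span. Assumed definition, for illustration."""
    shared = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if shared <= 0:
        return 0.0
    shorter = min(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return 100.0 * shared / shorter

def may_assemble(a, b, min_pct=30.0):
    """True if the two spans overlap by at least min_pct percent."""
    return percent_overlap(a, b) >= min_pct

# Hypothetical neighbors whose UTRs overlap by 50 bases.
gene1 = (1, 1000)
gene2 = (951, 1950)
# Hypothetical input transcript that already spans both genes.
fused = (1, 1950)

print(percent_overlap(gene1, gene2), may_assemble(gene1, gene2))
print(may_assemble(fused, gene1), may_assemble(fused, gene2))
```

Under these assumptions the two neighbors fall below the threshold and stay separate, while the pre-fused transcript overlaps each of them completely, so no threshold setting would split it.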

> A little experimental context could be helpful here: We're working with an S. cerevisiae knock-out model that increases global antisense transcription, and we want to accurately identify these ncRNA transcripts and use the custom annotations in downstream analyses. In our work so far, we see a lot of both fusion and (apparently) fragmentary transcripts. Do you think that adjusting the value for --stringent_alignment_overlap could be useful in this context? Or is leaving --stringent_alignment_overlap at 30.0 reasonable? This also reminds me of the --gene_overlap option available in Launch_PASA_pipeline.pl (which should be called together with the -L flag and --annots_gff3 option). Could calling Launch_PASA_pipeline.pl with --gene_overlap set to some value be useful in this context?

Given the high quality of the reference annotations for S. cerevisiae,
using --gene_overlap is easily justified.
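
Pulling together the options named in this thread, here is a hedged sketch of how the call might be assembled. Only --trans_gtf, --stringent_alignment_overlap, --gene_overlap, -L, and --annots_gff3 come from the discussion. The other flags (-c, -C, -R, -g, -t, --ALIGNERS), all file names, and the 50.0 value for --gene_overlap are assumptions to check against the usage message of your installed PASA version.

```python
import shlex

def pasa_command(genome, transcripts, trans_gtf, annots_gff3,
                 stringent_overlap=30.0, gene_overlap=50.0):
    """Build an argument list for Launch_PASA_pipeline.pl (not executed here)."""
    return [
        "Launch_PASA_pipeline.pl",
        "-c", "alignAssembly.config",   # assumed config file name
        "-C", "-R",                     # assumed: create database, run alignment/assembly
        "-g", genome,
        "-t", transcripts,
        "--ALIGNERS", "blat,gmap",      # assumed aligner choice
        "--trans_gtf", trans_gtf,       # StringTie output, as discussed above
        "--stringent_alignment_overlap", str(stringent_overlap),
        "-L", "--annots_gff3", annots_gff3,
        "--gene_overlap", str(gene_overlap),
    ]

# Hypothetical file names.
cmd = pasa_command("genome.fasta", "trinity.fasta", "stringtie.gtf", "annots.gff3")
print(shlex.join(cmd))
```

Printing the command rather than running it makes it easy to review the exact options before launching a long job.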

If I remember correctly, there aren't many introns in S. cerevisiae.
For antisense transcripts to be properly identified as such, you'd need
to carefully take into account the transcribed orientation based on the
aligned orientation, with Trinity run in strand-specific mode to ensure
proper transcript orientation during reconstruction. Just something to
be aware of, but you've probably already dealt with this given that
you're already deep into the process.
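
The orientation point can be made concrete with a small sketch. With few introns, splice-site signals are rarely available to infer orientation, so the strand has to come from the strand-specific library. Once each transcript carries a trustworthy strand, it is antisense to a gene when the two overlap on the same chromosome but on opposite strands. The function name, labels, and coordinates below are hypothetical.

```python
def relation_to_gene(tx, gene):
    """Classify a transcript against a gene.
    Both arguments are (chrom, start, end, strand) with 1-based, inclusive coordinates."""
    t_chrom, t_start, t_end, t_strand = tx
    g_chrom, g_start, g_end, g_strand = gene
    if t_chrom != g_chrom or t_end < g_start or g_end < t_start:
        return "no_overlap"
    if t_strand not in ("+", "-") or g_strand not in ("+", "-"):
        return "unknown_strand"   # orientation cannot be judged without a strand
    return "sense" if t_strand == g_strand else "antisense"

# Hypothetical gene and transcripts.
gene = ("chrII", 5000, 6500, "+")
print(relation_to_gene(("chrII", 5200, 6400, "-"), gene))
print(relation_to_gene(("chrII", 5200, 6400, "+"), gene))
print(relation_to_gene(("chrII", 5200, 6400, "."), gene))
```

The "unknown_strand" case is the one that unstranded data would produce, which is why strand-specific assembly matters for calling antisense transcripts.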

Hope this helps,

~b

