[Important] STAR-Fusion Questions

Yogindra Raghav

May 14, 2021, 12:18:16 PM

to STAR-Fusion

Hello STAR-Fusion Authors,

My name is Yogi Raghav.

I'm a computational technician in the lab of Professor Ernest Fraenkel at MIT & the Broad Institute.

I came across your wonderful tool, STAR-Fusion, for detecting gene fusion events with transcriptional data.

I'm working with Professors and graduate students at MIT, Harvard Medical School and McGill University on understanding gene fusion events in a specific disease using post-mortem transcriptional data.

Our project involves thousands of samples from multiple brain regions and has high-impact potential.

We ran some tests using the STAR-Fusion tool with over a dozen samples and had quite a few questions regarding the tool and its output.

Your answers to these questions are greatly appreciated!

What’s an example of when to use --max_sensitivity vs --full_Monty?
- What output differences should we expect between the two flags?
What do the following output columns mean and how are they useful?
- est_J
- est_S
The annotation filter is supposed to remove red herring fusion predictions.
- Why is it that I still have fusions with annotation "GTEx_recurrent_StarF2019" when those are supposed red herrings?
- Are those false positives for cancer samples, or are they false positives for having a real fusion event? (https://github.com/FusionAnnotator/CTAT_HumanFusionLib/wiki#red-herrings-fusion-pairs-that-may-not-be-relevant-to-cancer-and-potential-false-positives)
I didn’t find documentation to explain some values in the “annot” column. Can you please clarify what these mean (including examples for how to interpret the numbers)?
- "LOCAL_REARRANGEMENT:+/-:[some number]"
- "NEIGHBORS[some number]"
- "NEIGHBORS_OVERLAP:+/-:+/-:[some number]"
What exactly is the difference between INTERCHROMOSOMAL[chromosome and some Mb] and LOCAL_REARRANGEMENT:+/-:[some number]?
We have 1000s of Disease/CTRL samples and we hope to understand differential transcripts based on brain region + disease status.
- Are there any specific flags you recommend based on our project description?
- Are there recommendations for how to conduct downstream analyses with the abridged TSVs?

Brian Haas

May 14, 2021, 12:53:20 PM

to Yogindra Raghav, STAR-Fusion

Hi,

responses below

What’s an example of when to use --max_sensitivity vs --full_Monty?
What output differences should we expect between the two flags?

You would use --max_sensitivity in order to capture lowly expressed fusions having minimal evidence for a fusion call.

The --full_Monty is 'dirty' and not generally recommended, but is a last resort to see if there's any evidence for fusions. The false positive rate is expected to be quite high here as lots of artifacts will be included. It's there only as a desperate measure to explore potential fusions in the data.

What do the following output columns mean and how are they useful?
est_J
est_S

These are 'estimated' read counts for junction and split reads, taking into account multiple-mappings and multiple candidate fusion isoforms where read evidence is shared among them (involving read-to-fusion-isoform mapping uncertainty). Fusion expression levels (FFPM) are based on these estimated values.

The annotation filter is supposed to remove red herring fusion predictions.
Why is it that I still have fusions with annotation "GTEx_recurrent_StarF2019" when those are supposed red herrings?
Are those false positives for cancer samples, or are they false positives for having a real fusion event? (https://github.com/FusionAnnotator/CTAT_HumanFusionLib/wiki#red-herrings-fusion-pairs-that-may-not-be-relevant-to-cancer-and-potential-false-positives)

If you're not using the --full_Monty option and haven't otherwise enabled --no_annotation_filter, then GTEx fusions are supposed to be excluded. If that's not happening and you're using the latest STAR-Fusion code, let me know what your command is and I'll investigate it further. Note that old versions of STAR-Fusion didn't do this.

I didn’t find documentation to explain some values in the “annot” column. Can you please clarify what these mean (including examples for how to interpret the numbers)?
"LOCAL_REARRANGEMENT:+/-:[some number]"
"NEIGHBORS[some number]"
"NEIGHBORS_OVERLAP:+/-:+/-:[some number]"

The number should represent the distance between the gene pairs on the genome.

The NEIGHBORS annotation means that they're in close proximity on the same chromosome. The +/- indicates the transcribed strand for each gene on the genome in their respective order in the pairing.

LOCAL_REARRANGEMENT means that some form of local restructuring of the genome would be required for the genes to be fused in the proposed fusion context. Cis-splicing of neighboring genes is not an option for these.

What exactly is the difference between INTERCHROMOSOMAL[chromosome and some Mb] and LOCAL_REARRANGEMENT:+/-:[some number]?

LOCAL_REARRANGEMENT is on the same chromosome and involves neighboring genes.

INTERCHROMOSOMAL means the candidate fusion genes are on different chromosomes.
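
For anyone who wants to pull these tags apart programmatically, here is a minimal Python sketch. It assumes the tags look like the strings quoted in this thread (for example `INTRACHROMOSOMAL[chr19:0.01Mb]` and `LOCAL_REARRANGEMENT:+:[6864]`). The regular expressions and the `parse_annots` name are illustrative only and are not part of STAR-Fusion or FusionAnnotator; tags with a different layout are simply skipped.

```python
import re

# Patterns inferred from the example strings in this thread (an assumption);
# other tag layouts may exist and are ignored by this sketch.
CHROM_TAG = re.compile(r"^(INTRACHROMOSOMAL|INTERCHROMOSOMAL)\[(.+)\]$")
LOCAL_TAG = re.compile(
    r"^(LOCAL_REARRANGEMENT|NEIGHBORS_OVERLAP|NEIGHBORS)((?::[+-])*):?\[(\d+)\]$"
)


def parse_annots(tags):
    """Turn a list of annotation strings into a dict keyed by tag name."""
    parsed = {}
    for tag in tags:
        m = LOCAL_TAG.match(tag)
        if m:
            name, strands, dist = m.groups()
            parsed[name] = {
                # ":+:-" becomes ["+", "-"]; an absent strand part becomes []
                "strands": [s for s in strands.split(":") if s],
                "distance_bp": int(dist),
            }
            continue
        m = CHROM_TAG.match(tag)
        if m:
            parsed[m.group(1)] = {"detail": m.group(2)}
    return parsed


# Example tags copied from a later message in this thread.
example = ["INTRACHROMOSOMAL[chr19:0.01Mb]", "LOCAL_REARRANGEMENT:+:[6864]"]
result = parse_annots(example)
```

Here `result["LOCAL_REARRANGEMENT"]["distance_bp"]` holds the bracketed number as an integer, which makes it easier to compare against coordinates looked up elsewhere.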

We have 1000s of Disease/CTRL samples and we hope to understand differential transcripts based on brain region + disease status.
Are there any specific flags you recommend based on our project description?

The default parameters are recommended for general use.

Are there recommendations for how to conduct downstream analyses with the abridged TSVs?

We don't have cohort-level analytics integrated, but I'd suggest starting with generating a matrix of (fusion x sample) with FFPM values as the data, then following up with clustered heatmaps and PCA or t-SNE-style plots.
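
A rough Python sketch of the (fusion x sample) matrix idea follows. It assumes each abridged TSV has `#FusionName` and `FFPM` columns, which should be checked against your own output header. The function names, the choice to keep the maximum FFPM when a fusion appears on several rows, and the use of 0.0 for fusions not reported in a sample are all decisions made for this illustration, not STAR-Fusion behavior.

```python
import csv
import io


def read_ffpm(handle):
    """Return {fusion_name: FFPM} from one abridged TSV (an open text handle)."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    calls = {}
    for row in reader:
        name = row["#FusionName"]
        ffpm = float(row["FFPM"])
        # If a fusion is listed on several rows, keep the largest FFPM
        # (a choice made for this sketch).
        calls[name] = max(ffpm, calls.get(name, 0.0))
    return calls


def build_matrix(per_sample):
    """per_sample: {sample_id: {fusion: FFPM}} -> (fusions, samples, rows)."""
    samples = sorted(per_sample)
    fusions = sorted({f for calls in per_sample.values() for f in calls})
    # A fusion not reported for a sample is recorded as 0.0.
    rows = [[per_sample[s].get(f, 0.0) for s in samples] for f in fusions]
    return fusions, samples, rows


# Toy inputs standing in for two samples' abridged TSVs (made-up values).
sample_a = io.StringIO("#FusionName\tFFPM\nGENE1--GENE2\t0.5\n")
sample_b = io.StringIO(
    "#FusionName\tFFPM\nGENE1--GENE2\t1.5\nGENE3--GENE4\t0.2\n"
)
per_sample = {"sampleA": read_ffpm(sample_a), "sampleB": read_ffpm(sample_b)}
fusions, samples, matrix = build_matrix(per_sample)
```

Each row of `matrix` corresponds to one entry of `fusions` and each column to one entry of `samples`, so the result can be handed to whatever clustering or PCA tooling you prefer.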

hope this helps!

~brian

--
You received this message because you are subscribed to the Google Groups "STAR-Fusion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to star-fusion...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/star-fusion/e7bdd7ec-f578-46ff-a058-d6701b577ad7n%40googlegroups.com.

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

Yogindra Raghav

May 14, 2021, 3:59:16 PM

to STAR-Fusion

Thanks for the unbelievably quick reply! :)

With est_J/est_S, the thing that was confusing is that it seemed redundant since there are the "JunctionReadCount" and "SpanningFragCount" columns.

Apologies for not clarifying that.

1. What is the difference between "JunctionReadCount" and "est_J" (same for "est_S" and "SpanningFragCount")?

2. With regards to the "red herring" stuff, I used the latest trinityctat/starfusion image which uses STAR-Fusion version 1.10.0

The command I've used with that version is:

STAR-Fusion --left_fq {fastq_R1} --right_fq {fastq_R2} --genome_lib_dir {/path/to/genome/} --CPU 8 --output_dir {out_dir} --FusionInspector validate

Can you let me know why I would be getting these "false positives"?

Are they only false positives in the cancer space or would they be "false positives" in all disease spaces?

NOTE: I've attached a sample abridged TSV output file from STAR-Fusion as an example.

3. Outside of using more CPUs, are there any specific flags or options that will help with increasing speed of processing the large amount of data we have?

4. I meant to compare LOCAL_REARRANGEMENT with INTRACHROMOSOMAL but you've answered my question by explaining LOCAL_REARRANGEMENT.

5. I'm debating whether the "--FusionInspector validate" flag is useful or not. Do you think it would be in my case? Better yet, are there specific use cases for this flag?

6. Since we are looking at Transcriptomics data, how can we detect whether "fusion candidates" are ACTUAL gene fusions vs trans-splicing occurring?

Thanks so much once again,

Yogi

CGND-HRA-00241_star-fusion.fusion_predictions.abridged.tsv

Brian Haas

May 14, 2021, 8:01:43 PM

to Yogindra Raghav, STAR-Fusion

Hi,

additional responses below

On Fri, May 14, 2021 at 3:59 PM Yogindra Raghav <yra...@mit.edu> wrote:

Thanks for the unbelievably quick reply! :)

With est_J/est_S, the thing that was confusing is that it seemed redundant since there are the "JunctionReadCount" and "SpanningFragCount" columns.
Apologies for not clarifying that.

1. What is the difference between "JunctionReadCount" and "est_J" (same for "est_S" and "SpanningFragCount")?

The JunctionReadCount (and SpanningFragCount) are the 'raw' counts of evidence reads. If reads are multi-mapping, they're each counted as full reads here and wherever else they contribute to evidence. The est_J and est_S have reads split across multiply mapped sites (similar to how RSEM or kallisto work).
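
To make the raw versus estimated distinction concrete, here is a toy Python sketch. It assumes the simplest possible scheme, where a read shared by n candidate fusions contributes 1/n to each. STAR-Fusion's actual estimation may weight candidates differently, so this only illustrates the idea; the function name and the read/fusion labels are made up.

```python
from collections import defaultdict


def raw_and_estimated_counts(read_to_candidates):
    """read_to_candidates: {read_id: [fusion candidates the read supports]}.

    Raw counts credit every candidate with a full read; estimated counts
    split each read evenly across its candidates (uniform split assumed).
    """
    raw = defaultdict(int)
    est = defaultdict(float)
    for candidates in read_to_candidates.values():
        if not candidates:
            continue  # a read supporting nothing contributes nothing
        share = 1.0 / len(candidates)
        for fusion in candidates:
            raw[fusion] += 1
            est[fusion] += share
    return dict(raw), dict(est)


# Made-up evidence: read1 is unique, read2 and read3 are shared.
reads = {
    "read1": ["A--B"],
    "read2": ["A--B", "A--C"],
    "read3": ["A--B", "A--C"],
}
raw, est = raw_and_estimated_counts(reads)
```

In this toy case the raw counts sum to 5 even though there are only 3 reads, while the estimated counts sum to 3, which is why the two sets of columns can differ.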

2. With regards to the "red herring" stuff, I used the latest trinityctat/starfusion image which uses STAR-Fusion version 1.10.0
The command I've used with that version is:

STAR-Fusion --left_fq {fastq_R1} --right_fq {fastq_R2} --genome_lib_dir {/path/to/genome/} --CPU 8 --output_dir {out_dir} --FusionInspector validate

Can you let me know why I would be getting these "false positives"?
Are they only false positives in the cancer space or would they be "false positives" in all disease spaces?

NOTE: I've attached a sample abridged TSV output file from STAR-Fusion as an example.

Thanks for sharing this. I apparently didn't include the filtering logic in the last release. I'll post a message to the star-fusion group about this shortly. If you replace the AnnotFilterRule.pm file in your ctat genome lib with this one: https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/AnnotFilterRule.pm then the annotation filters will be properly enabled.

3. Outside of using more CPUs, are there any specific flags or options that will help with increasing speed of processing the large amount of data we have?

If you're planning to run large numbers of samples, your best bet would be to leverage cloud computing. We have the system running on Terra: https://app.terra.bio/#workspaces/ctat-firecloud/ctat-mutations and it's highly scalable.

Otherwise, the CPU parameter is the only thing we have to speed up the process. I highly suggest going with cloud computing though.

4. I meant to compare LOCAL_REARRANGEMENT with INTRACHROMOSOMAL but you've answered my question by explaining LOCAL_REARRANGEMENT.

5. I'm debating whether the "--FusionInspector validate" flag is useful or not. Do you think it would be in my case? Better yet, are there specific use cases for this flag?

If you're running large numbers of samples, I'd not include this as it would double the runtime. Later, if you have fusions of interest you want to explore, you could target those separately with FusionInspector.

6. Since we are looking at Transcriptomics data, how can we detect whether "fusion candidates" are ACTUAL gene fusions vs trans-splicing occurring?

From transcriptome data alone, you can't tell the difference. You'd need DNA-level analysis to confirm genome rearrangements. If you find fusions showing up regularly in normal tissues, the simplest explanation is trans-splicing if it's not some kind of RT artifact.

hope this helps! also, apologies for the annotation filtering issue.

Yogindra Raghav

May 17, 2021, 11:06:15 AM

to STAR-Fusion

Thanks so much again for these wonderful answers!

I just wanted to understand the "red herring" filter stuff better.

I totally get that the wrong annotation file was used.

However, are you saying that finding a fusion that's also documented in "GTEx_recurrent_StarF2019" REGARDLESS OF DISEASE is a false-positive result?

Is it a false-positive because these are fusions that have a lot of evidence for just normally occurring?

Brian Haas

May 17, 2021, 11:33:02 AM

to Yogindra Raghav, STAR-Fusion

It could either be a fusion that routinely shows up in normal samples as an artifact of some sort (which would be a false positive), or it could be a real fusion in normal samples (from trans-splicing or long-range cis splicing). An upcoming version of FusionInspector will help you to differentiate between these two scenarios, but for now, if your interest is disease, you might just ignore the fusions that show up in normals unless there's something particularly interesting about them to you.
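
If you want to set aside the fusions seen in normals after the fact, a small Python sketch along these lines could work. It assumes the abridged TSV has `#FusionName` and `annots` columns and that the tag text appears verbatim in the `annots` field; the function name and the toy rows are made up for illustration.

```python
import csv
import io

NORMAL_TAG = "GTEx_recurrent_StarF2019"


def split_by_normal_annotation(handle, tag=NORMAL_TAG):
    """Return (kept, flagged) fusion names from one abridged TSV handle."""
    # QUOTE_NONE keeps the double quotes inside the annots field literal.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    kept, flagged = [], []
    for row in reader:
        target = flagged if tag in row["annots"] else kept
        target.append(row["#FusionName"])
    return kept, flagged


# Toy rows standing in for a real abridged TSV (made-up fusions).
toy = io.StringIO(
    "#FusionName\tannots\n"
    'GENE1--GENE2\t["GTEx_recurrent_StarF2019","INTRACHROMOSOMAL[chr1:0.01Mb]"]\n'
    'GENE3--GENE4\t["INTERCHROMOSOMAL[chr2--chr7]"]\n'
)
kept, flagged = split_by_normal_annotation(toy)
```

Keeping the flagged list rather than discarding it makes it easy to revisit any of those fusions later if one turns out to be interesting.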

Yogindra Raghav

May 17, 2021, 11:52:17 AM

to STAR-Fusion

Alright, thanks so much for all the replies!
I really appreciate your help. :)

Brian Haas

May 17, 2021, 11:56:07 AM

to Yogindra Raghav, STAR-Fusion

sure thing! best of luck!

Yogindra Raghav

Jun 7, 2021, 2:50:53 PM

to STAR-Fusion

Hello Brian,

I'm about to undertake analysis of over 2K samples using STAR-Fusion.

Before I do that, I just wanted to make 100% sure that I'm using the updated/CORRECT annotation filtering rule file.

The link you sent me previously was:

https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/AnnotFilterRule.pm

The file at that link, as it currently stands, includes the correct annotation filtering rule, right?

Sorry if I'm being a bit extra annoying but I'd rather make sure everything's fundamentally sound before mass analysis.

Thanks,
Yogi

Brian Haas

Jun 8, 2021, 9:55:30 AM

to Yogindra Raghav, STAR-Fusion

Yes, that's the correct/current annot filtering rules. Just replace the one in your current ctat genome lib installation and it should be fine.

best,

~brian

Yogindra Raghav

Jun 16, 2021, 1:14:24 PM

to STAR-Fusion

Hello Brian,

I'm trying to programmatically do some checks to make sure multiple STAR-Fusion jobs have completed FULLY.

I wanted to do this with a script.

I'm curious: if errors occur during the analysis, are the *.tsv files not created?

In other words, is the existence of the *.tsv files proof that the analysis completed fully?

If not, would I parse through output logs and look for the "STAR-Fusion complete" string?

Thanks,
Yogi

Brian Haas

Jun 16, 2021, 3:24:13 PM

to Yogindra Raghav, STAR-Fusion

Hi Yogindra,

If you have final .tsv files, I would assume everything worked as planned. If you're capturing exit codes, the process should exit zero on total success, and non-zero if any failure occurred (which should be standard practice for most tools, and that's what we generally rely on for success/fail status).
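
A small Python sketch of a completion check that combines both signals (exit code and final file) is shown here. The file name is assumed from the attachment shared earlier in this thread, and `job_succeeded` is a made-up helper; in practice the return code would come from however you launch each job (for example the `returncode` of a `subprocess.run` call or your scheduler's accounting).

```python
import tempfile
from pathlib import Path

# Assumed name of the final abridged predictions file; adjust to your setup.
FINAL_TSV = "star-fusion.fusion_predictions.abridged.tsv"


def job_succeeded(returncode, out_dir):
    """True only if the process exited 0 and the final TSV is present."""
    return returncode == 0 and (Path(out_dir) / FINAL_TSV).is_file()


# Demonstration with temporary directories standing in for real output dirs.
with tempfile.TemporaryDirectory() as done_dir, \
        tempfile.TemporaryDirectory() as empty_dir:
    (Path(done_dir) / FINAL_TSV).write_text("#FusionName\n")
    status_done = job_succeeded(0, done_dir)       # exit 0, file present
    status_nonzero = job_succeeded(1, done_dir)    # non-zero exit
    status_missing = job_succeeded(0, empty_dir)   # file absent
```

Requiring both conditions is slightly stricter than either check alone, which seems reasonable when screening a few thousand runs.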

best,

~brian

Yogindra Raghav

Jun 17, 2021, 11:19:56 AM

to bh...@broadinstitute.org, STAR-Fusion

Thanks so much Brian! :)

Yogindra Raghav

YogiOnBioinformatics™

Massachusetts Institute of Technology (MIT)

Department of Biological Engineering

Laboratory of Professor Ernest Fraenkel

Affiliate at Broad Institute of MIT and Harvard

LinkedIn: YogiOnBioinformatics

GitHub: YogiOnBioinformatics

Email: yra...@mit.edu

Alternate Email: yra...@broadinstitute.org

Office Address: 21 Ames St #16-244, Cambridge, MA 02142


Yogindra Raghav

Sep 22, 2021, 5:21:32 PM

to STAR-Fusion

Hello Brian,

MANY MANY CONGRATS on the pre-print! :)

I had 2 follow-up questions.

1) What's the difference in meaning between "NEIGHBORS" vs "NEIGHBORS_OVERLAP" in the "annots" column of STAR-Fusion?

Is it just that if the SHORTEST POSSIBLE distance between two genes is less than 10K base pairs (0.00Mb), then its special category is "NEIGHBORS_OVERLAP"?

To clarify the "SHORTEST POSSIBLE" portion above, let's say you have Gene A that goes from chr1:1-10 and Gene B is from chr1:100-500 (both are on positive strand).

If you have a fusion between the two, the shortest distance between the genes is the start point of Gene B (100) minus the end point of Gene A (10).

Please let me know if my understanding of "NEIGHBORS_OVERLAP" is wrong.

2) In the same vein as the last question, I'm still trying to understand "[some number]" in ONE example such as: "LOCAL_REARRANGEMENT:+/-:[some number]".

You mentioned previously that "[some number]" is the genomic distance between two genes.

This doesn't make sense to me based on the following example.

I found the following fusion: AC010332.1--ZNF880 which had the following "annot" info: ["INTRACHROMOSOMAL[chr19:0.01Mb]","LOCAL_REARRANGEMENT:+:[6864]"]

Looking on UCSC, they're approximately 0.01Mb apart using my "SHORTEST POSSIBLE" distance idea above.

So, what does "6864" represent?

Much appreciated,

Yogi

Brian Haas

Sep 23, 2021, 2:57:38 PM

to Yogindra Raghav, STAR-Fusion

Hi Yogindra,

responses below

On Wed, Sep 22, 2021 at 5:21 PM Yogindra Raghav <yra...@mit.edu> wrote:

Hello Brian,

MANY MANY CONGRATS on the pre-print! :)

thanks!

I had 2 follow-up questions.

1) What's the difference in meaning between "NEIGHBORS" vs "NEIGHBORS_OVERLAP" in the "annots" column of STAR-Fusion?
Is it just that if the SHORTEST POSSIBLE distance between two genes is less than 10K base pairs (0.00Mb), then its special category is "NEIGHBORS_OVERLAP"?

It should mean that the gene spans overlap. There's a gene spans file in the ctat genome lib that has this info for all the genes.

The distance is the length between the spans there.

To clarify the "SHORTEST POSSIBLE" portion above, let's say you have Gene A that goes from chr1:1-10 and Gene B is from chr1:100-500 (both are on positive strand).
If you have a fusion between the two, the shortest distance between the genes is the start point of Gene B (100) minus the end point of Gene A (10).

Please let me know if my understanding of "NEIGHBORS_OVERLAP" is wrong.

The genes should overlap as far as the annotated gene spans go.
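
As a worked illustration of span distance versus span overlap, here is a short Python sketch using the Gene A / Gene B example from the question. It assumes inclusive (start, end) coordinates on a single chromosome and is not taken from FusionAnnotator's code; how the tool itself computes the bracketed number is not confirmed here.

```python
def span_gap(span_a, span_b):
    """Gap in bp between two (start, end) spans; 0 if they overlap."""
    # Sort so the span with the smaller start coordinate comes first.
    (a_start, a_end), (b_start, b_end) = sorted([span_a, span_b])
    return max(0, b_start - a_end)


def spans_overlap(span_a, span_b):
    """True if the two inclusive (start, end) spans share any position."""
    (a_start, a_end), (b_start, b_end) = sorted([span_a, span_b])
    return b_start <= a_end


gene_a = (1, 10)     # Gene A from the question: chr1:1-10
gene_b = (100, 500)  # Gene B from the question: chr1:100-500
gap = span_gap(gene_a, gene_b)            # 100 - 10
overlapping = spans_overlap(gene_a, gene_b)
```

With these toy coordinates the spans do not overlap, so under this reading the pair would not be an overlap case; two spans such as (1, 500) and (100, 200) would be.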

2) In the same vein as the last question, I'm still trying to understand "[some number]" in ONE example such as: "LOCAL_REARRANGEMENT:+/-:[some number]".
You mentioned previously that "[some number]" is the genomic distance between two genes.

This doesn't make sense to me based on the following example.
I found the following fusion: AC010332.1--ZNF880 which had the following "annot" info: ["INTRACHROMOSOMAL[chr19:0.01Mb]","LOCAL_REARRANGEMENT:+:[6864]"]

the local rearrangement should be like the neighbors but the strands are flipped between the two genes.

This is the component that adds all the annotations:

https://github.com/FusionAnnotator/FusionAnnotator/wiki

in case that helps.

Yogindra Raghav

Sep 23, 2021, 4:08:13 PM

to STAR-Fusion

Thanks for answering the questions!

I checked out the gene spans file and everything makes sense now regarding "NEIGHBORS_OVERLAP".

I'm now confused about "LOCAL_REARRANGEMENT".

You mentioned long ago that this indicates cis-splicing is not a possible answer and some local restructuring must have happened to allow for the fusion.

However, in the above, you mentioned that it's "like the neighbors but the strands are flipped between the two genes".

I looked at an example "LOCAL_REARRANGEMENT" fusion from my data.

Both of the genes involved were from the same strand (-) AND that matched the reference in the "gene spans" file.

Do you mind re-explaining what you mean? Apologies for my confusion.

Brian Haas

Sep 23, 2021, 4:41:05 PM

to Yogindra Raghav, STAR-Fusion

ah, it can also involve cases where the genes are on the same strand but reordered.

So, they should be close to each other, but do not involve cis-splicing
