STAR parameters for both read quantification and fusion finding

John S

unread,

May 10, 2020, 4:08:57 PM5/10/20

to rna-star

Hey

I wanted to try and something like this - run STAR once, and then use the outputs it produces for downstream quantification (via RSEM or Salmon), and fusion finding as well (via STAR-Fusion).

I first set the parameters on STAR as recommended by the RSEM toolkit (the Encode pipeline parameters).

Then I started adding in parameters as recommended by the STAR-Fusion toolkit (by looking at the SF code to see which parameters it uses).

Then I wanted to see how the quantification results (i.e. TPM) change after setting these additional parameters on STAR.

I analyzed this only for one sample currently, and wanted to just ask a few question before continuing on.

So, after running my quantification tool for different parameter-space configurations, I calculated the Pearson coefficient of correlation (if some other measure is better, please do let me know) with respect to the configuration where only the quantification parameters are set (Encode pipeline), and no STAR-Fusion parameters are present at all.

If adding in all parameters that STAR-Fusion 1.9 recommends, I get the Pearson coefficient of 0.95 (for my particular sample in question).

Then I started playing around with the parameters, and noted that the parameters that affect this the most are the peOverlap parameters (i.e. --peOverlapNbasesMin and --peOverlapMMp).

The defaults are 0 and 0.01, while the STAR-Fusion code sets this to 12 and 0.1, respectively.

If leave these two parameters to their default values, I get the Pearson coefficient of 0.99.

So I just wanted to write up my observing and possibly get an opinion on the feasibility of running such a pipeline.

Running alignment once and then doing downstream analysis would surely save a lot of time/compute-power, hence my wish to try this.

Even though the peOverlap parameters aren't in the Encode pipeline (they probably didn't even exist when the Encode pipeline was specified first), it doesn't seem to me that I should always explicitly leave them as defaults, even for quantification - based on the description of the peOverlap parameters, merging of reads with short insert sizes doesn't seem like something critical that should not be done (maybe even the opposite?).

If you have any comments on this (Alex or Brian), I'll gladly listen you your advice!

Thank you upfront for any answers as well!

Alexander Dobin

unread,

May 13, 2020, 6:36:03 PM5/13/20

to rna-star

Hi John,

this is an interesting observation, and a bit unexpected.

--peOverlapNbasesMin > 0 activates the algorithm that merges overlapped mates and remaps them as single-end.

This option is required for capturing chimeric junctions in overlapping mates.

However, I did not expect it would affect quantification strongly.

A couple of questions:

1. What is the distribution of insert sizes in your sample? The --peOverlapNbasesMin should only affect samples that have a large number of reads with insert size < 2*readLength (i.e. overlapping).

2. How do the mapping stats (Log.final.out) compare for the two runs?

Cheers

Alex

John S

unread,

Jul 16, 2020, 6:18:58 PM7/16/20

to rna-star

Hi Alex

So sorry for not replying earlier, google groups wasn't sending me notifications, and I totally sidetracked this thread, sorry again!

1. The insert size distribution of this sample looks like this:

mean: 287, median:390, min: 37, max: 1531706

2. The mapping percentage is close in all cases:

a. with just ENCODE parameters for quantification: 77.43%

b. with added parameters recomended by STAR-Fusion 1.9 (including the --peOverlapNBasesMin and --peOverlapMMp): 78.71%

c. with removing the --peOverlapNBasesMin and --peOverlapMMp from the previous scenario: 77.54%

Alexander Dobin

unread,

Jul 22, 2020, 11:24:51 AM7/22/20

to rna-star

Hi John,

thanks for looking into it!

The insert size is quite large in these samples (this should be compared to the read length - what was it?),

and so the merging of overlapping mates improves the mappability by only ~1%.

Still, this does not explain the small Pearson correlation coefficient.

My suspicion that this is caused by one (or very few) outlier genes, as outliers may strongly affect the Pearson correlation coefficient.

One scenario I had seen before is that one short non-coding RNA is highly expressed, but without merging mates its reads cannot be mapped.

Could you please make a scatter plot of gene expression values (log scale) for two sets of parameters?

Also, it may be better to use Sperman correlation or Pearson(log(TPM+0.1)) to mitigate the outliers.

Cheers

Alex

John S

unread,

Jul 24, 2020, 5:43:32 PM7/24/20

to rna-...@googlegroups.com

Hi Alex!

The read length is 100

The Pearson correlation that I reported was actually calculated as you described, log(TPM+0.1)

If I calculate the Spearman correlation in the same way, the results a bit better, tho similar:

0.994 (Spearman) compared to 0.991 (Pearson) between ENCODE parameters and SF1.9 parameters without peOverlap parameters

0.963 (Spearman) compared to 0.952 (Pearson) between ENCODE parameters and SF1.9 parameters with the peOverlap parameters

Here's the scatter plot, where the ENCODE case had expression levels >0 but <50

legend:

sf_one is the ENCODE parameter set

sf_1_9 is with added STAR-Fusion 1.9 parameters

sf_1_9_stripped is with added SF1.9 params, but without the peOverlap parameters

Here's the unfiltered scatter plot

So I guess there's some variations in these 'lowly' expressed genes

I can also attach the dataset with TPM values, in case I'm doing something wrong with analyzing this

Thanks for the answer by the way!

EDIT: for some reason, the images aren't loading for me on Google Groups. Is it the same for you? I can't even load the old images that I used to be able to see

filtered.png

Alexander Dobin

unread,

Jul 25, 2020, 6:30:06 PM7/25/20

to rna-star

Hi John,

thanks a lot for this investigation, I can see the figures all right.

So it seems this effect is not due to highly expressing outliers...

How much did the mappability change between the 3 cases, could you post the Log.final.out file?

Cheers

Alex

John S

unread,

Jul 27, 2020, 5:34:41 AM7/27/20

to rna-star

Hi Alex, sure, here are the Log.final.out files for all three case

encode_params.Log.final.out

sf-1-9_params.Log.final.out

sf-1-9-no-peOverlap_params.Log.final.out

Alexander Dobin

unread,

Aug 10, 2020, 1:01:58 PM8/10/20

to rna-star

Hi John,

thanks!

There is a small increase in mappability, with unique mappers up 1.4%, while multimappers are down by 0.3%.

I am not sure why it leads to a relatively large change in the RSEM gene expression. I wonder if it points out to some kind of instability.

If you are interested to continue investigating it,

the next step I would recommend is to assess gene expression with simple gene counting, i.e. use --quantMode GeneCounts, or run featureCounts.

I expect that the difference in gene expression with and without peOverlap should be much smaller than the one you see with RSEM - it will be interesting to see what comes out.

Cheers

Alex

John S

unread,

Aug 21, 2020, 7:44:34 PM8/21/20

to rna-star

Hi Alex

Sorry again for the long silence, again I wasn't receiving notification here

For comparing STAR counts between the three scenarios (basic quantification params, added SF1.9 params, and with added SF1.9 params but with removed peOverlap params), I got this:

This is straight from STAR, without any filtering. If I filter things out a little bit (set all expression levels where counts are less than 5 to zero, and then remove all genes where all three approaches have zero values), I get this:

So only in the case of applying this filtering, and calculating the Pearson coefficient, I again get "similar" results, i.e. 0.95 vs 0.99, for the cases with and without peOverlap parameters, comparing to the basic quantification parameter space.

Here's the scatter plot obtained from the filtered dataset

Plotting the un-filtered dataset just results in a straight line basically

So yeah, this was all interesting.. I'm still not sure how to interpret it though, if you have any thoughts, I'm all ears :)

Thanks once more for the interest in this!

Screen Shot 2020-08-22 at 1.34.06 AM.png

Alexander Dobin

unread,

Sep 1, 2020, 4:24:06 PM9/1/20

to rna-star

Hi John,

many thanks for the calculations!

Now I think it's becoming really interesting - since both quantification approaches confirm it, it cannot be some artifact of quantification. The fact that the correlation is lower after you filter low expressed genes is probably due to the gene that have 0 counts - there are a lot of them, and if you include them in R correlation, they will artificially drive it higher.

What is a bit strange - on the scatter plot it looks like "basic" counts are higher for most genes than peOverlap counts, while the total number of unique reads is higher for the peOverlap reads. I guess there must be some compensation for higher expressed genes.

If you want to keep digging into this topic, the next step would be to see which genes are the most affected - is there anything special about them, e.g. biotype, lengths, paralogs, etc? I would start with those genes that get 0 counts in peOverlap mapping, and largest counts in basic mapping.

On my side, I will try to check this effect on GM12878 standard ENCODE samples.

Cheers

Alex

John S

unread,

Sep 8, 2020, 4:00:26 PM9/8/20

to rna-star

Hi Alex!

This is interesting - when I look at the genes that you suggested (peOverlap mapping 0, and basic mapping largest counts), these all seem to be pseudogenes.

This was at a first glance, but I'll keep looking to see if I find some more correlating properties.

Thanks again for the suggestions, this research is very interesting!

Alexander Dobin

unread,

Sep 17, 2020, 11:08:52 AM9/17/20

to rna-star

Hi John,

this is very interesting indeed!

Pseudogenes generally get a number of false alignments - the ones for which splicing is harder to find than alignments with mismatches.

It could be the case here - it's easier to find spliced alignments for a merged SE reads.

Cheers

Alex

john15...@gmail.com

unread,

Sep 17, 2020, 11:18:08 AM9/17/20

to rna-star

Hi Alex

So, I guess it's ok then to assume that using the peOverlap parameters does not negatively affect the overall quantification procedure, just being aware of the different counts for pseudo-genes overall.

In the end, it seems viable doing alignment with STAR once, and then proceeding with multiple downstream applications (STAR-Fusion, RSEM/Salmon), without hurting any of the analyses.

Thanks a lot for your answers Alex in helping to understand this!

Cheers!

Alexander Dobin

unread,

Sep 18, 2020, 10:42:06 AM9/18/20

to rna-star

Hi John,

thanks again for the thorough investigation!

I think one last thing :) to confirm your conclusion is to calculate the correlations with pseudogenes omitted.

If it comes out high, it will confirm that the overlapped-mates calculation will produce similar gene expression to the standard one for "relevant" genes.

Actually, is probably doing a better job than the "standard" one for pseudogenes - i.e. detecting fewer of them.

Cheers

Alex

john15...@gmail.com

unread,

Sep 20, 2020, 7:01:19 AM9/20/20

to rna-star

Hi Alex

You were right on the spot! After removing pseudogenes, the correlations all go well into the 0.99s (Pearson, Spearman, filtered, non-filtered, TPMs, raw gene counts...).

Thanks for the immensely useful tips! I will also be testing this hypothesis now with additional samples, and eventually report here to confirm these findings.

Would you mind once more here reiterating the reason for why the inclusion of peOverlap parameters is detecting fewer pseudogenes (and thus doing a better job than the scenario where peOverlap parameters are not included)?

Again, thanks a million for you answers!

Alexander Dobin

unread,

Oct 21, 2020, 2:09:34 PM10/21/20

to rna-star

Hi John,

sorry for the belayed reply, your message slipped my attention.

My explanation (which requires some proof) is as follows.

Spliced alignments are harder to find than unspliced. This is true in general, but in this case specifically for the overlapping mates, each mate has to be splice through the same junctions separately to find the PE spliced alignment.

If the merge-overlapping-mates algorithm is used, this problem is alleviates, since only one sequence has to be spliced.

If correct spliced alignment cannot be found, often, the next best alignment is to the processed pseudogene - without splicing, but with a few extra mismatches.

This explains your observation the pseudogene counts get reduced when you switch on the merge-overlapping-mates algorithm.

Cheers

Alex

john15...@gmail.com

unread,

Oct 21, 2020, 6:07:47 PM10/21/20

to rna-star

Hi Alex

Wow, thanks again

So, when paired-end reads overlap, both read pairs must align to the same junction in order for STAR to report this as spliced alignment?

If one read aligns to the junction, but the other doesn't (lets say it's somewhere downstream of the junction), this won't count as a spliced alignment? Or does this depend on the overlap length itself?

If I'm missing something basic here, please forgive me 😅

Alexander Dobin

unread,

Oct 24, 2020, 2:03:34 PM10/24/20

to rna-star

Hi John,

sorry, I was not explaining it clearly. The splice junctions have to coincide in the overlap sequence, e.g.

Mate1: XXXXXXXXX------------------XXXX
Mate2:      XXXX------------------XXXXXXXXXXXXXXX

On the other hand, this is not allowed:
Mate1: XXXXXXXXX------------------XXXX
Mate2:   XXXX---------------------XXXXXXXXXXXXXXX