OutSJfilter* OverhangMin and CountTotalMin

Assa Yeroslaviz

unread,

Sep 21, 2015, 6:03:48 AM9/21/15

to rna-star

Hi All,

I am running rna-star trying to map drosophila samples to the dm6 genome build.
We are interested in alternative splice junctions and therefore test those parameter to fit to our needs.
These are the parameters we are using which concern the output of the splice junctions, either in the bam file or the SJ.out.tab file:

--alignSJoverhangMin 8 --alignSJDBoverhangMin 5 --outSJfilterOverhangMin 30 12 12 12 --outSJfilterCountTotalMin 3 1 1 1

We were hoping to get in the SJ.out.tab file only SJ which have multiple reads overlapping them. But When I have looked at the data in the Genome Browser, I see a lot of reads which have "only" 1-4 reads, which is lower than the threshold for the annotated junctions.

I was wondering if these numbers are the numbers of the specific reads found during the mapping or do these numbers correspond to the number of collapsed and therefore the value 1 can mean more than just 1 read.
This apply also to the filter parameter used for the output. Does the parameter --outSJfilterCountTotalMin 3 1 1 1 means 1 collapsed read or 1 unique read?

How can we make sure that the SJ.out.tab will contain only those SJ with at least 5 real reads in it?

thanks
Assa

Alexander Dobin

unread,

Sep 22, 2015, 12:45:39 PM9/22/15

to rna-star

Hi Assa,

the outSJfilter* apply only to unannotated junctions, since the annotated ones are trustworthy.

Column 6 is the annotation status (0=no,1=yes).

After the SJ.out.tab is generated, you can filter all junctions, annotated and novel, by columns 7/8 (number of unique/multiple reads per junction.

--outSJfilterCountTotalMin filters novel junctions by the number of (unique+multi)mapping reads (i.e. col7+ col8), while

--outSJfilterCountUniqueMin filters novel junctions by the number of unique reads (i.e. col7)

Cheers

Alex

Assa Yeroslaviz

unread,

Sep 23, 2015, 10:15:38 AM9/23/15

to rna-...@googlegroups.com

Hi Alex,

thanks for the reply.
I still don't really understand the results of my output file (See also my question in seqanswers.com).

I have set the parameters to give me only those junction in the SJ.tab.out file, which have atleast 5 (or 8) reads covering them. But when I look at the output I have also for the un-annotated junction some with the number 1.

here is the head of one of the outputs:
head FILE.STAR.SJ.out.tab
2L      8117    8192    1       1       1       26      0       35
2L      8117    8228    1       1       1       1       0       12
2L      11345   11409   2       2       1       2       0       15
2L      11519   11778   2       2       1       20      0       38
2L      12222   12285   2       2       1       32      0       37
2L      12929   13463   2       2       0       1       0       26
2L      12929   13519   2       2       1       170     0       38

If I am using these parameters
--outFilterType BySJout
--outSJfilterCountTotalMin 10 8 8 8
--outSJfilterReads default: All

--outSJfilterOverhangMin 30 12 12 12

--alignSJoverhangMin 8
--alignSJDBoverhangMin 5
--outFilterMultimapNmax 20

Why do i also get un-annotated junctions with less than 5 reads coverting them (see row 6 in my results file).

thanks
Assa

Alexander Dobin

unread,

Sep 23, 2015, 4:50:01 PM9/23/15

to rna-star

Hi Assa,

there are two parameters that control output of the junctions according to the read counts per junction: --outSJfilterCountTotalMin and --outSJfilterCountUniqueMin.

Unlike other filters, these two work in the OR mode, i.e. passing one of them is enough. Note that Unique counts are always <= Total counts.

Since the default value of --outSJfilterCountUniqueMin is 3 1 1 1, a canonical junctions with 1 unique read support is allowed.

If you want to only use --outSJfilterCountTotalMin filter, you need to switch off the --outSJfilterCountUniqueMin with very large numbers or -1 (e.g. -1 -1 -1 -1).

Cheers

Alex

Assa Yeroslaviz

unread,

Sep 24, 2015, 12:58:52 AM9/24/15

to rna-star

Thanks Alex for the explanation. I will try to set the parameter to -1 -1 -1 -1 ans see if I can get rid of all small numbers

Assa

Assa Yeroslaviz

unread,

Oct 2, 2015, 10:31:10 AM10/2/15

to rna-...@googlegroups.com

Hi Alex,

it is nor running according to what I have expected.
These are the parameters I am using:
—outFilterMultimapNmax 10
—outFilterMismatchNoverLmax 0.4
—alignIntronMax 100000
—alignSJoverhangMin 8
—alignSJDBoverhangMin 5
—outFilterType BySJout
—outSJfilterCountTotalMin 8 5 5 5
—outSJfilterCountUniqueMin -1 -1 -1 -1
—outSAMtype BAM SortedByCoordinate
—outWigType wiggle read1_5p
—outReadsUnmapped Fastx
—outSAMstrandField intronMotif
—outSAMattributes All
—quantMode TranscriptomeSAM GeneCounts
(The important one for this question are in bold)

I want only those SJ in my SJ.out.tab file, which have at least 5 reads ( or 8 for the non-canonical motifs).

I am running the command and it looks quite good, but not good enough. I still get this kind of rows:
awk 'FNR==5442 {print}' STAR.SJ.out.tab
2L 11438179 11487284 1 1 0 1 0 22

This row is unannotated (column 6) and have "only" one unique and no multi-mapped reads overlapping this region. I don't want these reads to be in my SJ.out.tab file

I know I can use awk to filter all the junctions with less than 5 reads, but
How can I discard them upfront?

thanks

Assa

Alexander Dobin

unread,

Oct 5, 2015, 4:16:08 PM10/5/15

to rna-star

Hi Assa,

I cannot replicate this problem in my tests. Could you please send me the Log.out file of your run, the link to the genome and the minimal set of reads that replicates this problem.

Thanks!

Alex

Assa Yeroslaviz

unread,

Oct 6, 2015, 5:06:18 AM10/6/15

to rna-star

Hi Alex,

thanks again for the help

I'm attaching here the Log file and the minimal sam file of the region you've asked for. I extracted it by doing

samtools view STAR.sorted.bam 2L:11408179-11507284 > minimalSample.sam

The genome is too big, so i can't attach it. I am analysing the drosophila genome using the dm6 genome build.

minimalSample.sam

IFM_Myoblast_3.STAR.Log.out

Assa Yeroslaviz

unread,

Oct 6, 2015, 5:06:34 AM10/6/15

to rna-...@googlegroups.com

Hi Alex,

thanks again for the help

I'm attaching here the Log file and the minimal sam file of the region you've asked for. I extracted it by doing

samtools view STAR.sorted.bam 2L:11408179-11507284 > minimalSample.sam

The genome is too big, so i can't attach it. I am analysing the drosophila genome using the dm6 genome build. here is the link to genome I used - Drosophila_melanogaster.BDGP6.80.fa

minimalSample.sam

IFM_Myoblast_3.STAR.Log.out

Alexander Dobin

unread,

Oct 8, 2015, 1:02:42 PM10/8/15

to rna-star

Hi Assa,

I am looking at your command line as recorded in the Log.out file (it's marked with "##### Command Line:"), and I do not see the

--outSJfilterCountUniqueMin -1 -1 -1 -1 option. Please make sure that this option is supplied on the command line. You can also check "##### Final user re-defined parameters" in the Log.out file.