new release: counting reads per gene

7405 views
Skip to first unread message

Alexander Dobin

unread,
Jun 19, 2015, 5:21:52 PM6/19/15
to rna-...@googlegroups.com
Dear All,

the release 2.4.2a implements the major new feature, counting of read numbers per gene, please see the release notes below.

Cheers
Alex

STAR 2.4.2a 2015/06/19

Counting reads per gene while mapping with --quantMode GeneCounts option.
A read is counted if it overlaps (1nt or more) one and only one gene. Both ends of the paired-end read are checked for overlaps.
The counts coincide with those produced by htseq-count with default parameters.

Requires annotations (GTF or GFF with --sjdbGTFfile option) used at the genome generation step, or at the mapping step.

Outputs read counts per gene into ReadsPerGene.out.tab file with 4 columns which correspond to different strandedness options:
column 1: gene ID
column 2: counts for unstranded RNA-seq
column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
Select the output according to the strandedness of your data.
Note, that if you have stranded data and choose one of the columns 3 or 4, the other column (4 or 3) will give you the count of antisense reads.

With --quantMode TranscriptomeSAM GeneCounts, and get both the Aligned.toTranscriptome.out.bam and ReadsPerGene.out.tab outputs.

Malcolm Cook

unread,
Jun 26, 2015, 10:41:18 PM6/26/15
to rna-...@googlegroups.com
 Hi,

This is indeed welcome news.   

Can you confirm that with "-quantMode GeneCounts" a read (pair) gets counted toward a gene only if its alignment (possibly partially) overlaps any of that gene's exonic sequences (from any isoform), and with only that gene.

In other words, a read aligning wholly  in intronic space would NOT get counted toward that gene.

I _think_ I understood that to be obvious, but, one of my colleagues was unsure, so, I wanted to triple check.

Thanks,

Malcolm Cook

Alexander Dobin

unread,
Jun 30, 2015, 5:25:21 PM6/30/15
to rna-...@googlegroups.com, malcol...@gmail.com
Hi Malcolm,

Can you confirm that with "-quantMode GeneCounts" a read (pair) gets counted toward a gene only if its alignment (possibly partially) overlaps any of that gene's exonic sequences (from any isoform), and with only that gene.

In other words, a read aligning wholly  in intronic space would NOT get counted toward that gene.

Yes, this is correct. This definitions is the same as the "union" (default) mode of htseq-count.

Cheers
Alex

Malcolm Cook

unread,
Jun 30, 2015, 9:16:35 PM6/30/15
to rna-...@googlegroups.com, malcol...@gmail.com
Splendid news - exactly my hopes....
Best,
Malcolm

holger brandl

unread,
Sep 8, 2015, 8:33:50 AM9/8/15
to rna-star
Hello Alex,

built-in quantification is a great feature, but the way it works is imho a bit counter-intuitive and lacks consistency: Since users/me configure STAR according what kind of alignments they'd like to report in the bam-result, they'd expect the quantification mode to include all those alignments when doing the counting. However, it seems that multi-mappers are not counted at all when using --quantMode GeneCounts, even if they are reported in the bam-file.

You mentioned the htseq-count default defaults, but it's a downstream tool, which for which I'd look up an appropriate configuration before using it. The quantMode is built into STAR so naively I'd expected result files to be consistent (ie. what's in the bam should also be in the counts).

It took me a while to figure out this important detail, so I just wanted to share my thoughts as it might confuse other users as well. :-)

Cheers,
Holger

Alexander Dobin

unread,
Sep 9, 2015, 4:55:56 PM9/9/15
to rna-star
Hi Holger,

thanks for this comment, I shall emphasize in the manual that only unique mappers are counted.
My plan is to introduce more type of counting output, including the counting of multi-mappers.

Cheers
Alex

Luis Quintales

unread,
Oct 6, 2015, 2:38:11 PM10/6/15
to rna-star
Hi.

When counting reads per gene while mapping with --quantMode GeneCounts option (is a paired-end experiment), sometimes, column 2 is not the sum of column 3 and column 4. ¿Why?

For example:

ENSG00000157870.14      2207    44      2774
ENSG00000142606.15      63      576     98
ENSG00000142611.16      161     2       159
ENSG00000177133.10      26      0       26
ENSG00000227372.10      2402    77      2491
ENSG00000272153.1       83      806     9

Thanks,
Luis.

Alexander Dobin

unread,
Oct 6, 2015, 3:25:28 PM10/6/15
to rna-star
Hi Luis,

if you have reads that map to genes that overlap on the opposite strands, the unstranded count will count them as ambiguous, while the stranded counting will assign them to separate genes.
Hence the unstranded count is <= sum of stranded.

Cheers
Alex

Peng Huang

unread,
Jul 25, 2016, 12:05:18 PM7/25/16
to rna-star, malcol...@gmail.com
Hi, 
Actually, I found that the gene count outputs were different from the HTSeq-count's output (under union mode).
Overall, there were less unique mapped reads were assigned as "ambiguous" and slightly more reads were assigned as "no feature" by STAR.
Although I guess majority of those genes should be consistent. But I was a little confused about their differences.

Thanks,
Peng.

Alexander Dobin

unread,
Jul 27, 2016, 6:17:43 PM7/27/16
to rna-star, malcol...@gmail.com
Hi Peng,

do you have an example of this difference, preferrably a small set of reads, that you can share?
I have tested many examples to make sure that the counting is exactly equivalent to the htseq-count.

Cheers
Alex

Yuande Tan

unread,
Jul 28, 2016, 3:29:29 PM7/28/16
to rna-star
Hi Alex,
In STARgenome fold, there are seven output files: exonGeTrInfo.tab, geneInfo.tab,sjdbList.fromGTF.out.tab, transcriptInfo.tab,exonInfo.tab             sjdbInfo.txt,sjdbList.out.tab. I used less to look at them but I did not find column names. I also did not find explantation of these columns in the star 2.4.0.1 manual: 

So I have to need your help.

exonGeTrinfo.tab has four columns:

324748

3125090 3126847 2       1352    1872

3131135 3131188 2       1352    1872

3133067 3133192 2       1352    1872  
what is 324748? It is number of exons detected?
column1 is exon start position and column1 is end position, column3 is strand, what are columns 4 and 5?

exonInfo.tab has three columns

324748

0       1757    0

6045    6098    1758

7977    8102    1812

0       1992    0

3064    3201    1993

what are these three columns?


definition of columns in sjdbInfo.txt can be found your Manual.


transcriptInfo.tab has four columns: 


ENSG00000157870.14      2207    44      2774
ENSG00000142606.15      63      576     98
ENSG00000142611.16      161     2       159
ENSG00000177133.10      26      0       26
ENSG00000227372.10      2402    77      2491
ENSG00000272153.1       83      806     9

ENSG00000157870 is EnsEMBL Gene ID, what is .14?

Sorry, Alex, but I need your help to understand the output results before I use them.

Best wishes,

Yuande
 

Alexander Dobin

unread,
Jul 29, 2016, 3:45:17 PM7/29/16
to rna-star
Hi Yuande,

all these files are internal to STAR and should not be considered "output" files. They are used to record information from the GTF file in a simpified form for fast parsing by STAR.
In other words, they simply rehash the GTF file.

Cheers
Alex

W Lee

unread,
May 30, 2017, 3:37:04 PM5/30/17
to rna-star
Hi Alex,

I used STAR for my RNA-seq alignment. My RNA-seq library is prepared using Illumina True seq stranded protocol.
The library type is fr-firststrand.

I was having trouble using htseq-count for the bam files generated using STAR.
But I also used the option of "GeneCounts" as you mentioned.

Now I am confused about which column I should use in the gene count file generate by STAR.
Your inputs will be greatly appreciated.

Thank you!

Wendy

Alexander Dobin

unread,
May 31, 2017, 5:51:04 PM5/31/17
to rna-star
Hi Wendy,

I believe for Illumina Tru-seq stranded protocol you need to use the 4th column.
However, the general rule is to compare the total counts over genes in the 3rd and 4th column, and select the column which has much larger total counts.
If the counts in two columns are not very different, the protocol is unstranded and you need to use the 2nd column.
Yet even an easier method is to look at the N_noFeature line for the 3rd and 4th column and pick the column with the lowest count.

Cheers
Alex

Laurent Marc

unread,
Mar 15, 2018, 6:14:12 AM3/15/18
to rna-star
--Hi Alex,

thank you for this new implementation.
By the way i have a question: as it is mentionnned in this post: https://cgatoxford.wordpress.com/2016/08/17/why-you-should-stop-using-featurecounts-htseq-or-cufflinks2-and-start-using-kallisto-salmon-or-sailfish/
do you think "alignment-independent" tools such as kallisto,salmon or sailfish give better results in quantification than "alignment-dependent" tools such as featureCounts, HTSeq, STAR ?

Thank you,
Laurent --

Alexander Dobin

unread,
Mar 15, 2018, 10:32:30 AM3/15/18
to rna-star
Hi Laurent,

thanks for the interesting question - I have given it a lot of thought.

I had read this post and I think that the study is very biased. I am not going to list all the problems with the study design, but the main problem is that for the alignment methods, the only comparisons are made to simple counting and Cufflinks. If you look at kallisto and Salmon papers, the alignment-based RSEM always comes out on top.
Also, Salmon has an option to input aligned reads and the quantification with aligned reads is slightly better than with k-mers / pseudo-alignments.
As I see it, the advantage of "no-alignment" methods is speed, not the accuracy.

Now, we could ask ourselves how does simple counting compare (e.g. htseq-count) with the maximum likelihood models (e.g. RSEM, Cufflinks, Salmon with aligned reads).
The obvious disadvantage of the simple counting is that it only counts unique mappers, while the ML models "rescue" multimappers. This reduces the statistical power for counting. However, in realistic RNA-seq datasets with modern 2x100 PE reads, the proportion of multimappers is typically <10%. Of course, some genes with a lot of sequence similarity in the genome will be affected more than others.

On the other hand, the problem with the ML models can be exemplified as follows. Imagine that you have two single-transcript paralogous genes A and B with a lot of sequence similarity. In out first sample, 1 read maps uniquely to gene A and 3 reads - to gene B, and 1000 multimappers map equally well to both A and B. Any ML model will distribute the 1000 reads with 1:3 ratio, i.e. A get R1(A)=251 reads total, and B - R1(B)=753 reads. On the other hand, simple counting will just assign 1 read to A and 3 reads to B.
Now, imagine that in the 2nd sample we get the same 1000 AB multimappers, and 3 unique mappers for A and 1 unique mapper for B. Then ML will yield R2(A)=753 and R2(B)=251, and with such a large number of reads to support each gene, we would most certainly claim differential expression for A and B between samples 1 and 2. The simple counting, on the other hand, will tell us that we do not have the statistical power to call differential expression for these two genes - and this is the reality, of course.

To summarize - in my opinion, simple counting has a bit less statistical power, but is more robust.

Cheers
Alex

Lior Pachter

unread,
Mar 16, 2018, 2:36:55 PM3/16/18
to rna-star
Hi Alex, 

Your reply confounds a number of different issues. While it is true that *genomic* multi-mapping is not a major issue now that reads are longer, *transcriptomic* multi-mapping is still high, even with 100bp PE reads. That is to say, reads are frequently ambiguous between isoforms of genes. The main problem with htseq or featurecounts is that reads are not disambiguated between isoforms of genes, and when these isoforms have different lengths, the naïve counting methods can be very inaccurate. This is not an alignment issue but a quantification issue. In other words, simple counting is wrong because the total gene "counts" obtained by aggregating all reads that map to a gene locus is not, in general, going to be proportional to the gene abundance.

Lior

Alexander Dobin

unread,
Mar 16, 2018, 4:08:13 PM3/16/18
to rna-star
Hi Lior,

thanks for your comment!
Finally, I will learn all I always wanted to know about transcript quantification* (*but was afraid to ask).

Let me collect my thoughts and questions, and I will make an extended post over the weekend.

Cheers
Alex

Alexander Dobin

unread,
Mar 20, 2018, 11:28:53 AM3/20/18
to rna-star

Hi Lior,


I think there are two major questions that were indeed somewhat confounded in my previous post.


1. Counting vs Maximum Likelihood (ML) methods.

Read counting (e.g. htseq-count, featureCounts or STAR --quantMode GeneCounts) simply counts the number of uniquely mapped reads that overlap exons of each gene. The Maximum Likelihood methods (e.g. Cufflinks, RSEM, eXpress, SailFish, kallisto, Salmon), calculate the relative abundance of the isoforms (rather than genes) by maximizing the likelihood of observed alignments (Cufflinks, RSEM, eXpress) or k-mers (SailFish) or pseudo-alignments (kallisto, Salmon). In Lior’s words, ML models “disambiguate reads between isoforms of genes”.


As Lior pointed out, “when these isoforms have different lengths, the naïve counting methods can be very inaccurate”. While I agree with this statement, my question is:

How often does this lead to an actual error in differential expression calls in real data?

I would like to point to the paper by Soneson, Love, and Robinson which showed that this effect is practically undetectable in real data.


2. Pseudo-alignment vs. full alignment quantification.

The ML methods can use full alignments as input (Cufflinks, RSEM, eXpress) or pseudo-alignments (Kallisto, Salmon). The main point of my previous post was that pseudo-alignments do not provide more accurate quantifications than full alignments. For instance, the comparison of Kallisto and RSEM performance in Kallisto paper (Fig 2a) shows higher accuracy for RSEM: mean relative difference in estimated transcript read counts 0.03 for RSEM vs 0.05 for Kallisto.

Would you agree that pseudo-alignment quantification is less accurate than full-alignment ML quantification, and the advantage of the pseudo-alignments is only in the speed?


Cheers

Alex



On Friday, March 16, 2018 at 2:36:55 PM UTC-4, Lior Pachter wrote:

Dario Strbenac

unread,
Mar 21, 2018, 11:00:04 PM3/21/18
to rna-star
Good day,

There is a situation where all of the described methods do badly; estimating abundance of HLA genes. The reference genome sequence in the MHC region won't be representative of the gene sequences for the majority of samples because there are thousands of known alleles in the population. Normally, this is ignored by most RNA-seq analyses, but if the biologist asks why the gene counts values of the HLA-E row of your summary matrix doesn't match their experimental data, it demonstrates a need for those genes to be accurately quantitated by dealing with the reference bias.

A solution that I have implemented is to firstly replace the actual DNA sequences of such genes by Ns and map RNA-seq reads to the masked reference genome. Then, the reads without any mappings to the reference genome are mapped to the HLA allele database IMGT/HLA not allowing any mismatches and unlimited mapping locations, to be resolved by RSEM. The allele(s) with highest estimated counts correspond(s) to (the) allele(s) determined by sequence-based typing (SBT) of the same sample.

- Dario.

Ahmed Gamal

unread,
Jun 6, 2022, 12:59:36 PMJun 6
to rna-star
Hi Alex,

HTSeq has 3 modes to "handle reads overlapping more than one feature": union, intersection_strict and intersection_nonempty. which modes does STAR use for counting? And can we change the default mode?

Thank you!

Alexander Dobin

unread,
Jun 6, 2022, 1:00:49 PMJun 6
to rna-star
Hi,

STAR uses the "union" mode. This is only mode implemented for at the moment.

Cheers
Alex

Ahmed Gamal

unread,
Jun 6, 2022, 9:56:03 PMJun 6
to rna-star
Hi Dario,

My issue is that I get about 5-6 times more ambiguous reads than the counted ones. So, I was just wondering if STAR uses the "union" mode which is also the default mode in HTSeq and if there's a way to change that. There is a category of cases which the "union" mode classifies as ambiguous whereas the other 2 modes would normally count.


--
You received this message because you are subscribed to the Google Groups "rna-star" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rna-star+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rna-star/e7a3505e-03c5-4811-be2b-45e07e3b4008n%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages