Interpreting STARsolo summary statistics when mapping 10x data to transcriptome

Jacob Musser

unread,

Jun 18, 2019, 6:08:56 AM6/18/19

to rna-star

Hello,

I was very excited to see the new functionality for mapping droplet scRNAseq data in STAR. I have been testing this with some of our 10x data collected from a species of mollusc. The reference we are using is a de novo transcriptome, in which every transcript is a separate sequence in the fasta reference file, and the entire length of each transcript sequence is annotated as "exon" in the gtf file. So, there are no introns, and if a read uniquely maps, then it should count as being mapped to a gene.

In the summary files (e.g. Gene.stats), I see a substantial number of reads that are "nNoFeature". Shouldn't this be theoretically impossible in my situation, or am I misunderstanding what this category represents? Also, I observe different summary statistic numbers in Gene.stats and GeneFull.stats, although I don't have any sequence labelled as introns. Can you offer your thoughts on what would account for this? I pasted the STAR run command, Gene.stats, and GeneFull.stats below if this is helpful.

Sincerely,

Jake

STAR run command:

STAR --runThreadN 12 --soloType Droplet --soloStrand Unstranded --soloFeatures Gene GeneFull --soloCBwhitelist 737K-august-2016.txt --genomeDir gtf_final.gtf --sjdbGTFfeatureExon exon --readFilesCommand gunzip -c --readFilesIn AcrB_S2_L001_R2_001.fastq.gz AcrB_S2_L001_R1_001.fastq.gz

Gene.stats

Barcodes:

nNinBarcode 20631

nUMIhomopolymer 34843

nTooMany 0

nNoMatch 7566302

Gene:

nUnmapped 82691

nNoFeature 52950640

nAmbigFeature 24612

nAmbigFeatureMultimap 24612

nTooMany 82491

nNoExactMatch 83272

nExactMatch 89367182

nMatch 91095117

nCellBarcodes 267078

nUMIs 12051808

GeneFull.stats

Barcodes:

nNinBarcode 20631

nUMIhomopolymer 34843

nTooMany 0

nNoMatch 7566302

GeneFull:

nUnmapped 82691

nNoFeature 52846572

nAmbigFeature 24654

nAmbigFeatureMultimap 24654

nTooMany 82573

nNoExactMatch 83462

nExactMatch 89469225

nMatch 91198871

nCellBarcodes 267139

nUMIs 12058357

Alexander Dobin

unread,

Jun 18, 2019, 7:15:04 PM6/18/19

to rna-star

Hi Jacob,

I have not tested STARsolo with the transcriptome alignments... so you would need to be alpha-tester on that. :)

Could you also post the Log.final.out?

Does each exon in the GTF file have unique "transcript_id" and "gene_id"?

If you are mapping to the transcriptome, you may want to prohibit splicing, with --alignIntronMax 1 .

The spliced reads within each gene will be counted as "nNoFeature" for the "Gene" counts - which only count reads that agree with the transcript structure - which is always "unspliced" in your case.
However, all mapped reads should be counted in the GeneFull counts...

Cheers

Alex

Jacob Musser

unread,

Jun 19, 2019, 3:33:39 PM6/19/19

to rna-...@googlegroups.com

Alex,

No problem, happy to be an alpha-tester :).

For the GTF, there can be several transcript_ids per gene_id. Unless I misunderstand the STAR manual, your quantification at the gene level should count a read for a gene as long as it uniquely maps to a single gene_ID (even if it overlaps multiple transcripts with that geneID). Is this correct?

I am rerunning it now with --alignIntroMax, which I'm guessing should result in equivalent stats for Gene andGeneFull. I'll let you know what I get though.

Below is the Log.final.out from the original run:

Started job on | Jun 18 00:26:09

Started mapping on | Jun 18 00:26:15

Finished on | Jun 18 01:50:17

Mapping speed, Million of reads per hour | 108.49

Number of input reads | 151940599

Average input read length | 120

UNIQUE READS:

Uniquely mapped reads number | 63724056

Uniquely mapped reads % | 41.94%

Average mapped length | 108.58

Number of splices: Total | 69015

Number of splices: Annotated (sjdb) | 0

Number of splices: GT/AG | 20127

Number of splices: GC/AG | 2029

Number of splices: AT/AC | 776

Number of splices: Non-canonical | 46083

Mismatch rate per base, % | 1.78%

Deletion rate per base | 0.08%

Deletion average length | 2.04

Insertion rate per base | 0.08%

Insertion average length | 1.86

MULTI-MAPPING READS:

Number of reads mapped to multiple loci | 31518606

% of reads mapped to multiple loci | 20.74%

Number of reads mapped to too many loci | 1245837

% of reads mapped to too many loci | 0.82%

UNMAPPED READS:

Number of reads unmapped: too many mismatches | 0

% of reads unmapped: too many mismatches | 0.00%

Number of reads unmapped: too short | 55347055

% of reads unmapped: too short | 36.43%

Number of reads unmapped: other | 105045

% of reads unmapped: other | 0.07%

CHIMERIC READS:

Number of chimeric reads | 0

% of chimeric reads | 0.00%

Alexander Dobin

unread,

Jun 24, 2019, 11:08:46 AM6/24/19

to rna-star

Hi Jacob,

>>> "quantification at the gene level should count a read for a gene as long as it uniquely maps to a single gene_ID (even if it overlaps multiple transcripts with that geneID)."

This is correct.

The Log.final.out looks reasonable, the number of spliced reads is low and cannot explain your observations. I will make some test on my own examples to see if I can reproduce the problems.

Cheers

Alex

Jacob Musser

unread,

Jun 25, 2019, 3:37:00 AM6/25/19

to rna-star

Alex,

Thanks for looking into this. Please let me know if I can help in anyway (e.g. sending you some data to reproduce it). I think there are a growing number of us in evo-devo that would be very keen for your pipeline to work.

I did have one thought about the gene counting. In the case of a genome, transcripts for the same gene will be on the same contig. For mapping to a transcriptome though, transcripts from the same gene will be on distinct contigs. I don't know how your counting algorithm works, but I thought this was worth considering.

Jake

Alexander Dobin

unread,

Jul 22, 2019, 4:29:33 PM7/22/19

to rna-star

Hi Jacob,

I have found the problem - the Unmapped reads were counted as nNoFeature.

It is fixed now on this branch:

https://github.com/alexdobin/STAR/tree/soloDevelop

However, the old version's only problem is the nNoFeature number, all other outputs are correct.

I have run STARsolo aligning to the transriptome with --aligIntronMax 1 (prohibiting splicing) --soloStrand Unstranded, and got the expected result:

1. nNoFeature=0

2. Gene and GeneFull stats and matrices are exactly the same.

If you try it out, please let me know if this solves the problem for your cases.

Cheers

Alex

Jacob Musser

unread,

Aug 14, 2019, 12:54:55 PM8/14/19

to rna-star

Alex,

Thanks for looking into this. I will start working with it again and let you know how it goes.