Interpreting STARsolo summary statistics when mapping 10x data to transcriptome

1,042 views
Skip to first unread message

Jacob Musser

unread,
Jun 18, 2019, 6:08:56 AM6/18/19
to rna-star
Hello,

I was very excited to see the new functionality for mapping droplet scRNAseq data in STAR. I have been testing this with some of our 10x data collected from a species of mollusc. The reference we are using is a de novo transcriptome, in which every transcript is a separate sequence in the fasta reference file, and the entire length of each transcript sequence is annotated as "exon" in the gtf file. So, there are no introns, and if a read uniquely maps, then it should count as being mapped to a gene.

In the summary files (e.g. Gene.stats), I see a substantial number of reads that are "nNoFeature". Shouldn't this be theoretically impossible in my situation, or am I misunderstanding what this category represents? Also, I observe different summary statistic numbers in Gene.stats and GeneFull.stats, although I don't have any sequence labelled as introns. Can you offer your thoughts on what would account for this? I pasted the STAR run command, Gene.stats, and GeneFull.stats below if this is helpful.

Sincerely,
Jake

 

STAR run command:
STAR --runThreadN 12 --soloType Droplet --soloStrand Unstranded --soloFeatures Gene GeneFull --soloCBwhitelist 737K-august-2016.txt --genomeDir gtf_final.gtf --sjdbGTFfeatureExon exon --readFilesCommand gunzip -c --readFilesIn AcrB_S2_L001_R2_001.fastq.gz AcrB_S2_L001_R1_001.fastq.gz


Gene.stats
                                        Barcodes:
                                       nNinBarcode          20631
                                   nUMIhomopolymer          34843
                                          nTooMany              0
                                          nNoMatch        7566302
                                              Gene:
                                         nUnmapped          82691
                                        nNoFeature       52950640
                                     nAmbigFeature          24612
                             nAmbigFeatureMultimap          24612
                                          nTooMany          82491
                                     nNoExactMatch          83272
                                       nExactMatch       89367182
                                            nMatch       91095117
                                     nCellBarcodes         267078
                                             nUMIs       12051808


GeneFull.stats
                                        Barcodes:
                                       nNinBarcode          20631
                                   nUMIhomopolymer          34843
                                          nTooMany              0
                                          nNoMatch        7566302
                                          GeneFull:
                                         nUnmapped          82691
                                        nNoFeature       52846572
                                     nAmbigFeature          24654
                             nAmbigFeatureMultimap          24654
                                          nTooMany          82573
                                     nNoExactMatch          83462
                                       nExactMatch       89469225
                                            nMatch       91198871
                                     nCellBarcodes         267139
                                             nUMIs       12058357

Alexander Dobin

unread,
Jun 18, 2019, 7:15:04 PM6/18/19
to rna-star
Hi Jacob,

I have not tested STARsolo with the transcriptome alignments... so you would need to be alpha-tester on that. :)
Could you also post the Log.final.out?
Does each exon in the GTF file have unique "transcript_id" and "gene_id"?
If you are mapping to the transcriptome, you may want to prohibit splicing, with --alignIntronMax 1 .
The spliced reads within each gene will be counted as "nNoFeature" for the "Gene" counts - which only count reads that agree with the transcript structure - which is always "unspliced" in your case.
However, all mapped reads should be counted in the GeneFull counts...

Cheers
Alex

Jacob Musser

unread,
Jun 19, 2019, 3:33:39 PM6/19/19
to rna-...@googlegroups.com
Alex,

No problem, happy to be an alpha-tester :).

For the GTF, there can be several transcript_ids per gene_id. Unless I misunderstand the STAR manual, your quantification at the gene level should count a read for a gene as long as it uniquely maps to a single gene_ID (even if it overlaps multiple transcripts with that geneID). Is this correct?

I am rerunning it now with --alignIntroMax, which I'm guessing should result in equivalent stats for Gene andGeneFull. I'll let you know what I get though.

Below is the Log.final.out from the original run:

                                 Started job on | Jun 18 00:26:09
                             Started mapping on | Jun 18 00:26:15
                                    Finished on | Jun 18 01:50:17
       Mapping speed, Million of reads per hour | 108.49

                          Number of input reads | 151940599
                      Average input read length | 120
                                    UNIQUE READS:
                   Uniquely mapped reads number | 63724056
                        Uniquely mapped reads % | 41.94%
                          Average mapped length | 108.58
                       Number of splices: Total | 69015
            Number of splices: Annotated (sjdb) | 0
                       Number of splices: GT/AG | 20127
                       Number of splices: GC/AG | 2029
                       Number of splices: AT/AC | 776
               Number of splices: Non-canonical | 46083
                      Mismatch rate per base, % | 1.78%
                         Deletion rate per base | 0.08%
                        Deletion average length | 2.04
                        Insertion rate per base | 0.08%
                       Insertion average length | 1.86
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci | 31518606
             % of reads mapped to multiple loci | 20.74%
        Number of reads mapped to too many loci | 1245837
             % of reads mapped to too many loci | 0.82%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches | 0
       % of reads unmapped: too many mismatches | 0.00%
            Number of reads unmapped: too short | 55347055
                 % of reads unmapped: too short | 36.43%
                Number of reads unmapped: other | 105045
                     % of reads unmapped: other | 0.07%
                                  CHIMERIC READS:
                       Number of chimeric reads | 0
                            % of chimeric reads | 0.00%

Alexander Dobin

unread,
Jun 24, 2019, 11:08:46 AM6/24/19
to rna-star
Hi Jacob,

>>> "quantification at the gene level should count a read for a gene as long as it uniquely maps to a single gene_ID (even if it overlaps multiple transcripts with that geneID)."
This is correct.

The Log.final.out looks reasonable, the number of spliced reads is low and cannot explain your observations. I will make some test on my own examples to see if I can reproduce the problems.

Cheers
Alex

Jacob Musser

unread,
Jun 25, 2019, 3:37:00 AM6/25/19
to rna-star
Alex,

Thanks for looking into this. Please let me know if I can help in anyway (e.g. sending you some data to reproduce it). I think there are a growing number of us in evo-devo that would be very keen for your pipeline to work.

I did have one thought about the gene counting. In the case of a genome, transcripts for the same gene will be on the same contig. For mapping to a transcriptome though,  transcripts from the same gene will be on distinct contigs. I don't know how your counting algorithm works, but I thought this was worth considering.

Jake

Alexander Dobin

unread,
Jul 22, 2019, 4:29:33 PM7/22/19
to rna-star
Hi Jacob,

I have found the problem - the Unmapped reads were counted as nNoFeature.
It is fixed now on this branch:
However, the old version's only problem is the nNoFeature number, all other outputs are correct.

I have run STARsolo aligning to the transriptome with --aligIntronMax 1 (prohibiting splicing) --soloStrand Unstranded, and got the expected result:
1. nNoFeature=0
2. Gene and GeneFull stats and matrices are exactly the same.

If you try it out, please let me know if this solves the problem for your cases.

Cheers
Alex

Jacob Musser

unread,
Aug 14, 2019, 12:54:55 PM8/14/19
to rna-star
Alex,

Thanks for looking into this. I will start working with it again and let you know how it goes.

Cheers,
Jake
Reply all
Reply to author
Forward
0 new messages