Majority of reads belong to "N_noFeature"

179 views
Skip to first unread message

Zhen Shi

unread,
May 5, 2016, 9:49:02 AM5/5/16
to rna-star
Hi here,
I am mapping Illumina TruSeq 2x76 bp PE reads to the mouse genome. I build the genome index with this argument
STAR --runThreadN 12 --runMode genomeGenerate --genomeDir $WORKDIR --genomeFastaFiles Mus_musculus.GRCm38.dna.primary_assembly.fa --sjdbGTFfile gencode.vM9.primary_assembly.annotation.gtf --sjdbOverhang 75


Then for mapping (I only wanna analyze unique reads)

PARAMS="--outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 30 --outFilterMultimapNmax 1 --outFilterIntronMotifs RemoveNoncanonical --readFilesCommand zcat --outSAMtype BAM Unsorted SortedByCoordinate --quantMode TranscriptomeSAM GeneCounts --outWigType bedGraph read1_5p --outReadsUnmapped Fastx"

STAR --readFilesIn ZS1-WTR0-CTTGTA_S8_R1_001.fastq.gz ZS1-WTR0-CTTGTA_S8_R2_001.fastq.gz --genomeDir $GENOME $PARAMS --outFileNamePrefix ZS1-WTR0.


From Log.final.out it seems all went well, however there are very few lines in the ReadsPerGene.out.tab file. Then I look at the first few lines:
head *ReadsPerGene.out.tab


N_unmapped 9521047 9521047 9521047

N_multimapping 0 0 0

N_noFeature 53548099 53570221 53548309

N_ambiguous 1 0 0

ENSMUSG00000079800.2 0 0 1

ENSMUSG00000095092.1 0 0 0

ENSMUSG00000079192.2 120 1 120

ENSMUSG00000079794.2 2 2 0

ENSMUSG00000094799.1 0 0 0

ENSMUSG00000095250.1 0 0 0



It's strange that the vast majority belong to  "N_noFeature", however I feel the mapping went ok cuz ~85% reads map uniquely to the genome. This is how the Log.final.out looks:

                                 Started job on | May 04 15:01:33

                             Started mapping on | May 04 15:33:32

                                    Finished on | May 04 17:28:16

       Mapping speed, Million of reads per hour | 32.99


                          Number of input reads | 63091479

                      Average input read length | 150

                                    UNIQUE READS:

                   Uniquely mapped reads number | 53570432

                        Uniquely mapped reads % | 84.91%

                          Average mapped length | 149.67

                       Number of splices: Total | 18307132

            Number of splices: Annotated (sjdb) | 12310

                       Number of splices: GT/AG | 18180161

                       Number of splices: GC/AG | 112361

                       Number of splices: AT/AC | 14610

               Number of splices: Non-canonical | 0

                      Mismatch rate per base, % | 0.20%

                         Deletion rate per base | 0.01%

                        Deletion average length | 1.39

                        Insertion rate per base | 0.01%

                       Insertion average length | 1.19

                             MULTI-MAPPING READS:

        Number of reads mapped to multiple loci | 0

             % of reads mapped to multiple loci | 0.00%

        Number of reads mapped to too many loci | 5338404

             % of reads mapped to too many loci | 8.46%

                                  UNMAPPED READS:

       % of reads unmapped: too many mismatches | 0.09%

                 % of reads unmapped: too short | 6.41%

                     % of reads unmapped: other | 0.13%

                                  CHIMERIC READS:

                       Number of chimeric reads | 0

                            % of chimeric reads | 0.00%



May I ask did I do something wrong? 

Thank you so much! Any suggestion appreciated!

-Zhen


Alexander Dobin

unread,
May 5, 2016, 10:50:19 AM5/5/16
to rna-star
Hi Zhen,

this is strange indeed. Have you looked at the wiggle files in the browser? Does it look like most of the single does not overlap annotated exons?
I would recommend that you map a public mouse dataset to check your pipeline -  you should see the majority of reads overlapping the genes, and <20% "noFeature" .

Cheers
Alex

Zhen Shi

unread,
May 12, 2016, 6:38:47 PM5/12/16
to rna-star
Hi Alex,
I figured out why: so the .fa sequence file from Ensembl and .gtf file from Gencode has different nomenclatures.  
Ensembl chromosome coordinates use ‘1’ ‘2’….whereas Gencode uses ‘chr1’’chr2’    

I fixed this and now the results look good. 
Thanks again,
-Zhen
Reply all
Reply to author
Forward
0 new messages