Majority of reads belong to "N

Zhen Shi

unread,

May 5, 2016, 9:49:02 AM5/5/16

to rna-star

Hi here,

I am mapping Illumina TruSeq 2x76 bp PE reads to the mouse genome. I build the genome index with this argument

STAR --runThreadN 12 --runMode genomeGenerate --genomeDir $WORKDIR --genomeFastaFiles Mus_musculus.GRCm38.dna.primary_assembly.fa --sjdbGTFfile gencode.vM9.primary_assembly.annotation.gtf --sjdbOverhang 75

Then for mapping (I only wanna analyze unique reads)

PARAMS="--outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 30 --outFilterMultimapNmax 1 --outFilterIntronMotifs RemoveNoncanonical --readFilesCommand zcat --outSAMtype BAM Unsorted SortedByCoordinate --quantMode TranscriptomeSAM GeneCounts --outWigType bedGraph read1_5p --outReadsUnmapped Fastx"

STAR --readFilesIn ZS1-WTR0-CTTGTA_S8_R1_001.fastq.gz ZS1-WTR0-CTTGTA_S8_R2_001.fastq.gz --genomeDir $GENOME $PARAMS --outFileNamePrefix ZS1-WTR0.

From Log.final.out it seems all went well, however there are very few lines in the ReadsPerGene.out.tab file. Then I look at the first few lines:

head *ReadsPerGene.out.tab

N_unmapped 9521047 9521047 9521047

N_multimapping 0 0 0

N_noFeature 53548099 53570221 53548309

N_ambiguous 1 0 0

ENSMUSG00000079800.2 0 0 1

ENSMUSG00000095092.1 0 0 0

ENSMUSG00000079192.2 120 1 120

ENSMUSG00000079794.2 2 2 0

ENSMUSG00000094799.1 0 0 0

ENSMUSG00000095250.1 0 0 0

It's strange that the vast majority belong to "N_noFeature", however I feel the mapping went ok cuz ~85% reads map uniquely to the genome. This is how the Log.final.out looks:

Started job on | May 04 15:01:33

Started mapping on | May 04 15:33:32

Finished on | May 04 17:28:16

Mapping speed, Million of reads per hour | 32.99

Number of input reads | 63091479

Average input read length | 150

UNIQUE READS:

Uniquely mapped reads number | 53570432

Uniquely mapped reads % | 84.91%

Average mapped length | 149.67

Number of splices: Total | 18307132

Number of splices: Annotated (sjdb) | 12310

Number of splices: GT/AG | 18180161

Number of splices: GC/AG | 112361

Number of splices: AT/AC | 14610

Number of splices: Non-canonical | 0

Mismatch rate per base, % | 0.20%

Deletion rate per base | 0.01%

Deletion average length | 1.39

Insertion rate per base | 0.01%

Insertion average length | 1.19

MULTI-MAPPING READS:

Number of reads mapped to multiple loci | 0

% of reads mapped to multiple loci | 0.00%

Number of reads mapped to too many loci | 5338404

% of reads mapped to too many loci | 8.46%

UNMAPPED READS:

% of reads unmapped: too many mismatches | 0.09%

% of reads unmapped: too short | 6.41%

% of reads unmapped: other | 0.13%

CHIMERIC READS:

Number of chimeric reads | 0

% of chimeric reads | 0.00%

May I ask did I do something wrong?

Thank you so much! Any suggestion appreciated!

-Zhen

Alexander Dobin

unread,

May 5, 2016, 10:50:19 AM5/5/16

to rna-star

Hi Zhen,

this is strange indeed. Have you looked at the wiggle files in the browser? Does it look like most of the single does not overlap annotated exons?

I would recommend that you map a public mouse dataset to check your pipeline - you should see the majority of reads overlapping the genes, and <20% "noFeature" .

Cheers

Alex

Zhen Shi

unread,

May 12, 2016, 6:38:47 PM5/12/16

to rna-star

Hi Alex,

I figured out why: so the .fa sequence file from Ensembl and .gtf file from Gencode has different nomenclatures.

Ensembl chromosome coordinates use ‘1’ ‘2’….whereas Gencode uses ‘chr1’’chr2’

I fixed this and now the results look good.

Thanks again,

-Zhen

Reply all

Reply to author

Forward

Majority of reads belong to "N_noFeature"

Zhen Shi

Alexander Dobin

Zhen Shi