A lot of unmapped reads

Mehmet Ahsen

Jun 23, 2017, 9:59:10 AM

to rna-star

Hi All,

Thanks for the great feedback recently. I ran STAR on some human data, and the final output log shows the following:

Uniquely mapped reads number | 47514

Uniquely mapped reads % | 0.91%

Average mapped length | 115.15

Number of splices: Total | 16777

Number of splices: Annotated (sjdb) | 16510

Number of splices: GT/AG | 16671

Number of splices: GC/AG | 67

Number of splices: AT/AC | 12

Number of splices: Non-canonical | 27

Mismatch rate per base, % | 0.56%

Deletion rate per base | 0.02%

Deletion average length | 1.43

Insertion rate per base | 0.01%

Insertion average length | 1.25

MULTI-MAPPING READS:

Number of reads mapped to multiple loci | 12401

% of reads mapped to multiple loci | 0.24%

Number of reads mapped to too many loci | 208

% of reads mapped to too many loci | 0.00%

UNMAPPED READS:

% of reads unmapped: too many mismatches | 0.00%

% of reads unmapped: too short | 98.85%

% of reads unmapped: other | 0.01%

As you can see, almost all of my reads are reported as "unmapped: too short". Is there a way to improve the alignment? I am a newbie in alignment, so please be considerate in your replies and suggestions :) Thanks a lot in advance.
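For anyone comparing several runs, the summary above can be read into a dictionary and the unmapped categories added up. This is a minimal sketch, assuming the text comes from STAR's final log with `name | value` lines; `parse_final_log` and `percent` are hypothetical helper names, not part of STAR.

```python
def parse_final_log(text):
    """Return {metric name: value string} for every 'name | value' line."""
    metrics = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNMAPPED READS:' carry no value
        name, value = line.split("|", 1)
        metrics[name.strip()] = value.strip()
    return metrics


def percent(metrics, name):
    """Convert a value such as '98.85%' to the float 98.85."""
    return float(metrics[name].rstrip("%"))


# Values copied from the log in this post.
example = """
Uniquely mapped reads % | 0.91%
% of reads mapped to multiple loci | 0.24%
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 98.85%
% of reads unmapped: other | 0.01%
"""

metrics = parse_final_log(example)
unmapped = sum(
    percent(metrics, name)
    for name in metrics
    if name.startswith("% of reads unmapped")
)
print(f"total unmapped: {unmapped:.2f}%")
```

Keeping the values as strings in the dictionary and converting only on demand means count fields and percentage fields can share one parser.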

Alexander Dobin

Jun 23, 2017, 3:43:34 PM

to rna-star

Hi Mehmet,

such a catastrophically low mapping rate is most likely caused by some problem with the input FASTQ file.

Are your reads paired-end? Have you trimmed them before mapping?

Please send me the Log.out file of your run.

Cheers

Alex

Mehmet Ahsen

Jun 26, 2017, 11:02:21 AM

to rna-star

Hi Alex,

Thanks for the quick reply. I believe these are not paired-end reads. I am attaching the Log.out file for this catastrophic sample. I am also attaching another log file, named Log_other.out, for a second sample from the same experiment. Its unmapped rate is not as high, but its final log still shows the following:

Uniquely mapped reads number | 3294394

Uniquely mapped reads % | 54.46%

Average mapped length | 114.22

Number of splices: Total | 12629

Number of splices: Annotated (sjdb) | 3888

Number of splices: GT/AG | 11582

Number of splices: GC/AG | 290

Number of splices: AT/AC | 50

Number of splices: Non-canonical | 707

Mismatch rate per base, % | 0.79%

Deletion rate per base | 0.03%

Deletion average length | 1.35

Insertion rate per base | 0.01%

Insertion average length | 1.36

MULTI-MAPPING READS:

Number of reads mapped to multiple loci | 199827

% of reads mapped to multiple loci | 3.30%

Number of reads mapped to too many loci | 21770

% of reads mapped to too many loci | 0.36%

UNMAPPED READS:

% of reads unmapped: too many mismatches | 0.00%

% of reads unmapped: too short | 41.52%

% of reads unmapped: other | 0.36%

Our issue with this sample is that, although more reads are aligned, almost 90% of the aligned reads map to unannotated regions. Is there any reason why this could be?

Log.out

Log_new.out

Alexander Dobin

Jun 26, 2017, 4:36:37 PM

to rna-star

Hi Mehmet,

a couple of suggestions to check for the unannotated reads:

1. Do the "unannotated" reads map mostly in the introns or intergenic space?

2. Do they map uniformly over genome (DNA contamination) or in big "lumps"?

for unmapped reads:

1. Try to map only the first ~50 bases of the reads to see if the tails have poor quality.

2. Try to BLAST unmapped reads.
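Suggestion 1 for the unmapped reads can be tried by truncating every read before mapping. Below is a minimal sketch, assuming an uncompressed FASTQ file with four lines per record; `trim_fastq` and the file names in the usage line are hypothetical. STAR's `--clip3pNbases` option, which clips a fixed number of bases from the 3' end of each read, is an alternative that avoids rewriting the file.

```python
def trim_fastq(in_path, out_path, keep=50):
    """Write a copy of a FASTQ file with each read cut to its first `keep` bases.

    Returns the number of records written.
    """
    n_records = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            header = src.readline()
            if not header.strip():
                break  # end of file (or a trailing blank line)
            seq = src.readline().rstrip("\n")
            plus = src.readline()
            qual = src.readline().rstrip("\n")
            # Sequence and quality strings must be cut to the same length.
            dst.write(header)
            dst.write(seq[:keep] + "\n")
            dst.write(plus)
            dst.write(qual[:keep] + "\n")
            n_records += 1
    return n_records
```

A call such as `trim_fastq("reads.fastq", "reads.first50.fastq", keep=50)` produces the shortened file. If the mapping rate rises sharply on it, the read tails (low quality or adapter sequence) are the likely cause of the "too short" alignments.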

Cheers

Alex

MA1789

Jun 26, 2017, 10:51:47 PM

to rna-star

Hi Alex,

I will try these suggestions. Just to make sure, have you seen any inconsistencies in the log files?

Alexander Dobin

Jun 29, 2017, 1:17:04 PM

to rna-star

Hi Mehmet,

no, there seems to be nothing wrong with the Log files.

Cheers

Alex

MA1789

Jul 4, 2017, 5:27:21 PM

to rna-star

Hi Alex,

Based on your suggestion, I used RSeQC (do you have other suggestions for doing this?) to get the mapping distribution, and I obtained the stats below.

I am not sure how to interpret them. It looks like most reads map to introns. Could this be a DNA contamination issue?

Total Reads              3494221
Total Tags               3633930
Total Assigned Tags      2128368
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           36747178            33638               0.92
5'UTR_Exons         15904155            10135               0.64
3'UTR_Exons         38399562            66878               1.74
Introns             1266130785          1694894             1.34
TSS_up_1kb          20020387            12827               0.64
TSS_up_5kb          89078077            92740               1.04
TSS_up_10kb         162561801           174442              1.07
TES_down_1kb        20872855            17666               0.85
TES_down_5kb        88857692            82355               0.93
TES_down_10kb       157692119           148381              0.94
=====================================================================

Alexander Dobin

Jul 5, 2017, 3:43:48 PM

to rna-star

Hi Mehmet,

I have heard good things about RSeQC, but have not used it myself.

Another good package is bedtools, which has tools to calculate overlap between reads and annotations.

If I read the RSeQC output correctly, 1694894/3633930 = 47% of reads map to introns and (3633930-2128368)/3633930 = 41% map to intergenic space (Total minus Assigned tags). The high % of intergenic tags may indicate DNA contamination. You can load the BAM file into a browser (UCSC Browser or IGV) and check whether the reads cover the genome uniformly - that would indicate DNA contamination. On the other hand, if the reads make lumps around genes and "localized" unannotated loci, then it is likely not DNA contamination.
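The percentages above can be reproduced with a few lines of arithmetic. This is only a sketch with the counts copied from the RSeQC table quoted earlier in the thread; the variable names and the `pct` helper are made up for illustration.

```python
# Counts copied from the RSeQC table quoted earlier in this thread.
total_tags = 3633930
assigned_tags = 2128368
intron_tags = 1694894
exon_tags = 33638 + 10135 + 66878  # CDS + 5'UTR + 3'UTR exons


def pct(count, total=total_tags):
    """Express a tag count as a percentage of all tags."""
    return 100.0 * count / total


# Tags not assigned to any listed group are treated as intergenic here.
unassigned_tags = total_tags - assigned_tags

print(f"intronic:   {pct(intron_tags):.1f}%")
print(f"unassigned: {pct(unassigned_tags):.1f}%")
print(f"exonic:     {pct(exon_tags):.1f}%")
```

The same arithmetic puts the exonic share at roughly 3% of all tags, which is far lower than one would expect from a typical RNA-seq library.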

You can also BLAST the unmapped reads to check for contamination from other species.

In any case, this is a very strange looking RNA-seq library. Were these samples unusual in any way and what library prep was used?

Cheers

Alex

MA1789

Jul 5, 2017, 9:07:17 PM

to rna-star

Hi Alex,

We are working on a very experimental technology for liquid biopsy, so these RNAs are from urine (which might be why there are a lot of unmapped reads). That is why I want to find out whether this is a technical artifact or whether we are seeing something interesting, maybe unspliced RNA, etc.

Eren

Alexander Dobin

Jul 7, 2017, 12:15:58 PM

to rna-star

Hi Eren,

if these are "non-trivial" samples, I would strongly advise making a few browser tracks (wiggle, BAM, splice junctions) and looking at your results in the browser.

It's easier to discern interesting patterns with your eyes first, e.g. do reads accumulate near genes? do they cover introns? do you see highly expressed intergenic loci (novel RNA)? etc.

For unmapped reads, BLASTing is the first thing to do.

Cheers

Alex

MA1789

Jul 9, 2017, 11:02:24 AM

to rna-star

Hi Alex,

Thanks for the suggestions. Yes, I BLASTed 2000 random reads, and 60 percent of them got a significant hit to bacteria related to the disease we are investigating. Next I will try to visualize the alignments as you suggested and will let you know.
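A random subset like this can be drawn with reservoir sampling, so the whole file never has to fit in memory. The following is a minimal sketch, assuming an uncompressed four-line FASTQ of unmapped reads (for instance the `Unmapped.out.mate1` file that STAR writes when run with `--outReadsUnmapped Fastx`); `sample_fastq_to_fasta` is a hypothetical helper, not a STAR or BLAST tool.

```python
import random


def sample_fastq_to_fasta(fastq_path, fasta_path, n=2000, seed=0):
    """Pick up to n reads uniformly at random and write them as FASTA.

    Returns the number of reads written.
    """
    rng = random.Random(seed)
    reservoir = []
    index = 0  # zero-based position of the current record
    with open(fastq_path) as fh:
        while True:
            header = fh.readline()
            if not header.strip():
                break  # end of file (or a trailing blank line)
            seq = fh.readline().strip()
            fh.readline()  # '+' separator line
            fh.readline()  # quality line, not needed for BLAST
            record = (header.strip()[1:], seq)  # drop the leading '@'
            if len(reservoir) < n:
                reservoir.append(record)
            else:
                # Keep this record with probability n / (index + 1).
                j = rng.randint(0, index)
                if j < n:
                    reservoir[j] = record
            index += 1
    with open(fasta_path, "w") as out:
        for name, seq in reservoir:
            out.write(f">{name}\n{seq}\n")
    return len(reservoir)
```

Fixing the seed makes the subset reproducible, which helps when the BLAST results need to be revisited later.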

Thanks

Eren

MA1789

Jul 11, 2017, 6:08:11 PM

to rna-star

Hi Alex,

I am attaching some IGV screenshots. It seems that the reads accumulate in certain regions. The screenshots show examples of intergenic, intronic, and exonic alignments. What is your interpretation of these? Is this expected?

snapshot_exon.png

snapshot_intron.png

snapshot_intergenic.png

Alexander Dobin

Jul 12, 2017, 1:01:08 PM

to rna-star

Hi Eren,

this does not look at all like RNA-seq of normal cell cultures or tissues, where you should see exons covered with reads.

As always, there are two options - it could be a very interesting biological finding, or it could be an artifact of RNA extraction/library prep. :)

If you have multiple samples, I would try to see if there is consistency between these transcribed loci.

You may also want to summarize the data over the whole genome. It seems like you would need to call "transfrags" or "TARs", i.e. contiguous transcribed regions, and quantify them with the number of overlapping reads. Then you can calculate how many of the transfrags are exonic/intronic/intergenic, how many are close to the annotated TSS/TTS, etc.
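The transfrag idea can be sketched as interval merging: sort the aligned read intervals, join those that overlap or lie within a small gap of each other, and count the reads in each merged region. This is a simplified illustration and not an established transfrag caller; `call_transfrags`, `max_gap`, and `min_reads` are hypothetical names, and spliced reads would first have to be split into their aligned blocks.

```python
from collections import defaultdict


def call_transfrags(reads, max_gap=0, min_reads=2):
    """Merge read intervals into contiguous regions.

    reads: iterable of (chrom, start, end), 0-based half-open coordinates.
    Returns a list of (chrom, start, end, read_count), sorted by position.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in reads:
        by_chrom[chrom].append((start, end))

    transfrags = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        cur_start, cur_end = intervals[0]
        count = 1
        for start, end in intervals[1:]:
            if start <= cur_end + max_gap:
                # Read touches or overlaps the current region: extend it.
                cur_end = max(cur_end, end)
                count += 1
            else:
                # Gap is too large: close the region and start a new one.
                if count >= min_reads:
                    transfrags.append((chrom, cur_start, cur_end, count))
                cur_start, cur_end, count = start, end, 1
        if count >= min_reads:
            transfrags.append((chrom, cur_start, cur_end, count))
    return transfrags


# Toy coordinates, invented for illustration only.
toy_reads = [
    ("chr1", 100, 225),
    ("chr1", 150, 275),
    ("chr1", 5000, 5125),
    ("chr2", 10, 135),
    ("chr2", 120, 245),
]
print(call_transfrags(toy_reads))
```

The `min_reads` threshold drops regions supported by a single read, so isolated alignments do not count as transfrags. The resulting regions can then be intersected with the annotation to classify them as exonic, intronic, or intergenic.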

Of course, some kind of alternative validation would have to be done, such as Northern, qPCR etc.

Cheers

Alex

MA1789

Jul 25, 2017, 12:34:33 PM

to rna-star

Hi Alex,

Thanks for the great suggestion. If the model I am studying carries mostly microRNAs, do you think it makes sense to see such repeated reads? Is it possible to quantify what percentage of the reads mapped to microRNAs using STAR?

Thanks

Eren

Alexander Dobin

Jul 26, 2017, 5:31:17 PM

to rna-star

Hi Eren

mature miRNAs are ~22nt long, and they normally do not get sequenced by the long RNA-seq protocols.

The "contigs" you are seeing are much longer. In principle, they could be precursors, but unless they overlap known miRNAs, you would need small RNA-seq data to prove that.
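Whether the contigs overlap known miRNAs can be checked with a simple interval comparison against an annotation. Here is a minimal sketch, assuming the read and miRNA coordinates have already been extracted as (chrom, start, end) tuples, for example from the BAM file and a miRNA annotation file; `fraction_overlapping` and the toy coordinates are hypothetical.

```python
def fraction_overlapping(reads, features):
    """Fraction of reads that overlap at least one feature.

    Both arguments are lists of (chrom, start, end) with 0-based half-open
    coordinates. Brute force, which is fine for a quick check on modest inputs.
    """
    if not reads:
        return 0.0
    by_chrom = {}
    for chrom, start, end in features:
        by_chrom.setdefault(chrom, []).append((start, end))
    hits = 0
    for chrom, start, end in reads:
        candidates = by_chrom.get(chrom, [])
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < f_end and f_start < end for f_start, f_end in candidates):
            hits += 1
    return hits / len(reads)


# Toy coordinates, invented for illustration only.
mirna_loci = [("chr1", 1000, 1080)]
toy_reads = [
    ("chr1", 950, 1075),
    ("chr1", 3000, 3125),
    ("chr2", 1000, 1125),
    ("chr1", 1079, 1200),
]
print(fraction_overlapping(toy_reads, mirna_loci))
```

For whole-genome data, the bedtools package mentioned earlier in the thread does the same kind of overlap calculation far more efficiently.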