A lot of unmapped reads

Mehmet Ahsen

Jun 23, 2017, 9:59:10 AM
to rna-star
Hi All,

Thanks for the great feedback recently. I ran STAR on some human data, and the output log shows the following:


                   Uniquely mapped reads number |       47514
                        Uniquely mapped reads % |       0.91%
                          Average mapped length |       115.15
                       Number of splices: Total |       16777
            Number of splices: Annotated (sjdb) |       16510
                       Number of splices: GT/AG |       16671
                       Number of splices: GC/AG |       67
                       Number of splices: AT/AC |       12
               Number of splices: Non-canonical |       27
                      Mismatch rate per base, % |       0.56%
                         Deletion rate per base |       0.02%
                        Deletion average length |       1.43
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.25
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       12401
             % of reads mapped to multiple loci |       0.24%
        Number of reads mapped to too many loci |       208
             % of reads mapped to too many loci |       0.00%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       98.85%
                     % of reads unmapped: other |       0.01%



As you can see, almost all of my reads are unmapped as "too short". Is there a way to improve the alignment? I am a newbie in alignment, so please be considerate in your replies and suggestions :) Thanks a lot in advance.
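
For readers who want to compare several runs, the percentages in a log like the one above can be pulled out programmatically. The following is a minimal Python sketch, assuming the log consists of `key | value` lines as shown; the function names are hypothetical and not part of STAR. As far as I understand STAR's terminology, "too short" means that the aligned portion of a read was too short relative to the read length, not that the input reads themselves were short.

```python
def parse_star_final_log(text):
    """Parse 'key | value' lines of a STAR final log into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNMAPPED READS:' carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Return a percentage field such as '98.85%' as a float."""
    return float(stats[key].rstrip("%"))


def summarize(stats):
    """Collect the main mapped/unmapped percentages into one small report."""
    return {
        "unique": percent(stats, "Uniquely mapped reads %"),
        "multi": percent(stats, "% of reads mapped to multiple loci"),
        "too_short": percent(stats, "% of reads unmapped: too short"),
    }


# Three lines copied from the log in this post.
log_text = """
                        Uniquely mapped reads % |       0.91%
             % of reads mapped to multiple loci |       0.24%
                 % of reads unmapped: too short |       98.85%
"""
report = summarize(parse_star_final_log(log_text))
print(report)
```

Running the same function over the logs of every sample in an experiment makes it easy to spot which libraries share the problem.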

Alexander Dobin

Jun 23, 2017, 3:43:34 PM
to rna-star
Hi Mehmet,

such a catastrophically low mapping rate is most likely caused by a problem with the input FASTQ file.
Are your reads paired-end? Have you trimmed them before mapping?
Please send me the Log.out file of your run.

Cheers
Alex

Mehmet Ahsen

Jun 26, 2017, 11:02:21 AM
to rna-star
Hi Alex,

Thanks for the quick reply. I believe these are not paired-end reads. I am attaching the Log.out file for this catastrophic sample. I am also attaching another log file, named Log_other.out, from the same experiment. Its unmapped rate is not as high, but the final log still shows issues:

                   Uniquely mapped reads number |       3294394
                        Uniquely mapped reads % |       54.46%
                          Average mapped length |       114.22
                       Number of splices: Total |       12629
            Number of splices: Annotated (sjdb) |       3888
                       Number of splices: GT/AG |       11582
                       Number of splices: GC/AG |       290
                       Number of splices: AT/AC |       50
               Number of splices: Non-canonical |       707
                      Mismatch rate per base, % |       0.79%
                         Deletion rate per base |       0.03%
                        Deletion average length |       1.35
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.36
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       199827
             % of reads mapped to multiple loci |       3.30%
        Number of reads mapped to too many loci |       21770
             % of reads mapped to too many loci |       0.36%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       41.52%
                     % of reads unmapped: other |       0.36%



Our issue with this sample is that, although more reads are aligned, almost all (about 90%) of the aligned reads map to unannotated regions. Is there any reason why this could be?

Log.out
Log_new.out

Alexander Dobin

Jun 26, 2017, 4:36:37 PM
to rna-star
Hi Mehmet,

a couple of things to check for the unannotated reads:
1. Do the "unannotated" reads map mostly in the introns or intergenic space?
2. Do they map uniformly over the genome (DNA contamination) or in big "lumps"?
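
One rough way to put a number on "uniform" versus "lumpy" is to bin read start positions into fixed genomic windows and ask what share of all reads falls into the most covered windows. The Python sketch below is only an illustration under stated assumptions: the window size, the 1% cutoff and all function names are hypothetical choices, positions are read from SAM text (for example the output of `samtools view`), and windows with zero reads are not counted.

```python
from collections import Counter


def positions_from_sam(lines):
    """Yield (chrom, start) from SAM text lines, skipping headers and unmapped reads."""
    for line in lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.split("\t")
        if len(fields) < 4 or fields[2] == "*":
            continue  # malformed line or unmapped read
        yield fields[2], int(fields[3])  # RNAME and 1-based POS columns


def window_counts(positions, window=100_000):
    """Count reads per (chromosome, window index)."""
    counts = Counter()
    for chrom, start in positions:
        counts[(chrom, start // window)] += 1
    return counts


def top_window_share(counts, top_fraction=0.01):
    """Fraction of all reads that fall in the most covered occupied windows."""
    ordered = sorted(counts.values(), reverse=True)
    n_top = max(1, int(len(ordered) * top_fraction))
    return sum(ordered[:n_top]) / sum(ordered)


# Toy data: evenly spread reads versus 90% of reads piled into one window.
uniform = [("chr1", i * 1_000) for i in range(1_000)]
lumpy = [("chr1", 5_000)] * 900 + [("chr1", i * 10_000) for i in range(1, 101)]
print(top_window_share(window_counts(uniform, 10_000)))
print(top_window_share(window_counts(lumpy, 10_000)))
```

A share close to the chosen `top_fraction` suggests even coverage, while a share near 1 suggests that a few loci dominate. Looking at the alignments in a genome browser remains the more direct check.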

for unmapped reads:
1. Try mapping only the first ~50 bases of the reads to see if the tails have poor quality.
2. Try to BLAST unmapped reads.
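
The first suggestion can be tried by writing a truncated copy of the FASTQ file and mapping that instead. This is a minimal Python sketch, assuming uncompressed FASTQ with exactly four lines per record; `trim_fastq` is a hypothetical helper, not a STAR feature. STAR also has its own clipping parameters (for example `--clip3pNbases`, which removes a fixed number of bases from the 3' end), so it is worth checking the manual before preprocessing files by hand.

```python
import io


def trim_fastq(in_handle, out_handle, keep=50):
    """Copy 4-line FASTQ records, cutting sequence and qualities to `keep` bases."""
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq = in_handle.readline().rstrip("\n")
        plus = in_handle.readline()
        qual = in_handle.readline().rstrip("\n")
        out_handle.write(header)
        out_handle.write(seq[:keep] + "\n")
        out_handle.write(plus)
        out_handle.write(qual[:keep] + "\n")


# Toy example with one 8-base read trimmed to 4 bases.
src = io.StringIO("@r1\nACGTACGT\n+\nIIIIHHHH\n")
dst = io.StringIO()
trim_fastq(src, dst, keep=4)
trimmed = dst.getvalue()
print(trimmed, end="")
```

For real data the two handles would be opened files, e.g. `open("reads.fastq")` and `open("reads.50b.fastq", "w")` (hypothetical file names).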

Cheers
Alex

MA1789

Jun 26, 2017, 10:51:47 PM
to rna-star
Hi Alex,

I will try these suggestions. Just to make sure: have you seen any inconsistencies in the log files?

Alexander Dobin

Jun 29, 2017, 1:17:04 PM
to rna-star
Hi Mehmet,

no, there seems to be nothing wrong with the Log files.

Cheers
Alex

MA1789

Jul 4, 2017, 5:27:21 PM
to rna-star
Hi Alex,

Based on your suggestion, I used the program RSeQC to get the mapping distribution (do you have other suggestions for doing this?) and obtained the following stats.
I am not sure how to interpret them. It looks like most reads map to introns. Could this be a DNA contamination issue?

Total Reads                   3494221
Total Tags                    3633930
Total Assigned Tags           2128368
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           36747178            33638               0.92
5'UTR_Exons         15904155            10135               0.64
3'UTR_Exons         38399562            66878               1.74
Introns             1266130785          1694894             1.34
TSS_up_1kb          20020387            12827               0.64
TSS_up_5kb          89078077            92740               1.04
TSS_up_10kb         162561801           174442              1.07
TES_down_1kb        20872855            17666               0.85
TES_down_5kb        88857692            82355               0.93
TES_down_10kb       157692119           148381              0.94
=====================================================================

Alexander Dobin

Jul 5, 2017, 3:43:48 PM
to rna-star
Hi Mehmet,

I have heard good things about RSeQC, but have not used it myself.
Another good package is bedtools, which has tools to calculate overlap between reads and annotations.
If I read the RSeQC output correctly, 1694894/3633930=47% of reads map to the introns and (3633930-2128368)/3633930=41% map to the intergenic space (Total - Assigned tags). The high % of intergenic tags may indicate DNA contamination. You can load the BAM file into a browser (UCSC Browser or IGV) and check whether the reads cover the genome uniformly - that would indicate DNA contamination. On the other hand, if the reads make lumps around genes and "localized" unannotated loci - then it's likely not DNA contamination.
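
The arithmetic above can be reproduced from the tag counts in the RSeQC table. The Python sketch below uses only the numbers posted in this thread; the variable and function names are hypothetical. One detail visible in the posted numbers: the CDS, UTR, intron, TSS_up_10kb and TES_down_10kb rows add up exactly to "Total Assigned Tags", so the unassigned 41% appears to count only tags lying more than 10 kb away from annotated genes.

```python
# Tag counts copied from the RSeQC output posted in this thread.
total_tags = 3633930
assigned_tags = 2128368
tags = {
    "CDS_Exons": 33638,
    "5'UTR_Exons": 10135,
    "3'UTR_Exons": 66878,
    "Introns": 1694894,
    "TSS_up_10kb": 174442,
    "TES_down_10kb": 148381,
}


def fraction(count, total=total_tags):
    """Share of all tags represented by `count`."""
    return count / total


exonic = fraction(tags["CDS_Exons"] + tags["5'UTR_Exons"] + tags["3'UTR_Exons"])
intronic = fraction(tags["Introns"])
unassigned = fraction(total_tags - assigned_tags)

print(f"exonic:     {exonic:.1%}")
print(f"intronic:   {intronic:.1%}")
print(f"unassigned: {unassigned:.1%}")
print("rows sum to assigned total:", sum(tags.values()) == assigned_tags)
```

The exonic share is the figure that stands out for an RNA-seq library: it is far smaller than either the intronic or the unassigned share.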

You can also BLAST the unmapped reads to check for contamination from other species.
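
BLASTing every unmapped read is rarely practical, so a random subset is usually enough for a first look. The following Python sketch draws a fixed-size random sample from a FASTQ stream by reservoir sampling and formats it as FASTA, which BLAST accepts. It assumes four-line FASTQ records; the function name and the seed are hypothetical choices. STAR can write the unmapped reads to a separate file with `--outReadsUnmapped Fastx`, which would be a natural input here.

```python
import io
import random


def sample_fastq_to_fasta(in_handle, n, seed=0):
    """Reservoir-sample up to n reads from FASTQ and return them as FASTA text."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq = in_handle.readline().strip()
        in_handle.readline()  # '+' separator line
        in_handle.readline()  # quality line, not needed for FASTA
        seen += 1
        record = (header[1:].strip(), seq)
        if len(reservoir) < n:
            reservoir.append(record)
        else:
            # Keep the new read with probability n / seen.
            j = rng.randrange(seen)
            if j < n:
                reservoir[j] = record
    return "".join(f">{name}\n{seq}\n" for name, seq in reservoir)


# Toy example: pick 3 of 10 reads.
fastq = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(10))
fasta = sample_fastq_to_fasta(io.StringIO(fastq), n=3)
print(fasta, end="")
```

Because the reads are streamed one at a time, the whole file never has to fit in memory.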

In any case, this is a very strange looking RNA-seq library. Were these samples unusual in any way and what library prep was used?

Cheers
Alex

MA1789

Jul 5, 2017, 9:07:17 PM
to rna-star
Hi Alex,

We are working on a very experimental technology for liquid biopsy, so these RNAs are from urine (which might be why there are a lot of unmapped reads). That is why I want to find out whether this is a technical artifact or whether we are seeing something interesting, maybe unspliced RNA etc.

Eren

Alexander Dobin

Jul 7, 2017, 12:15:58 PM
to rna-star
Hi Eren,

if these are "non-trivial" samples, I would strongly advise making a few browser tracks (wiggle, BAM, splice junctions) and looking at your results in the browser.
It's easier to discern interesting patterns with your eyes first, e.g. do reads accumulate near genes? do they cover introns? do you see highly expressed intergenic loci (novel RNA)? etc.

For unmapped reads, BLASTing is the first thing to do.

Cheers
Alex

MA1789

Jul 9, 2017, 11:02:24 AM
to rna-star
Hi Alex,

Thanks for the suggestions. Yes, I did BLAST 2000 random reads, and 60 percent of them got a significant hit to bacteria related to the disease we are investigating. Next I will try to visualize the alignments as you suggested and will let you know.

Thanks

Eren

MA1789

Jul 11, 2017, 6:08:11 PM
to rna-star
Hi Alex,

I am attaching some IGV screenshots showing intergenic, intronic and exonic alignments. It seems that reads accumulate in some regions. What is your interpretation of these? Is this expected?
snapshot_exon.png
snapshot_intron.png
snapshot_intergenic.png

Alexander Dobin

Jul 12, 2017, 1:01:08 PM
to rna-star
Hi Eren,

this does not look at all like RNA-seq of normal cell cultures or tissues, where you should see exons covered with reads.
As always, there are two options - it could be a very interesting biological finding, or it could be an artifact of RNA extraction/library prep. :)
If you have multiple samples, I would check whether these transcribed loci are consistent between them.
You may also want to summarize the data over the whole genome. It seems like you would need to call "transfrags" or "TARs", i.e. contiguous transcribed regions, and quantify them with the number of overlapping reads. Then you can calculate how many of the transfrags are exonic/intronic/intergenic, how many are close to the annotated TSS/TTS etc.
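
The transfrag idea can be sketched as a simple interval merge: sort the read intervals, join any that overlap or lie within a small gap of each other, and count the reads per merged region. The Python below is a minimal illustration, not an established transfrag caller. It assumes each read is given as one `(chrom, start, end)` interval with half-open coordinates, so spliced reads are not handled, and `max_gap` is a hypothetical tuning parameter.

```python
def call_transfrags(reads, max_gap=0):
    """Merge (chrom, start, end) read intervals into contiguous regions.

    Returns a list of (chrom, start, end, read_count) tuples.
    """
    transfrags = []
    for chrom, start, end in sorted(reads):
        last = transfrags[-1] if transfrags else None
        if last and last[0] == chrom and start <= last[2] + max_gap:
            # Read overlaps (or nearly touches) the current region: extend it.
            last[2] = max(last[2], end)
            last[3] += 1
        else:
            # Start a new region on this chromosome.
            transfrags.append([chrom, start, end, 1])
    return [tuple(t) for t in transfrags]


# Toy example: two overlapping reads, one isolated read, one read on chr2.
reads = [
    ("chr1", 150, 250),
    ("chr1", 100, 200),
    ("chr1", 400, 500),
    ("chr2", 100, 200),
]
regions = call_transfrags(reads)
for region in regions:
    print(region)
```

The resulting regions could then be intersected with exon, intron and intergenic annotations (for example with bedtools, mentioned earlier in the thread) to classify each transfrag.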

Of course, some kind of alternative validation would have to be done, such as Northern, qPCR etc.

Cheers
Alex

MA1789

Jul 25, 2017, 12:34:33 PM
to rna-star
Hi Alex,

Thanks for the great suggestion. If the model I am studying carries mostly microRNAs, do you think it makes sense to see such repeated reads? Is it possible to quantify what percent of the reads mapped to microRNAs using STAR?

Thanks
Eren

Alexander Dobin

Jul 26, 2017, 5:31:17 PM
to rna-star
Hi Eren

mature miRNAs are ~22 nt long, and they normally do not get sequenced by the long RNA-seq protocols.
The "contigs" you are seeing are much longer. In principle, they could be precursors, but - unless they overlap known miRNAs - you would need small RNA-seq data to prove that.

Cheers
Alex