High % of reads unmapped: too short. Could it be caused by short insert size in library?

Tom Harrop

Jun 12, 2015, 12:46:10 PM
to rna-...@googlegroups.com
Hi everyone,

I am getting a high percentage of "reads unmapped: too short" (25–35%) when mapping some paired-end Illumina reads.

The reads are 2x125 bp so ~110 bp each after trimming etc.

Here is how I used STAR:
STAR --runThreadN 4 --genomeDir $index_dir \
    --readFilesIn $R1 $R2 --outFileNamePrefix $path_to_out/$library_name. \
    --outSAMtype BAM Unsorted

I have also (separately) tried the parameters:
--alignSplicedMateMapLmin 50
and
--alignSplicedMateMapLminOverLmate 0.2
but this doesn't make a difference.

My mean insert size is only about 130 bp as determined by bwa-mem (and in agreement with the Bioanalyzer traces) so I expect a large amount of overlap between the read pairs. I wonder if this could be part of my problem, since STAR is reporting the read length as 225 bp?

The genome was created with:
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir $path_to_out/star/index \
    --genomeFastaFiles $fasta --sjdbGTFfile $gff \
    --sjdbGTFtagExonParentTranscript Parent --sjdbOverhang 109

I am using STAR 2.4.1d from github:
STAR_2.4.1d_modified

I would be grateful for any suggestions. FWIW I get a high percentage (~95%) of concordant mapping with tophat2 for the same libraries, but I don't want to use tophat2.

Thanks for reading,

Tom Harrop
IRD, Montpellier, France.

Alexander Dobin

Jun 12, 2015, 6:09:08 PM
to rna-...@googlegroups.com, twha...@gmail.com
Hi Tom,

the "input read length" of 225b is the mean read length, which agrees with 2x ~110b after trimming.
Could you send me the Log.final.out file - it contains useful mapping statistics?

Cheers
Alex

Tom Harrop

Jun 13, 2015, 1:19:54 PM
to rna-...@googlegroups.com, twha...@gmail.com
Hi Alex,

Thanks for the reply. I'm pasting a Log.final.out file at the end of this post.

It makes sense that STAR's input read length is the sum of the lengths of R1 and R2. The reason I brought it up is that I wonder whether a library with a short insert size and relatively long reads might cause problems for certain calculations. For example, take a 130 bp fragment with ~112 b sequenced from each end. If the 'alignSplicedMateMapLminOverLmate' parameter, which defaults to 0.6, uses 225 as the Lmate, the cutoff would be 0.6 × 225 b = 135 b. Would STAR then reject both of the reads even if they map perfectly?

Thanks again,

Tom

                          Number of input reads | 40059196
                      Average input read length | 225
                                    UNIQUE READS:
                   Uniquely mapped reads number | 27299013
                        Uniquely mapped reads % | 68.15%
                          Average mapped length | 228.08
                       Number of splices: Total | 14823607
            Number of splices: Annotated (sjdb) | 13537993
                       Number of splices: GT/AG | 14519438
                       Number of splices: GC/AG | 164417
                       Number of splices: AT/AC | 18779
               Number of splices: Non-canonical | 120973
                      Mismatch rate per base, % | 0.50%
                         Deletion rate per base | 0.04%
                        Deletion average length | 2.08
                        Insertion rate per base | 0.04%
                       Insertion average length | 1.48
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci | 3062416
             % of reads mapped to multiple loci | 7.64%
        Number of reads mapped to too many loci | 156976
             % of reads mapped to too many loci | 0.39%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches | 0.00%
                 % of reads unmapped: too short | 23.73%
                     % of reads unmapped: other | 0.09%
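
For comparing runs, the "key | value" layout of a Log.final.out such as the one above can be read into a dictionary. The following is only a minimal sketch: the function names `parse_log_final` and `percent` are made up for illustration, and the parsing assumes every statistic sits on one line with a single `|` separator.

```python
def parse_log_final(text):
    """Parse 'key | value' lines from a STAR Log.final.out into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as "UNIQUE READS:" carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Return a percentage field such as '23.73%' as a float."""
    return float(stats[key].rstrip("%"))


# The percentage lines from the log posted above.
log_text = """
                        Uniquely mapped reads % | 68.15%
             % of reads mapped to multiple loci | 7.64%
             % of reads mapped to too many loci | 0.39%
       % of reads unmapped: too many mismatches | 0.00%
                 % of reads unmapped: too short | 23.73%
                     % of reads unmapped: other | 0.09%
"""
stats = parse_log_final(log_text)
too_short = percent(stats, "% of reads unmapped: too short")
total = sum(percent(stats, key) for key in stats)
print(f"unmapped, too short: {too_short}%")
print(f"sum of all categories: {total:.2f}%")
```

Summing the categories is a quick sanity check that no class of reads has been overlooked when comparing two logs.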

Alexander Dobin

Jun 17, 2015, 4:07:40 PM
to rna-...@googlegroups.com, twha...@gmail.com
Hi Tom,

unlike most other parameters, --alignSplicedMateMapLminOverLmate deals with each of the mates separately. So the limit will be 0.6*112=67b, which should be OK for most reads.
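
The arithmetic behind the two readings of the parameter can be sketched as follows. This is only an illustration: the factor 0.6 and the lengths 112 and 225 are the values quoted in this thread (check the manual of your STAR version for the actual default), and `min_spliced_mate_length` is a hypothetical helper, not a STAR function.

```python
def min_spliced_mate_length(mate_length, over_lmate=0.6):
    """Minimum mapped length implied by a fraction of the mate length."""
    return over_lmate * mate_length


# Per-mate interpretation (what STAR does, according to the reply above).
per_mate = min_spliced_mate_length(112)
# Combined-length interpretation (the worry raised earlier in the thread).
combined = min_spliced_mate_length(225)

print(f"per-mate cutoff: {per_mate:.0f} b")
print(f"cutoff if both mates were summed: {combined:.0f} b")
```

With the per-mate cutoff of about 67 b, a perfectly mapped ~112 b mate passes comfortably, whereas a 135 b cutoff could never be met by a single mate.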

The Log.final.out looks normal, with low mismatch rate (0.5%), low indel rates, not very high multi-mapping rate, reasonable number of splices.
Generally, STAR maps more reads than TopHat, so it seems I will need to have a closer look at your data. Could you privately share 100,000 reads (a representative subset, i.e. with the same mapping rates as the full set), as well as the link to the genome fasta/gff?

Cheers
Alex

Павел Заякин

Jun 25, 2015, 10:18:24 AM
to rna-...@googlegroups.com
Hi Alex,

I probably have a very similar problem with my data from Ion Torrent Proton.
My Log.final.out file:

                                 Started job on | Jun 19 15:22:23
                             Started mapping on | Jun 19 15:28:32
                                    Finished on | Jun 19 15:31:06
       Mapping speed, Million of reads per hour | 160.42

                          Number of input reads | 6862574
                      Average input read length | 116
                                    UNIQUE READS:
                   Uniquely mapped reads number | 1385770
                        Uniquely mapped reads % | 20.19%
                          Average mapped length | 91.40
                       Number of splices: Total | 561239
            Number of splices: Annotated (sjdb) | 519213
                       Number of splices: GT/AG | 510368
                       Number of splices: GC/AG | 14875
                       Number of splices: AT/AC | 333
               Number of splices: Non-canonical | 35663
                      Mismatch rate per base, % | 0.90%
                         Deletion rate per base | 0.27%
                        Deletion average length | 1.13
                        Insertion rate per base | 0.28%
                       Insertion average length | 1.13
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci | 1003532
             % of reads mapped to multiple loci | 14.62%
        Number of reads mapped to too many loci | 2669
             % of reads mapped to too many loci | 0.04%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches | 0.04%
                 % of reads unmapped: too short | 64.24%
                     % of reads unmapped: other | 0.87%

Could you please comment on this case?
Thank you for help,
Pawel 

Alexander Dobin

Jun 25, 2015, 12:46:49 PM
to rna-...@googlegroups.com, pa...@biomed.lu.lv
Hi Pawel,

at a first glance it looks like a different problem. In Tom's case the reads were paired-end and the mates overlapped strongly. 
In your case, it looks like the quality of the read ends is bad, since ~25 bases have to be trimmed for the reads that could be mapped.
I would recommend relaxing the requirement on the mapped length:
--outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 40
This will allow alignments with 40 or more matched bases.
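
To see why these three options change the outcome, here is a simplified model of the length filters. It is a sketch under stated assumptions, not STAR's actual code: the function `passes_length_filters` is hypothetical, the 0.66 defaults are what I recall from the STAR manual, and the example alignment (60 matched bases, score 58) is invented for a 116 b read like the average read in the log above.

```python
def passes_length_filters(matched_bases, score, read_length,
                          match_n_min=0, match_n_min_over_lread=0.66,
                          score_min=0, score_min_over_lread=0.66):
    """Simplified model: an alignment must reach both the absolute and the
    read-length-normalized minimum for matched bases and for score."""
    min_matched = max(match_n_min, match_n_min_over_lread * read_length)
    min_score = max(score_min, score_min_over_lread * read_length)
    return matched_bases >= min_matched and score >= min_score


read_length = 116  # average input read length from the log above

# Hypothetical alignment: 60 matched bases, score 58.
default_ok = passes_length_filters(60, 58, read_length)
relaxed_ok = passes_length_filters(60, 58, read_length,
                                   match_n_min=40,
                                   match_n_min_over_lread=0,
                                   score_min_over_lread=0)
# A 35-base alignment still fails the relaxed settings.
short_ok = passes_length_filters(35, 34, read_length,
                                 match_n_min=40,
                                 match_n_min_over_lread=0,
                                 score_min_over_lread=0)

print(default_ok, relaxed_ok, short_ok)
```

Under the assumed defaults a 116 b read needs roughly 77 matched bases, so the normalized thresholds are what push partially mappable reads into "too short"; setting them to 0 leaves only the absolute floor of 40 bases.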

Cheers
Alex

Павел Заякин

Jun 25, 2015, 8:26:41 PM
to rna-...@googlegroups.com, pa...@biomed.lu.lv
Hi Alex!

Thank you for the recommendations!
Now it looks much better. :)

                          Number of input reads | 6862574
                      Average input read length | 116
                                    UNIQUE READS:
                   Uniquely mapped reads number | 3220024
                        Uniquely mapped reads % | 46.92%
                          Average mapped length | 80.48
                       Number of splices: Total | 1065705
            Number of splices: Annotated (sjdb) | 973230
                       Number of splices: GT/AG | 958682
                       Number of splices: GC/AG | 31690
                       Number of splices: AT/AC | 689
               Number of splices: Non-canonical | 74644
                      Mismatch rate per base, % | 1.34%
                         Deletion rate per base | 0.29%
                        Deletion average length | 1.16
                        Insertion rate per base | 0.32%
                       Insertion average length | 1.19
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci | 1827221
             % of reads mapped to multiple loci | 26.63%
        Number of reads mapped to too many loci | 8583
             % of reads mapped to too many loci | 0.13%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches | 0.27%
                 % of reads unmapped: too short | 24.63%
                     % of reads unmapped: other | 1.43%


Thank you for help,
Pawel.





Tom Harrop

Jul 1, 2015, 6:07:19 AM
to rna-...@googlegroups.com
Hi again,

Just posting in case someone else has this issue. Alex's advice solved my problem: he noticed that it was caused by clipping at the 5' end of the reads. Because of the strong overlap of R1 and R2 in my libraries, this resulted in a "3' overhang", which causes STAR to (correctly) reject the alignments, as he explained:

I think the reason that STAR reports fewer paired reads aligned is in the definition of the "proper" pair.
STAR requires that the start of the mate on the positive strand is smaller than the start of the other mate.
This requirement comes from a simple view of the sequencing process that starts on the opposite ends of an insert, and should be true even if the insert size is smaller than the read length.
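
The geometry behind this requirement can be sketched in a few lines. This is an illustration only: the 130 bp fragment and 125 bp reads come from this thread, while the helper names, the 0-based half-open coordinates, and the assumption that any 3' adapter read-through has already been removed are mine.

```python
def mate_intervals(fragment_length, read_length, clip5=0):
    """Return half-open intervals (plus_mate, minus_mate) on a fragment that
    starts at coordinate 0, after removing clip5 bases from each 5' end."""
    plus_end = min(read_length, fragment_length)
    minus_start = max(0, fragment_length - read_length)
    plus = (clip5, plus_end)                       # 5' end of + mate is on the left
    minus = (minus_start, fragment_length - clip5)  # 5' end of - mate is on the right
    return plus, minus


def protrusion(plus, minus):
    """Bases by which the + strand mate starts downstream of the - strand mate."""
    return max(0, plus[0] - minus[0])


unclipped = mate_intervals(130, 125)
clipped = mate_intervals(130, 125, clip5=9)

print("no 5' clipping:", unclipped, "protrusion", protrusion(*unclipped))
print("9 b 5' clipping:", clipped, "protrusion", protrusion(*clipped))
```

Without clipping the + strand mate starts at 0 and the other mate at 5, so the pair satisfies the rule. After removing 9 bases from each 5' end the + strand mate starts at 9, downstream of the other mate's start at 5, which is the configuration that gets rejected.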

We had clipped ~9 bases at the 5' of each read. I just ran STAR with reads that had not been clipped at the 5' end:

                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches | 0.00%
                 % of reads unmapped: too short | 8.72%
                     % of reads unmapped: other | 0.01%



So this solved my problem.

Thanks again for the help.

Tom

Scott Youlten

Jul 8, 2015, 10:15:50 AM
to rna-...@googlegroups.com
Hi Tom

Thanks very much for posting a follow-up. I think it would make Alex's job much easier if everyone did the same every time he answered a question haha (which is staggeringly often, thanks Alex!).

I was wondering at what point exactly you were clipping your reads? Was this happening during trimming (I'm currently using Trim Galore to do this), or is this a parameter in STAR that I am daftly overlooking? I am asking because I seem to be encountering a similar problem to the one that you and Alex have just solved (see below). Any advice you could give would be greatly appreciated. Thanks!!

                                 Started job on | Jul 08 09:39:08
                             Started mapping on | Jul 08 09:39:29
                                    Finished on | Jul 08 09:47:45
       Mapping speed, Million of reads per hour | 146.23

                          Number of input reads | 20147145
                      Average input read length | 243
                                    UNIQUE READS:
                   Uniquely mapped reads number | 14363238
                        Uniquely mapped reads % | 71.29%
                          Average mapped length | 242.82
                       Number of splices: Total | 6402912
            Number of splices: Annotated (sjdb) | 6334140
                       Number of splices: GT/AG | 6365155
                       Number of splices: GC/AG | 32462
                       Number of splices: AT/AC | 3234
               Number of splices: Non-canonical | 2061
                      Mismatch rate per base, % | 0.14%
                         Deletion rate per base | 0.01%
                        Deletion average length | 1.79
                        Insertion rate per base | 0.01%
                       Insertion average length | 1.57
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci | 736002
             % of reads mapped to multiple loci | 3.65%
        Number of reads mapped to too many loci | 9943
             % of reads mapped to too many loci | 0.05%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches | 0.00%
                 % of reads unmapped: too short | 24.77%
                     % of reads unmapped: other | 0.23%

Tom Harrop

Jul 8, 2015, 10:38:06 AM
to rna-...@googlegroups.com
Hi Scott,

In my case the 5' clipping was done at the same time as the adaptor trimming using cutadapt. We were experimenting with removing 9 bases from the 5' of each read to deal with the uneven 'per base sequence content' (from fastqc) at the start of each read. We didn't consider that the large overlap between R1 and R2 would cause the 3' of some reads to extend past the 5' of their pair or that this would result in alignments for these reads being rejected.

Hope this helps,

Tom

Alexander Dobin

Jul 8, 2015, 5:54:47 PM
to rna-...@googlegroups.com, twha...@gmail.com
Hi  Scott,

generally, trimming the reads should not increase the number of "unmapped - too short" alignments, but rather decrease the number of unique mappers in favor of multi-mappers.
In Tom's example trimming combined with short insert size led to an unusual alignment configuration. It should not happen very often, but of course it's always helpful to try mapping without any trimming to check how much the results change.

What parameters have you used for mapping?
The usual suspects for low mapping rates are (i) rRNA contamination, especially if you have total RNA data and have not used non-chromosomal scaffolds for the genome generation; (ii) contamination from other species; and (iii) poor sequencing quality. Cause (iii) can likely be ruled out since you have a very low mismatch rate. Causes (i) and (ii) can be checked by BLASTing a few of the unmapped reads, which you can output with the --outReadsUnmapped Fastx option.
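
BLAST takes FASTA input, so a handful of unmapped reads first have to be converted from FASTQ. The sketch below shows one way to do that; `fastq_to_fasta` is a hypothetical helper, the two demo reads are invented, and the file name `Unmapped.out.mate1` in the comment is what I believe STAR writes with this option (check your output directory).

```python
import itertools


def fastq_to_fasta(fastq_lines, max_reads=10):
    """Convert the first max_reads four-line FASTQ records to FASTA text."""
    it = iter(fastq_lines)
    records = []
    # zip(it, it, it, it) groups the stream into (header, seq, plus, qual).
    for header, seq, _plus, _qual in itertools.islice(zip(it, it, it, it), max_reads):
        name = header.strip()[1:]  # drop the leading '@'
        records.append(f">{name}\n{seq.strip()}\n")
    return "".join(records)


# Invented demo records; with real data one might instead use:
#   with open("Unmapped.out.mate1") as handle:
#       fasta = fastq_to_fasta(handle, max_reads=20)
demo = [
    "@read1 0:N\n", "ACGTACGTAC\n", "+\n", "IIIIIIIIII\n",
    "@read2 0:N\n", "TTGGCCAATT\n", "+\n", "IIIIIIIIII\n",
]
fasta = fastq_to_fasta(demo, max_reads=1)
print(fasta, end="")
```

The resulting FASTA text can be pasted into a BLAST search to see whether the reads hit rRNA or another species.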

Cheers
Alex

Hubert Rehrauer

Jul 13, 2015, 2:45:23 PM
to rna-...@googlegroups.com, twha...@gmail.com
Dear Alex

Same problem here. We rather regularly clip a few bases at both ends because we found that this improves mapping rates in general. But this is highly problematic if the insert size is shorter than the read length and the data are paired-end. In that case we get a high number of "Unmapped reads too short". I fully understand why it happens: STAR does not consider the reads properly paired, because the 3p trimming cuts a few bases from the adapter and the 5p trimming cuts away bases from the actual RNA insert.***
Can I deposit two feature requests, please?
a) Can you change the message in the Log.final.out from "too short" to "not properly paired" ??
b) Can you introduce some lag variable and let STAR accept paired alignments even if they are shifted by a few bases because of clipping?

*** Even if you do adapter trimming, there are problems if there are only a few adapter bases (1-4) and the adapter trimming does not remove them.

regards
Hubert

Alexander Dobin

Jul 13, 2015, 6:03:35 PM
to rna-...@googlegroups.com, hreh...@gmail.com, twha...@gmail.com
Hi Hubert,

have you tried mapping it without any trimming? STAR will "clip" the bases on the ends that cannot be mapped, and it should not reduce mapping rate substantially.
The 3' trimming  should not cause this kind of problem, it's the 5' trimming that caused the problem for Tom. I am not sure why trimming from the 5' is needed at all.

The too short category is broad and includes things that are not properly paired, but also other types of shortness. I guess I could try to split this category into more informative ones.
I could allow some tolerance for the read ends, but I think this kind of strange alignment is not good for any downstream processing. For example, how would you calculate its insert size?

Cheers
Alex

snt...@gmail.com

Aug 24, 2015, 11:14:15 AM
to rna-star, twha...@gmail.com
Dear Alex,

I too regularly face the problem that Hubert writes about: paired-end sequencing libraries with small fragments for which the 'left' and 'right' reads of a large fraction of mates overlap to a high degree such that often after adapter and low-quality sequence removal, the filtered 'right' reads end up being to the 'left' of the left reads. I believe that these reads, even when aligning well, get reported by STAR as unmapped ('too short'). This significantly reduces mapping rates to 40%-50% from the >90% values seen with aligners such as Tophat2 and Subread subjunc.

I hope you can tweak STAR to deal with this issue.

Thanks.



Alexander Dobin

Aug 24, 2015, 6:19:00 PM
to rna-star, twha...@gmail.com
Hi,

the problem with these reads is that they are not properly paired, as far as I understand the SAM conventions.
For instance, how do you calculate the insert size for them?
I think the best approach is to merge the sequences into a single-end read before the mapping.

Another question is, why would you want to trim reads on the 5'? I think this configuration can only happen if reads are trimmed on the 5'.
Adapter sequences should only appear on the 3'. Low quality sequence should not be very common on the 5', and you could simply let STAR trim it...
You could also try replacing the trimmed bases on the 5' with N - I think this will allow STAR to soft-trim them and call them properly paired.
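
Masking instead of removing the 5' bases keeps the read at its original length. A minimal sketch, assuming the untrimmed reads are still available and that qualities are Phred+33 encoded (where `#` is quality 2); `mask_5prime` is a hypothetical helper name.

```python
def mask_5prime(seq, qual, n_bases, low_quality="#"):
    """Replace the first n_bases of a read with N and give them a low quality,
    instead of removing them, so the read keeps its original length."""
    n = min(n_bases, len(seq))
    return "N" * n + seq[n:], low_quality * n + qual[n:]


masked_seq, masked_qual = mask_5prime("ACGTACGTACGT", "IIIIIIIIIIII", 3)
print(masked_seq)
print(masked_qual)
```

Whether STAR then soft-clips the N bases and reports the pair as proper is the behaviour suggested above, not something this sketch can verify.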

I guess I could introduce an option to allow for the weird end configurations, however, it feels like kicking the problem down the road.

Cheers
Alex

snt...@gmail.com

Aug 26, 2015, 11:19:27 PM
to rna-star, twha...@gmail.com
Alex, thanks for the suggestion to merge the PE reads to have SE reads. I may try it.

Regarding your questions, I want to be able to use pre-trimmed reads for a few reasons: e.g., sometimes, these are the only read data that I have, and sometimes, I need to use the same input with multiple RNA-seq aligners. Many times the reads are trimmed at the 5' end because of the Illumina platform's 5' nucleotide composition bias (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2896536/); sometimes this is because the sequencing did not produce data of great quality but one still wants to use it after both 5' and 3' trimming of poor quality regions of the reads.

Hubert Rehrauer

Aug 29, 2015, 4:33:38 AM
to snt...@gmail.com, rna-star, twha...@gmail.com
Hi Scott

I can completely understand your situation. In the end my solution was to trim 3 bases from the read start, do adapter trimming at the end and additionally trim 5 bases at the end. We got high mapping rates with that. The additional fixed trimming at the end was necessary because if the contained adapter was very short some adapters were missed.

Merging the overlapping R1 and R2 reads might also be a solution but I doubt that every downstream software does handle bam files that contain mixed single-end and paired-end reads correctly.

cheers
hubert



Felix Schlesinger

Nov 16, 2015, 8:35:58 PM
to rna-star, Alexander Dobin
Hi,

I have come across this issue in a different context. Instead of trimming bases off the 5' end of read1, read2 may have a few adapter bases at its 3' end; if one (or more) of those randomly matches the genome, the 'not proper pair' situation described here occurs and no alignment is reported.
I think allowing these alignments could make sense. The template length field can still be set according to the BAM spec as the number of bases from the leftmost mapped base to the rightmost mapped base of the pair. I am not sure though if the proper pair flag should be set.

Prepending a few 'N' to the beginning of read1 already allows these pairs to be aligned, even though mapPos of r1 is > r2 (on the fwd strand).
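
The leftmost-to-rightmost definition of template length mentioned above still gives a well-defined number for these pairs. A small sketch, assuming 0-based half-open intervals; the helper name and the example coordinates (a pair whose + strand mate starts 4 bases downstream of its partner) are hypothetical, and only the magnitude is computed, not the sign that SAM assigns to each mate.

```python
def template_length(mate1, mate2):
    """Span from the leftmost to the rightmost mapped base of a pair,
    with each mate given as a half-open (start, end) interval."""
    leftmost = min(mate1[0], mate2[0])
    rightmost = max(mate1[1], mate2[1])
    return rightmost - leftmost


# + strand mate at (9, 125), - strand mate at (5, 121).
tlen = template_length((9, 125), (5, 121))
print(tlen)
```

The value covers every mapped base of the pair, although it no longer equals the distance between the two 5' ends, which is the usual notion of insert size.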

Thoughts?
  Felix

Alexander Dobin

Nov 17, 2015, 6:05:21 PM
to rna-star, do...@cshl.edu
Hi Felix,

allowing these weird configurations might indeed be helpful in many cases, however, I am afraid it may allow output for an even bigger number of false alignments.
I was prompted to code this filter because I saw a number of these cases that were obviously wrong. So it's a question of sensitivity/precision trade-off, as usual.
In the case of adapters, of course, the best strategy is to trim the adapters.
I will code the option to allow the weird alignments in the next release.

Cheers
Alex

Felix Schlesinger

Nov 18, 2015, 7:28:24 PM
to rna-star, do...@cshl.edu
Hi,

yeah, I do not know the best answer for those types of read pairs in general either. Adapters cannot always be trimmed with 100% efficiency, and in PCR or other amplicon assays all kinds of strange things can happen. Obviously this is not the majority use case and not something STAR would have to support, but an option could be useful. I guess the reads would be marked as 'not proper paired'?

Alexander Dobin

Nov 19, 2015, 5:03:42 PM
to rna-star, do...@cshl.edu
Hi Felix,

I guess I could also make it an option to mark them as not properly paired or not.

Cheers
Alex

Felix Schlesinger

Nov 19, 2015, 5:16:26 PM
to rna-star, do...@cshl.edu
The BAM standard is very vague about what exactly 'proper paired' means, so I think either way is possible. Since the reads do not exactly match the orientation that is expected from a regular pair not setting the flag seems reasonable to me.

Felix

marc.rober...@gmail.com

Jan 7, 2016, 12:30:53 PM
to rna-star, pa...@biomed.lu.lv
Hi Alex,


First of all, Happy New Year !

I am working on tumor RNA-Seq data and I am facing the same issue.
I have 50 million PE reads as input, and with the default STAR settings only 0.2% are mapped.
Using your suggestion (--outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 40), I reach 99% mapped reads, but only 26% are uniquely mapped, the rest being multi-mapped.
I want to perform HTSeq count on the STAR output, and my guess is that it is going to use only the uniquely mapped reads, and 26% is not much...
I think that if I loosen the constraints even more than you suggested, I am just going to increase the number of multi-mapping reads, so I have 3 questions:

- How do you choose the best constraints according to your input read length (100) and your average number of mapping bp (80 -> 50)?
- In that context, how could I increase the proportion of uniquely mapping reads?
- How can I know that my STAR output is not completely crap if I loosen the constraints too much?

Thank you very much in advance for your feedback,


Best,
Marc

Alexander Dobin

Jan 7, 2016, 3:20:24 PM
to rna-star, pa...@biomed.lu.lv
Hi Marc,

this is a frequent question, the most recent discussion is here https://groups.google.com/d/msg/rna-star/cvaIAgpCuXQ/Yf6JsPOvCAAJ
My philosophy is that, rather than trying to tweak mapping parameters to increase the mapping, it's better to understand why the mapping rate is so low.

In your case, mapping rate of 0.2% is very small, so it's likely that there is some serious problem with the data - most frequently it's a problem with paired ends.
For instance, the inconsistent order of reads in two files could cause it.

First thing I would recommend is to map read1 and read2 separately. If the mapping rate gets better (please post Log.final.out files), it will tell us that this is a problem with paired files.

Cheers
Alex

marc.rober...@gmail.com

Jan 8, 2016, 12:49:59 PM
to rna-star, pa...@biomed.lu.lv
Hi Alex, 

Thank you very much for your feedback !
You were right, when mapping R1 and R2 separately, here is the result :

                                 Started job on |       Jan 08 12:18:38
                             Started mapping on |       Jan 08 12:24:07
                                    Finished on |       Jan 08 12:40:14
       Mapping speed, Million of reads per hour |       186.75

                          Number of input reads |       50162442
                      Average input read length |       50
                                    UNIQUE READS:
                   Uniquely mapped reads number |       45580199
                        Uniquely mapped reads % |       90.87%
                          Average mapped length |       49.86
                       Number of splices: Total |       2135536
            Number of splices: Annotated (sjdb) |       2057033
                       Number of splices: GT/AG |       2111366
                       Number of splices: GC/AG |       16160
                       Number of splices: AT/AC |       1618
               Number of splices: Non-canonical |       6392
                      Mismatch rate per base, % |       0.31%
                         Deletion rate per base |       0.01%
                        Deletion average length |       1.41
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.25
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       3364739
             % of reads mapped to multiple loci |       6.71%
        Number of reads mapped to too many loci |       231406
             % of reads mapped to too many loci |       0.46%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       1.40%
                     % of reads unmapped: other |       0.57%


So sorting the fastq files before running STAR will fix this..?

Thank you very much,

Marc

Alexander Dobin

Jan 8, 2016, 1:42:28 PM
to rna-star, pa...@biomed.lu.lv
Hi Marc,

if both of the reads have such good mapping statistics when mapped separately, then it's likely that the order of reads was mixed up.
You can check that by looking at the first few lines of your fastq files - the names of the reads (lines 1, 5, 9, ...) should be exactly the same in both files, except for possible /1 /2 markers at the end.
If the names are not the same, then the read order is wrong and you need to fix it.
There are some recommendations how to do it here: https://groups.google.com/d/msg/rna-star/FCGeTgApDhU/leIW8_QTMwMJ
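
The name check described above can be automated. The following is a minimal sketch under the assumption of plain four-line FASTQ records; `read_name` and `first_mismatch` are hypothetical helpers, and the demo records are invented.

```python
import itertools


def read_name(header):
    """Read name without the comment after whitespace and without /1 or /2."""
    parts = header.split()
    name = parts[0] if parts else ""
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def first_mismatch(fastq1_lines, fastq2_lines):
    """Return the 1-based index of the first record whose names differ
    between the two files, or None if all names agree."""
    headers1 = itertools.islice(fastq1_lines, 0, None, 4)  # lines 1, 5, 9, ...
    headers2 = itertools.islice(fastq2_lines, 0, None, 4)
    pairs = itertools.zip_longest(headers1, headers2)
    for index, (h1, h2) in enumerate(pairs, start=1):
        if h1 is None or h2 is None or read_name(h1) != read_name(h2):
            return index
    return None


r1 = ["@a/1\n", "ACGT\n", "+\n", "IIII\n", "@b/1\n", "ACGT\n", "+\n", "IIII\n"]
r2_good = ["@a/2\n", "TTTT\n", "+\n", "IIII\n", "@b/2\n", "TTTT\n", "+\n", "IIII\n"]
r2_bad = ["@b/2\n", "TTTT\n", "+\n", "IIII\n", "@a/2\n", "TTTT\n", "+\n", "IIII\n"]

print(first_mismatch(r1, r2_good))
print(first_mismatch(r1, r2_bad))
```

Open file handles can be passed in place of the lists, which avoids loading whole files into memory; a file with extra or missing records is also reported as a mismatch.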

Cheers
Alex

Alexander Dobin

Mar 3, 2016, 6:10:56 PM
to rna-star, do...@cshl.edu
Hi All,

I have implemented the option to allow output of the PE alignments with "protruding" mate ends, i.e. start (end) of the +strand mate downstream of the start (end) of the -strand mate.
Please check it out from GitHub master with
--alignEndsProtrude 10 ConcordantPair
where the first number is the maximum number of protruding bases, and the second word, ConcordantPair (or DiscordantPair), sets (or does not set) the concordance bit 0x2 in the SAM flag.
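
How the two arguments interact can be sketched as a small decision function. This is a model of the behaviour described in this post, not STAR's implementation; `protruding_pair_decision` and its return convention are assumptions made for illustration.

```python
PROPER_PAIR_FLAG = 0x2  # SAM flag bit for a properly paired alignment


def protruding_pair_decision(protruding_bases, max_protrude=10,
                             mode="ConcordantPair"):
    """Return (is_output, flag_bits) for a pair whose + strand mate protrudes
    past its partner by protruding_bases."""
    if protruding_bases > max_protrude:
        return False, 0              # too much protrusion: not reported
    if protruding_bases > 0 and mode == "DiscordantPair":
        return True, 0               # reported, but bit 0x2 left unset
    return True, PROPER_PAIR_FLAG    # reported as a concordant pair


print(protruding_pair_decision(4))
print(protruding_pair_decision(4, mode="DiscordantPair"))
print(protruding_pair_decision(12))
```

With the settings shown above, a pair protruding by 4 bases would be reported with the 0x2 bit set, while one protruding by 12 bases would still be dropped.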

Cheers
Alex

Maryam Labaf

Oct 17, 2018, 8:53:37 AM
to rna-star
Hi everyone.

I am doing my first RNA-seq alignment and I used STAR. For the indexing I used Ensembl, and with all the default STAR mapping parameters I got 45% uniquely mapped and ~50% unmapped: too short. I changed some of the STAR mapping parameters as follows:


module load gcc/8.1.0 star/2.5.3a
STAR --runThreadN 12 --seedSearchStartLmax 50 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 0 \
--genomeDir /path/to/STAR_index \
--readFilesIn /path/to/fastq_R1 /path/to/fastq_R2

And I got 66% uniquely mapped and 34% mapped to multiple loci. The reads in the fastq files are 51 bp long, and I have attached the FastQC files here. From the FastQC report it seems there is no adaptor that I need to trim, and when I used trim_galore, it trimmed some of the standard adaptors but did not improve the mapping rates.

Could anyone here help me with this?
Also, the Log.final.out for the run without trimming is attached.

                                 Started job on |       Oct 16 00:34:49
                             Started mapping on |       Oct 16 00:35:49
                                    Finished on |       Oct 16 00:53:06
       Mapping speed, Million of reads per hour |       132.64

                          Number of input reads |       38208727
                      Average input read length |       102
                                    UNIQUE READS:
                   Uniquely mapped reads number |       25108416
                        Uniquely mapped reads % |       65.71%
                          Average mapped length |       81.28
                       Number of splices: Total |       5025010
            Number of splices: Annotated (sjdb) |       4895898
                       Number of splices: GT/AG |       4960736
                       Number of splices: GC/AG |       38904
                       Number of splices: AT/AC |       3710
               Number of splices: Non-canonical |       21660
                      Mismatch rate per base, % |       0.72%
                         Deletion rate per base |       0.01%
                        Deletion average length |       1.12
                        Insertion rate per base |       0.00%
                       Insertion average length |       1.18
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       12930774
             % of reads mapped to multiple loci |       33.84%
        Number of reads mapped to too many loci |       165424
             % of reads mapped to too many loci |       0.43%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       0.00%
                     % of reads unmapped: other |       0.01%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%



Thanks in advance!

 

Alexander Dobin

Oct 19, 2018, 1:44:12 PM
to rna-star
Hi Maryam,

what is the Log.final.out output with default parameters?
Couple of things to try:
1. Map the read1 and read2 separately
2. BLAST a few of the unmapped reads to see if there is contamination by other species.

Cheers
Alex
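
For suggestion 2, here is a minimal sketch of how a few unmapped reads could be pulled out for BLAST. It assumes STAR was re-run with --outReadsUnmapped Fastx, so that unmapped mate-1 reads land in Unmapped.out.mate1, and that the FASTQ is unwrapped (4 lines per record). The helper name fastq_head_to_fasta and the output file name are made up for illustration.

```shell
# Convert the first N reads of a FASTQ file to FASTA for pasting into BLAST.
# fastq_head_to_fasta is a hypothetical helper name, not part of STAR.
fastq_head_to_fasta() {
    fastq=$1
    n=${2:-10}   # number of reads to extract (default 10)
    # Each FASTQ record is 4 lines: header, sequence, '+', qualities.
    head -n "$((n * 4))" "$fastq" |
        awk 'NR % 4 == 1 { print ">" substr($0, 2) }
             NR % 4 == 2 { print }'
}

# Example (assumes STAR was run with --outReadsUnmapped Fastx):
# fastq_head_to_fasta Unmapped.out.mate1 20 > unmapped_sample.fa
```

The resulting FASTA can be pasted into the NCBI BLAST web form; if most hits come from another species, that points to contamination rather than a mapping parameter problem.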

Maryam Labaf

Nov 26, 2018, 2:53:42 PM
to rna-star
Hi,

I hope you are doing well. I have a question about improving the uniquely mapped rate for samples from urine.
In general the data quality is not very good, so I do not expect a very high uniquely mapped rate. However, I am trying to get the best out of it.

Here is some of the information on how I ran indexing and mapping in STAR.

Indexing scripts:
GENOME_DIR="/path/to/urine2_index"
GENOME="/path/to/genome/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
GTFDIR="/path/to/gtf/Homo_sapiens.GRCh38.94.gtf"

STAR --runThreadN 50 --runMode genomeGenerate \
--genomeDir ${GENOME_DIR} \
--genomeFastaFiles ${GENOME} \
--sjdbGTFfile ${GTFDIR} \
--sjdbOverhang 50

And the STAR_mapping scripts:
module unload star
module load gcc/8.1.0 star/2.5.3a

GENOMEDIR="/path/to/urine2_index"
GTF_DIR="/home/ml98b/RNA_seq_test/gtf/Homo_sapiens.GRCh38.94.gtf"
FA1="/path/to/urine2_fastq_trimed/122_R_GTCCGC_R1_val_1.fq.gz"
FA2="/path/to/urine2_fastq_trimed/122_R_GTCCGC_R2_val_2.fq.gz"

STAR --runThreadN 18 \
--genomeDir $GENOMEDIR \
--readFilesCommand zcat \
--outFilterMultimapNmax 20 --alignSJoverhangMin 8 \
--outSAMattributes NH --outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.3 --alignIntronMin 20 \
--alignMatesGapMax 1000000 --alignIntronMax 1000000  \
--outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --outFilterMatchNmin 0 \
--readFilesIn ${FA1}  ${FA2} \
--outSAMtype BAM SortedByCoordinate

Part of the Log.progress.out file:

           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
Nov 26 14:24:18      5.0      256910       58    49.1%     40.4     3.1%    48.8%     0.6%     0.0%     1.5%     0.0%
Nov 26 14:25:29     18.2     1291114       58    49.3%     40.4     3.1%    48.7%     0.6%     0.0%     1.4%     0.0%
Nov 26 14:26:45     22.4     2062274       58    49.3%     40.4     3.1%    48.6%     0.6%     0.0%     1.5%     0.0%
Nov 26 14:27:55     32.4     3608011       58    49.3%     40.4     3.1%    48.6%     0.6%     0.0%     1.5%     0.0%
Nov 26 14:28:56     46.2     5923751       58    49.3%     40.4     3.1%    48.6%     0.6%     0.0%     1.5%     0.0%
Nov 26 14:29:56     49.7     7211014       58    49.3%     40.4     3.1%    48.6%     0.6%     0.0%     1.5%     0.0%
Nov 26 14:31:26     48.0     8157479       58    49.3%     40.4     3.1%    48.6%     0.6%     0.0%     1.5%     0.0%
Nov 26 14:32:36     53.9    10211334       58    49.3%     40.4     3.1%    48.6%     0.6%     0.0%     1.5%     0.0%

Attached are the FastQC reports before and after trimming. I am looking for any other parameters that I can modify to get a uniquely mapped rate a little higher than 50%. I really appreciate your help. Thank you.


Regards,
Maryam
122_R_GTCCGC_R1_fastqc.html
122_R_GTCCGC_R1_val_1_fastqc.html

Alexander Dobin

Dec 1, 2018, 10:41:50 AM
to rna-star
Hi Maryam,

most of the reads that do not map uniquely map as multimappers; only 1.5% of the reads do not map at all.
Multimappers cannot be made into unique mappers by changing mapping parameters. 
Since you are using relaxed parameters for mapped length: 
--outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33
these multimappers are likely to be short pieces of the reads.

I think the only way to deal with it on the computational side is to include multimappers in your downstream analysis. For instance, in short RNA-seq analysis the common approach is to consider only one randomly chosen alignment for each multimapper.

Cheers
Alex
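
To see the unique/multimapper split at a glance, something like the following could be run on Log.final.out once the job finishes. This is only a sketch: the helper name star_mapping_summary is made up, and it assumes the usual "label | value" layout of Log.final.out with the "Uniquely mapped reads %" and "% of reads mapped to multiple loci" rows.

```shell
# Print unique, multi-mapped and combined mapping percentages from a STAR Log.final.out.
# star_mapping_summary is a hypothetical helper name, not part of STAR.
star_mapping_summary() {
    awk -F'|' '
        # Strip whitespace and the trailing % sign from the value field.
        function value(s) { gsub(/[ \t%]/, "", s); return s + 0 }
        /Uniquely mapped reads %/            { unique = value($2) }
        /% of reads mapped to multiple loci/ { multi  = value($2) }
        END {
            printf "unique=%.2f multi=%.2f mapped_total=%.2f\n", unique, multi, unique + multi
        }' "$1"
}

# Example:
# star_mapping_summary Log.final.out
```

If mapped_total is close to 100% while unique is near 50%, the reads are mapping but not uniquely, which is the situation described above.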